The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

GILBERT STRANG: OK. So what I promised, and now I'm going to do it, is to talk about gradient descent and its descendants. So from the basic gradient descent formula, which we all know, let me just write that down: the new point is the old point, and we're going downwards, so with a minus sign, that's the step size, and we compute the gradient at x_k. So x_{k+1} = x_k minus s_k times grad f(x_k). We're descending in the direction of the negative gradient. And that's the basic formula, and it is studied in every book. So my main reference for some of these lectures is the book by Stephen Boyd and Lieven Vandenberghe. And I mention again, Professor Boyd is talking, in this room, next week Wednesday and Thursday, and he's speaking somewhere on Friday at 4:30, and of course, about optimization. And he's a good lecturer, yeah, very good.
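That basic update can be sketched in a few lines of code. This is a minimal illustration, not code from the course; the function names and the toy objective f(x, y) = x^2 + y^2 are my own choices.

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size, num_steps):
    """Basic gradient descent: x_{k+1} = x_k - s * grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - step_size * grad_f(x)
    return x

# Toy example: minimize f(x, y) = x^2 + y^2, whose gradient is (2x, 2y).
x_min = gradient_descent(lambda x: 2 * x, np.array([4.0, -3.0]),
                         step_size=0.1, num_steps=100)
```

With a fixed step size 0.1, each coordinate shrinks by the factor 1 - 0.2 = 0.8 per step, so the iterates converge to the minimizer at the origin.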
OK. So there's steepest descent, and I've redrawn my picture from last time. Now I'll go over there and look at that picture, but let me say what's coming. So that's pretty standard, very standard, you could say. Then this is the improvement that is widely used: adding in something called momentum, to avoid the zigzag that we're going to see over there. And there's another way to do it. There's a Russian mathematician named Nesterov. His papers are not easy to read, but they've got serious content. And one thing he did was find an alternative to momentum that also accelerated the descent. So these both produce faster descent than the ordinary one. OK. And then, you know, looking ahead, for problems of machine learning, they're so large that the gradient becomes a big computation: we have so many variables, all those weights are variables, and hundreds of thousands is not uncommon. So the gradient becomes a pretty big calculation, and we just don't have to do it all at once.
So x_k is a vector of all the weights, and our equations are matching the training data. But we don't have to use all the training data at once, and we don't. We could take a batch of training data as small as one sample, but that's sort of inefficient in the opposite direction, to do them one at a time. So we don't want to do them one at a time, but we don't want to do all million at a time. So the compromise is a mini-batch. Stochastic gradient descent does a mini-batch at a time: a mini-batch of training samples at each step. And it can choose them stochastically, meaning randomly, or more systematically, but we do a batch at a time. And that will come next week, after a marathon, of course, on Monday. OK. So let me just go back to that picture for a moment, but then the real content of today is this one, with momentum added. OK. I probably haven't got the picture perfect yet. I'm just not an artist, but I think I'm closer. So those are the level sets.
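A mini-batch step averages the per-sample gradients over a small random subset instead of all the data. Here is a minimal sketch of that idea on a made-up least-squares problem; the function name, batch size, and data are my own illustrative choices, not from the lecture.

```python
import numpy as np

def sgd_minibatch(grad_fi, n_samples, x0, step_size, batch_size, num_epochs, rng):
    """Stochastic gradient descent: each step uses the averaged gradient
    over a random mini-batch of samples instead of all n_samples."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_epochs):
        order = rng.permutation(n_samples)          # shuffle once per epoch
        for start in range(0, n_samples, batch_size):
            batch = order[start:start + batch_size]
            g = np.mean([grad_fi(x, i) for i in batch], axis=0)
            x = x - step_size * g
    return x

# Toy least squares: f_i(x) = (a_i . x - b_i)^2 / 2, so grad f_i = a_i (a_i . x - b_i).
rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true                                      # consistent system, no noise
x_hat = sgd_minibatch(lambda x, i: A[i] * (A[i] @ x - b[i]),
                      1000, np.zeros(3), 0.05, batch_size=32,
                      num_epochs=20, rng=rng)
```

Because the system is consistent, the mini-batch iterates settle down close to x_true even though each step sees only 32 of the 1000 samples.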
Those are the sets f(x) = constant. And in our model problem, f is x squared plus b y squared, so the level sets are x squared plus b y squared equal constant, with small b: b below 1, and maybe far below 1. So those are ellipses. That is the equation of an ellipse, and that's what I tried to draw. And if b is small, then the ellipses are long and thin like that. And now, what's the picture? You start with a point x0, and you descend in the steepest direction. So the steepest direction is perpendicular to the level set, right? Perpendicular to the ellipse. So you go down, down, down. You're passing through more ellipses, more ellipses, more ellipses. Eventually you're tangent to one; it seems to me it has to be tangent. I didn't read this, but it looks reasonable to me that the innermost level set, the innermost ellipse, is the one you're tangent to, and then you would start going up again. So that's the optimal point at which to end that step. And then where does the next step go? Well, you're here. You're on an ellipse. That's a level set.
You want to move in the gradient direction. That's perpendicular to the level set. So you're going down somewhere here, and you're passing again through more and more ellipses, until you're tangent to a smaller ellipse here. And you see the zigzag pattern. And that zigzag pattern is what we see, by formula, in Boyd's book, and many other places, too. The formula has those powers of the magic number. So we start at the point (b, 1) and follow this path. Then the x's are that same b times this quantity to the kth power, and here is that quantity: (b - 1)/(b + 1). So you see, for a small b, that's a negative number. So the sign is flipping in the x's, as we saw in the picture. At least that part of the picture is correct. The y's don't flip sign: y_k, I think, is ((1 - b)/(1 + b))^k, not flipping sign. So this was x_k = b((b - 1)/(b + 1))^k, and when k is 0, we got b. So that looks good. And then f_k, the value of f, also involves the same quantity: f_k is that same quantity to the kth power times f_0. So that quantity is all-important.
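Those closed-form zigzag formulas can be checked numerically. The sketch below runs exact-line-search steepest descent on the model problem f(x, y) = (1/2)(x^2 + b y^2) from the special start (b, 1); the choice b = 0.1 and the variable names are mine. For a quadratic (1/2) x^T S x, the exact line-search step length is (g.g)/(g.Sg) for gradient g.

```python
import numpy as np

# Model problem f(x, y) = 0.5*(x^2 + b*y^2), started at (b, 1).
# The claim: x_k = b*((b-1)/(b+1))^k (sign flips), y_k = ((1-b)/(1+b))^k (no flip).
b = 0.1
S = np.diag([1.0, b])
x = np.array([b, 1.0])
trajectory = [x.copy()]
for _ in range(10):
    g = S @ x                      # gradient of 0.5 * x^T S x
    s = (g @ g) / (g @ S @ g)      # exact line-search step length
    x = x - s * g
    trajectory.append(x.copy())
```

Each iterate lands exactly on the predicted point, with the x-coordinate alternating in sign and the y-coordinate shrinking steadily: the zigzag in formulas.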
And so the purpose of today's lecture is to tell you what the momentum term, what improvement, what change it brings to the basic steepest descent formula. I'm going to add on another term, which is going to give us some memory of the previous step. And when I do that, I want to track that kind of descent for the new, accelerated descent, and see what improvement the momentum term brings. And so the final result will be to tell you the improvement produced by the momentum term. Maybe while I have your attention, I'll tell you what it is now. And then will come the details, the algebra. And to me, and this is my own thought, it's a miracle that the algebra, which is straightforward, really shows you the value of eigenvectors. We explained eigenvectors in class, but here you see how to use them. That is really a good exercise. But to me it's a miracle that the expression with momentum is very much like that expression, but different, of course.
The decay term, the term that tells you how fast the decay is, is smaller. So you're taking its kth power. So let me write that down, if that's all right. I didn't plan to reveal the final result at the beginning of the lecture, but I think you want to see where we're going. So with momentum, and we have to see what that means, this term (1 - b)/(1 + b) changes to (1 - sqrt(b))/(1 + sqrt(b)). So I mentioned that before, but I don't think I wrote it down as clearly. So the miracle to me is to get such a nice expression, because you'll see the algebra works, but it involves more terms because of momentum; it involves doing a minimization over eigenvalues, and yet it comes out nicely. And then you have to see the importance of that. So let me just take the same example that I mentioned before. If b is 1/100, then this is 0.99/1.01. And I think that, for f, there's a square here: the exponent is 2k.
So I'll just keep the square there, no big change, but I'm looking now at the square; maybe squares are everywhere. OK. So (0.99/1.01)^{2k}: that ratio is close to 1. And now let's compare that with what we have with momentum. So if b is 1/100, then the square root of b is 1/10, so this is (0.9/1.1)^{2k}. And there's a tremendous difference: that one is a lot smaller than this one. Right: 9/11, compared to 99/101. This reduction factor is well below that one. So it's a good thing. It's worth doing. And now what does it involve? So I'll write down the expression for the descent with momentum. Here we go. OK. So here's one way to see it. The new x is the old x, minus a step. And now comes an extra term, which gives us a little memory. Well, the algebra is slightly nicer if I write it a little bit differently. I'll create a new quantity, z_k, multiplied by the step size: x_{k+1} = x_k - s z_k. OK.
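That gap between 99/101 and 9/11 per step is worth quantifying. A quick back-of-the-envelope calculation (my own framing of the lecture's b = 1/100 example): how many steps does each factor need to shrink the error by a million?

```python
import math

b = 1 / 100
plain = (1 - b) / (1 + b)                           # 0.99/1.01, ordinary descent
momentum = (1 - math.sqrt(b)) / (1 + math.sqrt(b))  # 0.9/1.1, with momentum

# Steps k needed so that factor**k <= 1e-6:
steps_plain = math.ceil(math.log(1e-6) / math.log(plain))
steps_momentum = math.ceil(math.log(1e-6) / math.log(momentum))
```

Ordinary descent needs roughly 691 steps to gain six digits; the momentum factor needs only about 69, a tenfold speedup that matches replacing b by its square root.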
So if I took z_k to be just the gradient, that would be steepest descent; nothing has changed. But instead, I'm going to take z_k whose leading term will be the gradient, but here comes the momentum term: I add on a multiple beta of the previous z. So z_k = grad f(x_k) + beta z_{k-1}. So z is the search direction; z is the direction you're moving. So it's different from that direction there. That direction was the gradient. This direction is the gradient corrected by a memory term, a momentum term. And one way to interpret that is to think of a heavy ball, instead of just a point. I think of a heavy ball. It, instead of bouncing back and forth as uselessly as this one, still bounces, of course, off the sides of the level sets, but it comes down the valley faster. And that's the effect of this term. So you could play with different adjustment terms, different corrections. So I'll follow through this one.
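The heavy-ball update can be sketched directly. This is a minimal illustration on the model problem, not course code; the parameter choices s = (2/(1 + sqrt(b)))^2 and beta = ((1 - sqrt(b))/(1 + sqrt(b)))^2 are the standard optimal heavy-ball values for eigenvalues in [b, 1], stated here as an assumption rather than derived.

```python
import numpy as np

def heavy_ball(grad_f, x0, s, beta, num_steps):
    """Gradient descent with momentum (heavy ball):
       x_{k+1} = x_k - s * z_k,   z_{k+1} = grad f(x_{k+1}) + beta * z_k."""
    x = np.asarray(x0, dtype=float)
    z = grad_f(x)                        # z_0 is just the gradient
    for _ in range(num_steps):
        x = x - s * z
        z = grad_f(x) + beta * z
    return x

# Model problem f = 0.5*(x^2 + b*y^2) with b = 0.01, gradient S x.
b = 0.01
S = np.diag([1.0, b])
x_mom = heavy_ball(lambda x: S @ x, np.array([b, 1.0]),
                   s=(2 / (1 + np.sqrt(b))) ** 2,
                   beta=((1 - np.sqrt(b)) / (1 + np.sqrt(b))) ** 2,
                   num_steps=200)
```

With these parameters the error contracts roughly like (0.9/1.1)^k per step, so after 200 steps the iterate is extremely close to the minimizer at the origin.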
Nesterov had another way to make a change in the formula, and there are certainly others beyond that. OK, so how do we analyze that one? Well, the real point is that, by involving the previous step, we now have a three-level method instead of a two-level method, you could say. Steepest descent involves only level k+1 and level k. The formulas now involve k+1, k, and k-1. It's just like going from a first-order differential equation to a second-order differential equation. I'm not really thinking that k is a time variable, but in the analogy, k could be a time variable. So here we had a first-order equation: if I wanted to model that, there's sort of a dx/dt coming in there, equal to minus the gradient. And these models are highly useful, and developed, as a sort of continuous model of steepest descent: a continuous motion instead of the discrete motion. OK. So that continuous model, for that guy, would be first order in time. For this one, it'll be second order in time.
And second-order equations, of course, and there'd be constant coefficients in our model problem. And the thing about a second-order equation that we all know is, there is a momentum term, a damping term, you could say, multiplying the first derivative. So that's what a second-order equation offers: the inclusion of a damping term which isn't present in the original first order. OK. So how do we analyze this? So how do you analyze second-order differential equations? You write them as a system of two first-order equations. So that's exactly what we're going to do here, in the discrete case, because we have two equations, and they're first order. Let me play with them for a moment to make them good. OK. So this will go to two first-order equations, in which the first one I'm just going to copy: x_{k+1} is x_k minus the step size s times z_k. Yeah. OK.
The previous time step is on the right; the next time step is on the left. OK, so I just copied that. Now in the second equation I'm going to increase k by 1. So in order to have it match this one, I'll write it as z_{k+1}, and I'll bring the gradient over: z_{k+1} minus grad f_{k+1} equals beta z_k. Does that work for you? In this equation, instead of looking at it at step k, I went to k+1, and I put the k+1 terms on one side. OK. So now, let's remember the model we're doing: f equals one-half x transpose S x. So the gradient of f is Sx. So what I've written there for the gradient is really S x_{k+1}. OK. How to analyze that? What happens as k travels forward: 1, 2, 3, 4, 5? We have a constant-coefficient problem at every step. The (x, z) variable is getting multiplied by a matrix. So here's (x, z) at step k+1, and over here will be (x, z) at step k.
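Those two first-order equations can be checked per eigenvector of S. Writing x_k = c_k q and z_k = d_k q for an eigenvector q with eigenvalue lam, one momentum step should satisfy the matrix identity below. This is a small numerical sanity check with made-up values of lam, s, beta, c, d.

```python
import numpy as np

# One momentum step in the eigenvector coefficients (c_k, d_k):
#   c_{k+1} = c_k - s * d_k                (x_{k+1} = x_k - s z_k)
#   d_{k+1} - lam * c_{k+1} = beta * d_k   (z_{k+1} - grad f_{k+1} = beta z_k)
# i.e.  [[1, 0], [-lam, 1]] @ (c_{k+1}, d_{k+1}) = [[1, -s], [0, beta]] @ (c_k, d_k)
lam, s, beta = 0.25, 0.3, 0.5
c, d = 2.0, -1.0                           # coefficients at step k
c_next = c - s * d
d_next = lam * c_next + beta * d

left = np.array([[1.0, 0.0], [-lam, 1.0]]) @ np.array([c_next, d_next])
right = np.array([[1.0, -s], [0.0, beta]]) @ np.array([c, d])
```

The two sides agree, confirming that the three-level momentum method is exactly a two-level system in the pair (c, d).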
And I just have to figure out what matrix is multiplying here and here. OK. And I guess here I see it. The first equation has a 1 and a minus s, looks like, in the first row, and it has a beta in the second row. And here, on the left, the first equation has a 1, 0 in that row, and then a minus lambda; I'll put in the minus lambda multiplying x_{k+1}, and then the 1 that multiplies z_{k+1}. Is that all right? Sorry, I've got two different S's, the matrix S and the step size s, and I didn't write that one in large enough, and I'd planned to erase it anyway. This one is the step size; this one is the matrix. But it's not quite fitting its place. This is the point where I'm going to use eigenvalues. I'm going to follow each eigenvector; that's the whole point. When I follow each eigenvalue, each eigenvector, I should say, I'll follow each eigenvector of S. So let's do that. So, eigenvectors of S: what are we going to call those? Lambda, probably, for the eigenvalues. So Sx equal lambda x; I think that's what's coming. Or q.
To do things right, I want to remember that S is a positive definite symmetric matrix. That's why I call it S, instead of A. So I really should call the eigenvector something else; it doesn't matter, but to be on the ball, let me call the eigenvector q and the eigenvalue lambda. OK. So now I want to follow this eigenvector. So I'm supposing that x_k is some coefficient c_k times q; I'm assuming that x is tracking this eigenvector. And I'm going to assume that z_k is some other coefficient d_k times q. Everybody, do you see? That's a vector and that's a vector, and I want scalars. I want to track just the scalars c_k and d_k. So that's really what I have here. This was a little tricky, because x here is a vector with n components. I didn't want that. I really wanted just to track an eigenvector. Once I've settled on the direction q, all vectors are in the direction of q, so we just have numbers c and d to track. OK. So I'm going to rewrite this correctly. Well, let me keep going with this little formula.
Then what will S x_k be? I needed an S x. If x_k is in the direction of the eigenvector q, with coefficient c_k, what happens when I multiply by S? q was an eigenvector, so multiplying by S gives me a...

AUDIENCE: Eigenvalue.

GILBERT STRANG: Eigenvalue, right? So it's c_k lambda q. Everything is a multiple of q, and it's only those multiples I'm looking for, the c's and the d's. And then the lambda comes into the S term. Yeah, I think that's probably all I need to do this. And then the gradient, yeah. So that's the gradient, of course: this is the gradient of f at step k. OK. So instead of this, let me just write what's happening if I'm tracking the coefficients c_{k+1} and d_{k+1}. Then what I really meant to have there is 1, 0 in the first row, and the minus S becomes a minus lambda. Is that right? Yeah. When I multiply the eigenvector by S, I'm just getting, oh, it's a lambda times a c_k. Yeah. Lambda times the c_k; that's good.
I think that that's the left-hand side of my equation. And on the right-hand side, I have a 1 here, and this was the scalar, the step size, and this was the other coefficient, the beta. So I want to choose... what's my purpose now? That gives me what happens at every step to the c and d. So I want to choose the two things that I'm free to choose: s and beta. So that's my big job: choose s and beta. OK. Now, to make this cleaner, let me just shape this by multiplying by the inverse of that left-hand matrix, and get it over here. Then you'll see everything. So c_{k+1}, d_{k+1} equals... What's the inverse of that matrix? For a moment I thought it would have a tough time finding an inverse, but that entry was a 1, wasn't it? It's triangular with 1's on the diagonal. Yeah. OK. So I'm going to multiply by the inverse of that matrix to get it over here. And what's the inverse of the matrix with 1's on the diagonal and minus lambda below? It's the same thing with plus lambda below.
So the inverse brought it over here: that inverse, with 1's on the diagonal and lambda below, times this matrix with 1, minus s in the first row and 0, beta in the second. That's what multiplies (c_k, d_k). So we have these simple, beautiful steps, which come from tracking one eigenvector; that makes the whole problem scalar. So I multiply those two matrices, and I finally get the matrix that I really have to think about. The first row, 1, 0, times that gives 1 and minus s. The second row, lambda, 1, gives a lambda there, and then minus lambda s plus beta: beta minus lambda s. So the matrix is [[1, -s], [lambda, beta - lambda s]]. That's the matrix that we see at every step. Let me call that matrix R. So I've done some algebra, more than I would always do in a lecture, but it's really... I wouldn't do it if it wasn't nice algebra. What's the conclusion? The conclusion is that with the momentum term, with this number beta available to choose, as well as s, the step size, the coefficients of the eigenvector are multiplied at every step by that matrix R. And of course, that matrix involves the eigenvalue. So we have to think about: what do we want to do now?
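That step matrix R is easy to build and examine numerically. This sketch, with arbitrary illustrative values of lam, s, beta, checks the property that governs convergence: the coefficients (c, d) decay exactly when both eigenvalues of R have magnitude below 1. A nice side fact visible here is that det(R) = beta, since the diagonal product minus the off-diagonal product is (beta - lam*s) + lam*s.

```python
import numpy as np

def R_matrix(lam, s, beta):
    """One momentum step multiplies the eigenvector coefficients (c, d) by
       R = [[1, 0], [-lam, 1]]^{-1} @ [[1, -s], [0, beta]]."""
    return np.array([[1.0, -s],
                     [lam, beta - lam * s]])

# Convergence along this eigenvector requires spectral radius of R below 1.
lam, s, beta = 1.0, 0.25, 0.2
rho = max(abs(np.linalg.eigvals(R_matrix(lam, s, beta))))
```

For these values rho comes out below 1, so both coefficients decay geometrically at rate rho per step.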
472 00:29:17,110 --> 00:29:23,780 We want to choose beta and S to make 473 00:29:23,780 --> 00:29:26,740 R as small as possible, right? 474 00:29:26,740 --> 00:29:29,350 We want to make R as small as possible. 475 00:29:29,350 --> 00:29:34,055 And we are free to choose beta and S, but R depends on lambda. 476 00:29:36,780 --> 00:29:39,360 So I'm going to make it as small as possible 477 00:29:39,360 --> 00:29:42,240 over the whole range of possible lambdas. 478 00:29:42,240 --> 00:29:45,840 So let me-- so now here we really go. 479 00:29:49,410 --> 00:29:55,740 So we have lambda between some bounds. 480 00:29:55,740 --> 00:30:04,520 These are the eigenvalues of S. And what we know-- 481 00:30:04,520 --> 00:30:09,100 what's reasonable to know-- is a lower bound. 482 00:30:09,100 --> 00:30:10,160 It's positive. 483 00:30:10,160 --> 00:30:13,250 This is a symmetric positive definite matrix. 484 00:30:13,250 --> 00:30:20,880 A lower bound and an upper bound, for example, m was b, 485 00:30:20,880 --> 00:30:25,880 and M was 1, in that 2 by 2 problem. 486 00:30:25,880 --> 00:30:28,310 And this is what we know, that the eigenvalues 487 00:30:28,310 --> 00:30:38,850 are between m and M. And the ratio of M to m-- 488 00:30:38,850 --> 00:30:42,000 well, if I write-- 489 00:30:45,380 --> 00:30:50,880 this is the key quantity. 490 00:30:50,880 --> 00:30:53,020 And what's it called? 491 00:30:53,020 --> 00:30:55,675 Lambda max divided by lambda min is the-- 492 00:30:55,675 --> 00:30:56,800 AUDIENCE: Condition number. 493 00:30:56,800 --> 00:30:57,730 GILBERT STRANG: Condition number. 494 00:30:57,730 --> 00:30:58,230 Right. 495 00:30:58,230 --> 00:31:00,910 This is sometimes written kappa-- 496 00:31:00,910 --> 00:31:10,420 Greek letter kappa-- the condition number of S. 497 00:31:10,420 --> 00:31:14,830 And when that's big, then the problem is going to be harder. 
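The job "make R as small as possible over the whole range of lambdas" can be tried numerically. In this Python sketch (my own illustration; the grid resolution and the sample bounds m = 0.01, M = 1 are arbitrary choices, not from the lecture), a crude search picks the pair (s, beta) that minimizes the worst spectral radius of R over lambda in [m, M]:

```python
import cmath

def spectral_radius(lam, s, beta):
    # largest |eigenvalue| of R = [[1, -s], [lam, beta - lam*s]]
    trace = 1.0 + beta - lam * s
    disc = cmath.sqrt(trace * trace - 4.0 * beta)  # det(R) = beta
    return max(abs((trace + disc) / 2), abs((trace - disc) / 2))

def worst_rate(s, beta, m, M, samples=21):
    """Worst factor per step over a sample of lambdas in [m, M]."""
    lams = [m + (M - m) * i / (samples - 1) for i in range(samples)]
    return max(spectral_radius(lam, s, beta) for lam in lams)

# crude grid search: sample eigenvalue bounds give kappa = 100
m, M = 0.01, 1.0
best_rate, best_s, best_beta = min(
    (worst_rate(0.2 * i, 0.05 * j, m, M), 0.2 * i, 0.05 * j)
    for i in range(1, 21) for j in range(19))
```

A fine enough grid lands near the optimal pair written down a little later in the lecture, and the best rate it finds is already well below what plain gradient descent can achieve for this condition number.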
498 00:31:14,830 --> 00:31:19,780 When that's 1, then my matrix is just a multiple 499 00:31:19,780 --> 00:31:21,260 of the identity matrix. 500 00:31:21,260 --> 00:31:22,480 And the problem is trivial. 501 00:31:22,480 --> 00:31:27,710 When capital M and small m are the same, 502 00:31:27,710 --> 00:31:31,810 then that's saying that the largest and smallest 503 00:31:31,810 --> 00:31:34,840 eigenvalues are identical, that the matrix is 504 00:31:34,840 --> 00:31:36,730 a multiple of the identity. 505 00:31:36,730 --> 00:31:39,310 That's condition number 1. 506 00:31:39,310 --> 00:31:47,980 But the bad one is when it's 1 over b, in our example, 507 00:31:47,980 --> 00:31:51,790 and that could be very large. 508 00:31:51,790 --> 00:31:52,540 OK. 509 00:31:52,540 --> 00:31:56,680 That's where we have our problem. 510 00:31:56,680 --> 00:32:05,830 Let me just insert a word about the ordinary gradient descent. 511 00:32:05,830 --> 00:32:11,470 Of course, the textbooks find an estimate for how fast that is. 512 00:32:11,470 --> 00:32:15,590 And of course, it depends on that number. 513 00:32:15,590 --> 00:32:16,090 Yeah. 514 00:32:16,090 --> 00:32:19,810 So it depends on that number, and you exactly 515 00:32:19,810 --> 00:32:23,070 saw how it depended on that number. 516 00:32:23,070 --> 00:32:25,210 Right. 517 00:32:25,210 --> 00:32:27,070 But now we have a different problem. 518 00:32:27,070 --> 00:32:29,570 And we're going to finish it. 519 00:32:29,570 --> 00:32:30,070 OK. 520 00:32:30,070 --> 00:32:31,000 So what's my job? 521 00:32:31,000 --> 00:32:38,650 I'm going to choose S and beta to keep the eigenvalues of R small. 522 00:32:38,650 --> 00:32:42,450 So let's give the eigenvalues of R a name. 523 00:32:42,450 --> 00:32:50,490 So R-- let's say R has eigenvalues e1-- that 524 00:32:50,490 --> 00:32:56,840 depends on the lambda and the S and the beta-- and e2. 
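For a symmetric positive definite 2 by 2 matrix the condition number kappa = lambda_max / lambda_min comes out in closed form. A small Python check (my illustration; the value b = 0.01 is just a sample) using the lecture's model matrix S = [[b, 0], [0, 1]]:

```python
import math

def sym2x2_eigs(a, b, d):
    """Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, d]]."""
    mid = (a + d) / 2.0
    r = math.hypot((a - d) / 2.0, b)
    return mid - r, mid + r          # (lambda_min, lambda_max)

def condition_number(a, b, d):
    lo, hi = sym2x2_eigs(a, b, d)
    return hi / lo

# model problem from the lecture: S = [[b, 0], [0, 1]] with b = 0.01,
# so kappa = 1/b = 100 -- large kappa means a hard, elongated bowl
kappa = condition_number(0.01, 0.0, 1.0)
```

When the matrix is a multiple of the identity, both eigenvalues agree and kappa is exactly 1, the trivial case from the lecture.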
525 00:33:00,400 --> 00:33:03,700 So those are the eigenvalues of R-- 526 00:33:03,700 --> 00:33:07,210 just giving a letter to them. 527 00:33:07,210 --> 00:33:09,430 So what's our job? 528 00:33:09,430 --> 00:33:14,680 We want to choose S and beta to make those eigenvalues as 529 00:33:14,680 --> 00:33:16,900 small as possible. 530 00:33:16,900 --> 00:33:17,680 Right? 531 00:33:17,680 --> 00:33:24,770 Small eigenvalues-- if R has small eigenvalues, its powers-- 532 00:33:24,770 --> 00:33:29,930 every step multiplies by R. So the convergence rate 533 00:33:29,930 --> 00:33:32,450 with momentum is-- 534 00:33:32,450 --> 00:33:36,410 depends on the powers of R getting small fast. 535 00:33:36,410 --> 00:33:39,350 It depends on the eigenvalues being small. 536 00:33:39,350 --> 00:33:48,500 We want to minimize the largest eigenvalue. 537 00:33:48,500 --> 00:33:56,000 So I'll say the maximum of e1 and e2-- 538 00:33:56,000 --> 00:33:57,650 that's our job. 539 00:33:57,650 --> 00:34:01,430 Minimize-- we want to choose S and beta to minimize 540 00:34:01,430 --> 00:34:03,550 the largest eigenvalue. 541 00:34:03,550 --> 00:34:05,560 Because if there's one small eigenvalue, 542 00:34:05,560 --> 00:34:08,679 but the other is big, then the other one is going to kill us. 543 00:34:08,679 --> 00:34:12,670 So we have to get both eigenvalues down. 544 00:34:12,670 --> 00:34:16,239 And of course, those depend on lambda. 545 00:34:16,239 --> 00:34:18,050 E1 depends on lambda. 546 00:34:18,050 --> 00:34:20,620 So we have a little algebra problem. 547 00:34:20,620 --> 00:34:23,679 And this is what I described as a miracle-- 548 00:34:23,679 --> 00:34:26,770 the fact that this little algebra problem-- 549 00:34:26,770 --> 00:34:30,969 the eigenvalues of that matrix, e1 and e2, which 550 00:34:30,969 --> 00:34:35,080 depend on lambda in some way. 
551 00:34:35,080 --> 00:34:39,159 And we want to make both e1 and e2 small-- 552 00:34:39,159 --> 00:34:42,040 the maximum of those-- of them. 553 00:34:42,040 --> 00:34:47,050 And we have to do it for all the eigenvalues lambda, 554 00:34:47,050 --> 00:34:48,639 because we have to-- 555 00:34:48,639 --> 00:34:54,370 we're now thinking-- we've been tracking each eigenvector. 556 00:34:54,370 --> 00:34:56,020 So that gave us 1-- 557 00:34:56,020 --> 00:34:59,930 so this is for all possible lambda. 558 00:34:59,930 --> 00:35:03,350 So we have to decide, what do I mean by all possible lambda? 559 00:35:03,350 --> 00:35:12,910 And I mean all lambda that are between some m and M. 560 00:35:12,910 --> 00:35:17,200 There is a beautiful problem. 561 00:35:17,200 --> 00:35:18,790 You have a 2 by 2 matrix. 562 00:35:18,790 --> 00:35:22,960 You can find its eigenvalues. 563 00:35:22,960 --> 00:35:24,610 They depend on lambda. 564 00:35:24,610 --> 00:35:27,815 And what we-- all we know about lambda is it's between m 565 00:35:27,815 --> 00:35:32,920 and cap M. And also, they also depend on S and beta-- 566 00:35:32,920 --> 00:35:35,380 the two parameters we can choose. 567 00:35:35,380 --> 00:35:37,780 And we want to choose those parameters, 568 00:35:37,780 --> 00:35:43,060 so that for all the possible eigenvalues, 569 00:35:43,060 --> 00:35:45,910 the larger of the two eigenvalues 570 00:35:45,910 --> 00:35:47,490 will be as small as possible. 571 00:35:47,490 --> 00:35:51,040 That's-- it's a little bit of algebra, 572 00:35:51,040 --> 00:35:54,730 but do you see that that's the tricky-- 573 00:35:54,730 --> 00:35:59,680 that-- I shouldn't say tricky, because it comes out-- 574 00:35:59,680 --> 00:36:03,760 this is the one that is a miracle in the simplicity 575 00:36:03,760 --> 00:36:05,270 of the solution. 576 00:36:05,270 --> 00:36:05,930 OK. 
577 00:36:05,930 --> 00:36:07,150 And I'm going to-- 578 00:36:07,150 --> 00:36:10,120 in fact, maybe I'll move over here to write the answer. 579 00:36:13,930 --> 00:36:16,570 OK. 580 00:36:16,570 --> 00:36:19,690 And I just want to say that miracles 581 00:36:19,690 --> 00:36:22,440 don't happen so often in math. 582 00:36:22,440 --> 00:36:26,470 There is-- all of mathematics-- the whole point of math 583 00:36:26,470 --> 00:36:28,810 is to explain miracles. 584 00:36:28,810 --> 00:36:33,850 So there is something to explain here, 585 00:36:33,850 --> 00:36:37,390 and I don't have my finger on it yet. 586 00:36:37,390 --> 00:36:41,230 Because-- anyway, it happens. 587 00:36:41,230 --> 00:36:45,550 So let me tell you what the right S, and the right beta, 588 00:36:45,550 --> 00:36:53,500 and the resulting minimum eigenvalue are. 589 00:36:53,500 --> 00:37:00,300 So again, they depend on little m and big M. 590 00:37:00,300 --> 00:37:05,230 That's a very nice feature, which we expect. 591 00:37:05,230 --> 00:37:07,680 And they depend on the ratio. 592 00:37:07,680 --> 00:37:08,190 OK. 593 00:37:08,190 --> 00:37:09,540 So that ratio-- all right. 594 00:37:09,540 --> 00:37:11,340 Let's see it. 595 00:37:11,340 --> 00:37:12,300 OK. 596 00:37:12,300 --> 00:37:13,275 So the best S-- 597 00:37:18,750 --> 00:37:29,470 the S optimal has the formula 2 over the square root of lambda max 598 00:37:29,470 --> 00:37:37,290 plus the square root of lambda min-- that's the square root of M plus the square root of m-- all squared. 599 00:37:37,290 --> 00:37:38,730 Amazing. OK. 600 00:37:38,730 --> 00:37:49,020 And beta optimal turns out to be the square root of M 601 00:37:49,020 --> 00:37:53,760 minus the square root of little m, over the square root of M 602 00:37:53,760 --> 00:37:57,592 plus the square root of little m, all squared. 603 00:37:57,592 --> 00:37:59,550 And of course, we know what these numbers are-- 604 00:37:59,550 --> 00:38:02,430 1 and b, in our model problem. 
605 00:38:02,430 --> 00:38:06,720 That's where I'm going to get this square root of-- 606 00:38:06,720 --> 00:38:09,660 this is 1 minus the square root-- oh sorry, b. 607 00:38:09,660 --> 00:38:13,050 This is 1 minus the square root of b. 608 00:38:13,050 --> 00:38:17,520 In fact, for our example-- 609 00:38:17,520 --> 00:38:19,670 well, let me just write what they would be. 610 00:38:19,670 --> 00:38:25,080 2 over 1 plus square root of b, all squared, 611 00:38:25,080 --> 00:38:29,700 and 1 minus square root of b over 1 plus square root of b, squared-- 612 00:38:29,700 --> 00:38:33,530 you see where this is-- 613 00:38:33,530 --> 00:38:36,510 1 minus square root of b is beginning to appear in that. 614 00:38:36,510 --> 00:38:38,910 It appears in this solution to this problem. 615 00:38:38,910 --> 00:38:41,775 And then I have to tell you what the-- 616 00:38:45,090 --> 00:38:49,700 how small do these optimal choices 617 00:38:49,700 --> 00:38:52,520 make the eigenvalues of R, right? 618 00:38:52,520 --> 00:38:57,600 This is what we're really paying attention to, because 619 00:38:57,600 --> 00:38:59,210 if the eigenvalues-- 620 00:38:59,210 --> 00:39:02,600 that matrix tells us what happens at every step. 621 00:39:02,600 --> 00:39:06,860 And its eigenvalues have to be small to get fast convergence. 622 00:39:06,860 --> 00:39:08,570 So how small are they? 623 00:39:08,570 --> 00:39:09,830 Well they involve this-- 624 00:39:13,480 --> 00:39:13,980 yeah. 625 00:39:13,980 --> 00:39:17,150 So it's the number that I've seen. 626 00:39:17,150 --> 00:39:21,630 So in this case, the e's-- 627 00:39:21,630 --> 00:39:29,300 the eigenvalues of R-- 628 00:39:29,300 --> 00:39:32,090 that's the iterating matrix-- 629 00:39:32,090 --> 00:39:36,560 are below-- now you're going to see the 1 minus square root 630 00:39:36,560 --> 00:39:41,060 of b over 1 plus square root of b-- 631 00:39:41,060 --> 00:39:43,220 the square root of beta, in fact. 632 00:39:43,220 --> 00:39:44,480 Let me just see. 
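The formulas on the board can be checked numerically. This Python sketch (mine; the lecture only states the formulas) computes the optimal s and beta for eigenvalue bounds [m, M] and confirms that every lambda in the range then sees the same factor per step, (sqrt(M) - sqrt(m)) / (sqrt(M) + sqrt(m)) -- which is 1 minus root b over 1 plus root b in the model problem:

```python
import cmath
import math

def optimal_momentum(m, M):
    """The board's optimal choices for eigenvalues in [m, M]."""
    s = (2.0 / (math.sqrt(M) + math.sqrt(m))) ** 2
    beta = ((math.sqrt(M) - math.sqrt(m)) /
            (math.sqrt(M) + math.sqrt(m))) ** 2
    return s, beta

def spectral_radius(lam, s, beta):
    # largest |eigenvalue| of R = [[1, -s], [lam, beta - lam*s]]
    trace = 1.0 + beta - lam * s
    disc = cmath.sqrt(trace * trace - 4.0 * beta)
    return max(abs((trace + disc) / 2), abs((trace - disc) / 2))

# model problem: m = b = 0.01, M = 1
s_opt, beta_opt = optimal_momentum(0.01, 1.0)
rate = (1.0 - math.sqrt(0.01)) / (1.0 + math.sqrt(0.01))  # = 0.9 / 1.1
```

Compare with plain gradient descent, whose best factor for this problem is (1 - b)/(1 + b), about 0.98: momentum brings it down to about 0.82, the square root of beta.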
633 00:39:44,480 --> 00:39:45,590 Yeah. 634 00:39:45,590 --> 00:39:50,450 It happens to come out that number again. 635 00:39:50,450 --> 00:39:53,470 So that's the conclusion. 636 00:39:53,470 --> 00:39:57,790 That with the right choice of S and beta, 637 00:39:57,790 --> 00:40:03,490 by adding this look back term-- look back one step-- 638 00:40:03,490 --> 00:40:05,920 you get this improvement. 639 00:40:05,920 --> 00:40:13,490 And it happens, and you see it in practice, of course. 640 00:40:13,490 --> 00:40:15,650 You'll see it exactly. 641 00:40:15,650 --> 00:40:26,310 And so momentum does the job. 642 00:40:26,310 --> 00:40:30,290 Now I'm going to mention what the Nesterov-- 643 00:40:30,290 --> 00:40:33,600 Nesterov had a slightly different way to do it, 644 00:40:33,600 --> 00:40:37,170 and I'll tell you what that is. 645 00:40:37,170 --> 00:40:40,320 But it's the same idea-- get a second thing. 646 00:40:40,320 --> 00:40:42,540 So let's see if I can find that. 647 00:40:42,540 --> 00:40:44,300 Yeah, Nesterov. 648 00:40:44,300 --> 00:40:44,800 OK. 649 00:40:51,250 --> 00:40:53,040 Here we go. 650 00:40:53,040 --> 00:40:55,770 So let me bring Nesterov's name down. 651 00:41:01,740 --> 00:41:07,320 So that's basically what I wanted to say about number 1. 652 00:41:07,320 --> 00:41:09,300 And when you see Nesterov, you'll 653 00:41:09,300 --> 00:41:14,910 see that it's a similar idea of involving the previous time 654 00:41:14,910 --> 00:41:16,140 value. 655 00:41:16,140 --> 00:41:17,550 OK. 656 00:41:17,550 --> 00:41:24,720 There are very popular methods in use now 657 00:41:24,720 --> 00:41:28,500 for machine learning that involve-- 658 00:41:28,500 --> 00:41:29,940 by a simple formula-- 659 00:41:29,940 --> 00:41:34,020 all the previous values, by sort of a-- 660 00:41:34,020 --> 00:41:36,970 just by an addition of a bunch of terms. 
661 00:41:36,970 --> 00:41:44,160 So it's really-- so it goes under the names 662 00:41:44,160 --> 00:41:50,970 adagrad, or others. 663 00:41:50,970 --> 00:41:54,510 Those of you who already know about machine learning 664 00:41:54,510 --> 00:41:55,980 will know what I'm speaking about. 665 00:41:55,980 --> 00:41:58,020 And I'll say more about those. 666 00:41:58,020 --> 00:41:59,910 Yeah. 667 00:41:59,910 --> 00:42:02,790 But it doesn't involve a separate coefficient 668 00:42:02,790 --> 00:42:05,490 for each previous value, or that would 669 00:42:05,490 --> 00:42:08,880 be a momentous amount of work. 670 00:42:08,880 --> 00:42:12,120 So now I just want to tell you what Nesterov is, and then 671 00:42:12,120 --> 00:42:13,240 we're good. 672 00:42:13,240 --> 00:42:13,740 OK. 673 00:42:13,740 --> 00:42:14,880 Nesterov's idea. 674 00:42:18,366 --> 00:42:20,820 Let me bring that down. 675 00:42:20,820 --> 00:42:22,660 Shoot this up. 676 00:42:22,660 --> 00:42:23,972 Bring down Nesterov. 677 00:42:31,060 --> 00:42:35,170 Because he had an idea that you might not have thought of. 678 00:42:35,170 --> 00:42:38,790 Somehow the momentum idea was pretty natural-- 679 00:42:38,790 --> 00:42:41,770 to use that previous value. 680 00:42:41,770 --> 00:42:43,780 And actually, I would like to know 681 00:42:43,780 --> 00:42:46,810 what happens if you use two previous values, or three 682 00:42:46,810 --> 00:42:47,890 previous values. 683 00:42:47,890 --> 00:42:57,310 Can you then get improvements on this convergence rate 684 00:42:57,310 --> 00:43:00,550 by going back two steps or three steps? 685 00:43:00,550 --> 00:43:05,170 If I'd use the analogy with ordinary differential 686 00:43:05,170 --> 00:43:07,870 equations, maybe you know. 687 00:43:07,870 --> 00:43:12,720 So there are backward difference formulas. 
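The lecture only names these methods; as a point of reference, here is a minimal sketch of the ADAGRAD-style idea (the details are standard, but they come from outside this lecture): each coordinate gets its own effective step size, scaled by a running sum of its squared gradients, so there is one accumulator per weight rather than a separate coefficient for every previous value:

```python
import math

def adagrad_step(x, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad-style update (a sketch, not the lecture's method).
    Each coordinate is divided by the square root of the running
    sum of its squared gradients; lr and eps are illustrative."""
    new_accum = [a + g * g for a, g in zip(accum, grad)]
    new_x = [xi - lr * g / (math.sqrt(a) + eps)
             for xi, g, a in zip(x, grad, new_accum)]
    return new_x, new_accum
```

On a toy problem like f(x) = x squared, repeating this step shrinks x toward the minimum, with the effective step size shrinking automatically as gradients accumulate.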
688 00:43:12,720 --> 00:43:14,800 Do you know about those for-- 689 00:43:14,800 --> 00:43:18,380 those would be in MATLAB software, 690 00:43:18,380 --> 00:43:20,440 and all other software. 691 00:43:20,440 --> 00:43:22,750 Backward differences-- so maybe you 692 00:43:22,750 --> 00:43:27,040 go back two steps or four steps. 693 00:43:27,040 --> 00:43:29,800 If you're doing planetary calculations, 694 00:43:29,800 --> 00:43:33,460 if you're an astronomer, you go back maybe seven or eight steps 695 00:43:33,460 --> 00:43:35,950 to get super high accuracy. 696 00:43:35,950 --> 00:43:40,050 So that doesn't seem to have happened yet, 697 00:43:40,050 --> 00:43:42,110 but it should happen here-- 698 00:43:42,110 --> 00:43:43,150 to go back more. 699 00:43:43,150 --> 00:43:48,010 But Nesterov has this different way to go back. 700 00:43:48,010 --> 00:43:52,870 So his formula is XK plus 1-- the new X-- 701 00:43:52,870 --> 00:43:58,360 is YK-- so he's introducing something a little different-- 702 00:43:58,360 --> 00:44:03,790 minus S gradient f at YK. 703 00:44:09,100 --> 00:44:10,930 I'm a little surprised about that YK, 704 00:44:10,930 --> 00:44:13,330 but this is the point, here-- 705 00:44:13,330 --> 00:44:15,940 that the gradient is being evaluated 706 00:44:15,940 --> 00:44:18,010 at some different point. 707 00:44:18,010 --> 00:44:22,750 And then he has to give a formula for that to track those 708 00:44:22,750 --> 00:44:23,950 Y's. 709 00:44:23,950 --> 00:44:27,760 So the Y's are like the X's, but they 710 00:44:27,760 --> 00:44:33,230 are shifted a little bit by some term-- and beta would be fine. 711 00:44:33,230 --> 00:44:35,830 Oh no. 712 00:44:35,830 --> 00:44:39,830 Yeah-- beta-- have we got Nesterov here? 713 00:44:39,830 --> 00:44:40,330 Yes. 714 00:44:40,330 --> 00:44:45,150 Nesterov has a factor gamma in. 715 00:44:45,150 --> 00:44:45,650 Yeah. 716 00:44:45,650 --> 00:44:47,240 So all right. 
717 00:44:47,240 --> 00:44:50,170 Let me try to get this right. 718 00:44:50,170 --> 00:44:52,870 OK. 719 00:44:52,870 --> 00:44:53,540 All right. 720 00:44:53,540 --> 00:44:56,890 On a previous line, I've written the whole Nesterov thing. 721 00:44:56,890 --> 00:44:59,240 Here, let's see a Nesterov completely. 722 00:44:59,240 --> 00:45:00,230 And then it'll break-- 723 00:45:00,230 --> 00:45:04,010 then this is the step that breaks it into two first order steps. 724 00:45:04,010 --> 00:45:06,780 But you'll see the main formula here. 725 00:45:06,780 --> 00:45:08,230 XK plus 1 is XK. 726 00:45:10,750 --> 00:45:19,600 And then a beta times XK minus XK minus 1. 727 00:45:19,600 --> 00:45:22,570 So that's a momentum term. 728 00:45:22,570 --> 00:45:26,560 And then a typical gradient. 729 00:45:26,560 --> 00:45:29,950 But now here is Nesterov speaking up. 730 00:45:29,950 --> 00:45:35,710 Nesterov evaluates the gradient not at XK, not at XK minus 1. 731 00:45:35,710 --> 00:45:38,650 But at his own Nesterov point. 732 00:45:38,650 --> 00:45:41,950 So this is Nesterov's favorite point. 733 00:45:41,950 --> 00:45:46,210 XK plus gamma times XK minus XK minus 1. 734 00:45:46,210 --> 00:45:54,950 Some point, part way along that step. 735 00:45:54,950 --> 00:46:01,190 So this point-- because gamma is going to be some non-integer-- 736 00:46:01,190 --> 00:46:04,900 this evaluation point for the gradient of f 737 00:46:04,900 --> 00:46:07,570 is a little unexpected and weird, 738 00:46:07,570 --> 00:46:09,970 because it's not a mesh point. 739 00:46:09,970 --> 00:46:13,470 It's somewhere between. 740 00:46:13,470 --> 00:46:15,190 OK. 741 00:46:15,190 --> 00:46:17,170 Yeah. 742 00:46:17,170 --> 00:46:29,410 And then that-- so that involves XK plus 1, XK, and XK minus 1. 743 00:46:29,410 --> 00:46:33,260 So it's a second order-- 744 00:46:33,260 --> 00:46:35,580 there's a second order method here. 
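The main Nesterov formula, in one line: XK plus 1 = XK + beta (XK - XK minus 1) - s grad f(XK + gamma (XK - XK minus 1)). A Python sketch of that single step (mine; the particular numbers for s, beta, gamma in the usage below are only illustrative, not Nesterov's optimal choices):

```python
def nesterov_step(x_k, x_prev, grad_f, s, beta, gamma):
    """One step of the board's Nesterov form:
    x_{k+1} = x_k + beta*(x_k - x_{k-1})
              - s * grad_f( x_k + gamma*(x_k - x_{k-1}) ),
    for a scalar x.  s is the step size, beta the momentum
    coefficient, and gamma places the gradient evaluation point."""
    y = x_k + gamma * (x_k - x_prev)   # Nesterov's evaluation point
    return x_k + beta * (x_k - x_prev) - s * grad_f(y)
```

For a quick try, minimizing f(x) = x squared over 2 (so grad f(y) = y) with sample values s = 0.5, beta = gamma = 0.5 drives x to zero rapidly.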
745 00:46:35,580 --> 00:46:39,350 We're going to-- to analyze it, we're going to go through this 746 00:46:39,350 --> 00:46:45,260 same process of writing it as two first order steps-- 747 00:46:45,260 --> 00:46:48,590 two first-- two single step-- 748 00:46:48,590 --> 00:46:58,460 two one-step recursions from K to K plus 1, coupled together. 749 00:46:58,460 --> 00:47:03,230 Follow that same thing through, and then the result 750 00:47:03,230 --> 00:47:08,280 is, the same factor appears for him. 751 00:47:08,280 --> 00:47:11,810 The same factor-- this is also-- 752 00:47:11,810 --> 00:47:24,140 so the point is, this is for momentum and Nesterov, 753 00:47:24,140 --> 00:47:33,530 with some constant-- different by some constant. 754 00:47:33,530 --> 00:47:41,840 But the key quantity is that one, and it appears in both. 755 00:47:41,840 --> 00:47:49,550 So I don't propose, of course, to repeat these steps 756 00:47:49,550 --> 00:47:50,660 for Nesterov. 757 00:47:50,660 --> 00:47:54,770 But you see what you could do. 758 00:47:54,770 --> 00:47:59,720 You see that it involves K minus 1, K, and K plus 1. 759 00:47:59,720 --> 00:48:01,550 You write it as-- 760 00:48:01,550 --> 00:48:03,890 you follow an eigenvector. 761 00:48:03,890 --> 00:48:08,900 You write it as a coupled system of-- that's a one step. 762 00:48:08,900 --> 00:48:10,570 That has a matrix. 763 00:48:10,570 --> 00:48:12,320 You find the matrix. 764 00:48:12,320 --> 00:48:14,840 You find the eigenvalues of the matrix. 765 00:48:14,840 --> 00:48:17,210 You make those eigenvalues as small as possible. 766 00:48:17,210 --> 00:48:22,320 And you have optimized the coefficients in Nesterov. 767 00:48:22,320 --> 00:48:22,820 OK. 768 00:48:22,820 --> 00:48:27,800 That's sort of a lot of algebra that's 769 00:48:27,800 --> 00:48:32,840 at the heart of accelerated gradient descent. 
770 00:48:32,840 --> 00:48:37,670 And of course, it's worth doing because it's 771 00:48:37,670 --> 00:48:42,590 a tremendous saving in the convergence rate. 772 00:48:42,590 --> 00:48:44,630 OK. 773 00:48:44,630 --> 00:48:49,640 Anybody running in the marathon or just watching? 774 00:48:49,640 --> 00:48:53,480 It's possible to run, you know. 775 00:48:53,480 --> 00:48:57,350 Anyway, I'll see you after the marathon, next Wednesday. 776 00:48:57,350 --> 00:49:01,300 And Professor Boyd will also see you.