1 00:00:01,550 --> 00:00:03,920 The following content is provided under a Creative 2 00:00:03,920 --> 00:00:05,310 Commons license. 3 00:00:05,310 --> 00:00:07,520 Your support will help MIT OpenCourseWare 4 00:00:07,520 --> 00:00:11,610 continue to offer high-quality educational resources for free. 5 00:00:11,610 --> 00:00:14,180 To make a donation or to view additional materials 6 00:00:14,180 --> 00:00:18,140 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:18,140 --> 00:00:19,026 at ocw.mit.edu. 8 00:00:22,500 --> 00:00:25,020 GILBERT STRANG: OK, so this is an important day, 9 00:00:25,020 --> 00:00:27,730 and Friday was an important day. 10 00:00:27,730 --> 00:00:32,800 I hope you enjoyed Professor Sra's terrific lecture as much 11 00:00:32,800 --> 00:00:33,610 as I did. 12 00:00:33,610 --> 00:00:39,430 You probably saw me taking notes like mad for the section that's 13 00:00:39,430 --> 00:00:44,020 now to be written about stochastic gradient descent. 14 00:00:44,020 --> 00:00:49,020 And he promised a theorem, if you remember, 15 00:00:49,020 --> 00:00:50,710 and there wasn't time. 16 00:00:50,710 --> 00:00:52,720 And so he was going to send it to me 17 00:00:52,720 --> 00:00:54,760 or still is going to send it to me. 18 00:00:54,760 --> 00:00:57,520 I'll report, I haven't got it yet, 19 00:00:57,520 --> 00:01:04,599 but I'll bring it to class, waiting to see, hopefully. 20 00:01:04,599 --> 00:01:09,280 And that will give us a chance to review stochastic gradient 21 00:01:09,280 --> 00:01:13,930 descent, the central algorithm of deep learning. 22 00:01:13,930 --> 00:01:20,980 And then this today is about the central structure 23 00:01:20,980 --> 00:01:22,780 of deep neural nets. 
24 00:01:22,780 --> 00:01:28,780 And some of you will know already how they're connected, 25 00:01:28,780 --> 00:01:36,800 what the function F, the learning function-- 26 00:01:36,800 --> 00:01:40,360 you could call it the learning function-- 27 00:01:40,360 --> 00:01:41,980 that's constructed. 28 00:01:41,980 --> 00:01:46,990 The whole system is aiming at constructing this function 29 00:01:46,990 --> 00:01:52,180 F which learns the training data and then 30 00:01:52,180 --> 00:01:55,480 applying it to the test data. 31 00:01:55,480 --> 00:02:01,570 And the miracle is that it does so well in practice. 32 00:02:01,570 --> 00:02:07,150 That's what has transformed deep learning into such 33 00:02:07,150 --> 00:02:11,290 an important application. 34 00:02:17,230 --> 00:02:22,270 Chapter 7 has been up for months on the 35 00:02:22,270 --> 00:02:27,340 math.mit.edu/learningfromdata site, 36 00:02:27,340 --> 00:02:28,525 and I'll add it to Stellar. 37 00:02:28,525 --> 00:02:33,270 Of course, that's where you'll be looking for it. 38 00:02:33,270 --> 00:02:37,630 OK, and then the second, the back propagation, 39 00:02:37,630 --> 00:02:42,940 the way to compute the gradient, I'll 40 00:02:42,940 --> 00:02:45,940 probably reach that idea today. 41 00:02:45,940 --> 00:02:51,070 And you'll see it's the chain rule, but how is it organized. 42 00:02:51,070 --> 00:02:53,290 OK, so what's the structure? 43 00:02:53,290 --> 00:02:57,360 What's the plan for deep neural nets? 44 00:02:57,360 --> 00:02:58,830 Good. 45 00:02:58,830 --> 00:03:02,820 Starting here, so what we have is training data. 46 00:03:06,200 --> 00:03:13,640 So we have vectors, x1 to x-- 47 00:03:13,640 --> 00:03:17,840 what should I use for the number of samples 48 00:03:17,840 --> 00:03:21,590 that we have in the training data? 49 00:03:21,590 --> 00:03:24,480 Well, let's say D for Data. 50 00:03:24,480 --> 00:03:26,120 OK. 
51 00:03:26,120 --> 00:03:30,980 And each vector, those are called feature vectors, 52 00:03:30,980 --> 00:03:37,055 so equals feature vectors. 53 00:03:40,200 --> 00:03:48,010 So each one, each x, has like m features. 54 00:03:48,010 --> 00:03:55,230 So maybe my notation isn't so hot here. 55 00:03:55,230 --> 00:03:58,640 I have a whole lot of vectors. 56 00:03:58,640 --> 00:04:03,510 Let me not use the subscript for those right away. 57 00:04:03,510 --> 00:04:08,010 So vectors, feature vectors, and each vector 58 00:04:08,010 --> 00:04:13,170 has got maybe shall we say m features? 59 00:04:13,170 --> 00:04:17,250 Like, if we were measuring height and age and weight 60 00:04:17,250 --> 00:04:19,594 and so on, those would be features. 61 00:04:23,100 --> 00:04:28,140 The job of the neural network is to create-- 62 00:04:28,140 --> 00:04:29,730 and we're going to classify. 63 00:04:29,730 --> 00:04:35,220 Maybe we're going to classify men and women or boys 64 00:04:35,220 --> 00:04:36,080 and girls. 65 00:04:36,080 --> 00:04:41,330 So let's make it a classification problem, 66 00:04:41,330 --> 00:04:43,720 just binary. 67 00:04:43,720 --> 00:04:47,610 So the classification problem is-- 68 00:04:47,610 --> 00:04:48,600 what shall we say? 69 00:04:48,600 --> 00:04:59,040 Minus 1 or 1, or 0 or 1, or boy or girl, 70 00:04:59,040 --> 00:05:05,760 or cat or dog, or truck or car, or anyway, just two classes. 71 00:05:08,290 --> 00:05:12,360 So I'm just going to do two-class classification. 72 00:05:15,240 --> 00:05:18,480 We know which class the training data is in. 73 00:05:18,480 --> 00:05:21,970 For each vector x, we know the right answer. 74 00:05:21,970 --> 00:05:25,660 So we want to create a function that gives the right answer, 75 00:05:25,660 --> 00:05:30,460 and then we'll use that function on other data. 76 00:05:30,460 --> 00:05:33,420 So let me write that down. 
77 00:05:33,420 --> 00:05:42,600 Create a function F of x so that it gets-- 78 00:05:42,600 --> 00:05:50,140 mostly gets the class correct. 79 00:05:50,140 --> 00:05:58,010 In other words, F of x should be negative for when 80 00:05:58,010 --> 00:06:04,960 the classification is minus 1, and F of x 81 00:06:04,960 --> 00:06:11,470 should be positive when the classification is plus 1. 82 00:06:11,470 --> 00:06:14,080 And as we know, we don't necessarily 83 00:06:14,080 --> 00:06:17,860 have to get every x, every sample, right. 84 00:06:17,860 --> 00:06:19,990 That may be over-fitting. 85 00:06:19,990 --> 00:06:24,850 If there's some sample that's just truly weird, by getting 86 00:06:24,850 --> 00:06:26,380 that right we're going to be looking 87 00:06:26,380 --> 00:06:32,100 for truly weird data in the test set, 88 00:06:32,100 --> 00:06:33,830 and that's not a good idea. 89 00:06:39,620 --> 00:06:41,650 We're trying to discover the rule that 90 00:06:41,650 --> 00:06:46,420 covers almost all cases but not every crazy, weird case. 91 00:06:46,420 --> 00:06:47,290 OK? 92 00:06:47,290 --> 00:06:50,020 So that's our job, to create a function 93 00:06:50,020 --> 00:06:58,000 F of x that is correct on almost all of the training data. 94 00:07:01,860 --> 00:07:02,360 Yeah. 95 00:07:05,900 --> 00:07:12,760 So before I draw the picture of the network, 96 00:07:12,760 --> 00:07:23,750 let me just remember to mention the site Playground. 97 00:07:23,750 --> 00:07:26,840 I don't know if you've looked at that, so I'm going to ask you, 98 00:07:26,840 --> 00:07:28,640 playground.tensorflow.org. 99 00:07:36,650 --> 00:07:40,140 How many of you know that site or have met with it? 100 00:07:40,140 --> 00:07:42,530 Just a few, OK. 101 00:07:42,530 --> 00:07:46,140 OK, so it's not a very sophisticated site. 102 00:07:46,140 --> 00:07:49,680 It's got only four examples, four examples. 
103 00:07:57,060 --> 00:08:03,590 So one example is a whole lot of points that are blue, 104 00:08:03,590 --> 00:08:08,830 B for Blue, inside a bunch of points 105 00:08:08,830 --> 00:08:18,760 that are another set that are O for Orange, orange, blue. 106 00:08:18,760 --> 00:08:23,350 OK, so those are the two classes, orange and blue. 107 00:08:23,350 --> 00:08:34,480 So for the points x, the feature vector here is just the xy-- 108 00:08:34,480 --> 00:08:42,820 the features are the xy coordinates of the points. 109 00:08:42,820 --> 00:08:46,420 And our job is to find a function that's 110 00:08:46,420 --> 00:08:51,400 positive on these points and negative on those points. 111 00:08:51,400 --> 00:08:55,910 So there is a simple model problem, and I recommend-- 112 00:08:55,910 --> 00:08:58,060 well, just partly-- 113 00:08:58,060 --> 00:09:00,910 if you're an expert in deep learning. 114 00:09:00,910 --> 00:09:07,150 This is for children, but morally here, I certainly 115 00:09:07,150 --> 00:09:10,960 learned from playing in this playground. 116 00:09:10,960 --> 00:09:19,320 So you set the step size. 117 00:09:22,930 --> 00:09:25,980 Do you set it, or does it set it? 118 00:09:25,980 --> 00:09:27,570 I guess you can change it. 119 00:09:27,570 --> 00:09:30,310 I don't think I've changed it. 120 00:09:30,310 --> 00:09:31,320 What else do you set? 121 00:09:31,320 --> 00:09:41,460 Oh, you set the nonlinear activation, the nonlinear 122 00:09:41,460 --> 00:09:46,590 activation function. 123 00:09:46,590 --> 00:09:51,570 And let me just go over here and say what function people now 124 00:09:51,570 --> 00:09:52,500 mostly use. 125 00:09:52,500 --> 00:09:59,520 The activation function is called 126 00:09:59,520 --> 00:10:03,570 ReLU, pronounced different ways. 127 00:10:03,570 --> 00:10:06,470 I don't know how we got into that crazy name. 128 00:10:06,470 --> 00:10:13,430 It's this function: the larger of 0 and x. 
129 00:10:13,430 --> 00:10:15,680 So the function ReLU of x 130 00:10:15,680 --> 00:10:20,935 is the maximum, the larger, of 0 and x. 131 00:10:24,080 --> 00:10:27,400 The point is, it's not linear, and the point 132 00:10:27,400 --> 00:10:31,690 is that if we didn't allow nonlinearity in here somewhere, 133 00:10:31,690 --> 00:10:34,630 we couldn't even solve this playground problem. 134 00:10:34,630 --> 00:10:38,980 Because if our classifiers were all linear classifiers, 135 00:10:38,980 --> 00:10:44,410 like support vector machines, I couldn't separate the blue 136 00:10:44,410 --> 00:10:48,685 from the orange with a plane. 137 00:10:48,685 --> 00:10:53,130 It's got to somehow create some nonlinear function which maybe 138 00:10:53,130 --> 00:10:55,610 the function is trying to be-- 139 00:10:55,610 --> 00:11:03,430 a good function would be a function of r and theta maybe, 140 00:11:03,430 --> 00:11:07,060 maybe r minus 5. 141 00:11:07,060 --> 00:11:10,210 So maybe the distance out to that. 142 00:11:10,210 --> 00:11:13,970 Let's suppose that distance is 5. 143 00:11:13,970 --> 00:11:17,680 Then, r minus 5 will be negative on the blues, 144 00:11:17,680 --> 00:11:19,630 because r is small. 145 00:11:19,630 --> 00:11:22,720 And r minus 5 will be positive on the oranges, 146 00:11:22,720 --> 00:11:24,340 because r is bigger. 147 00:11:24,340 --> 00:11:30,705 And therefore, we will have the right signs, less than 0 148 00:11:30,705 --> 00:11:35,310 or greater than 0, and it'll classify 149 00:11:35,310 --> 00:11:39,960 this data, this training data. 150 00:11:39,960 --> 00:11:41,450 Yeah. 151 00:11:41,450 --> 00:11:43,230 So it has to do that. 152 00:11:43,230 --> 00:11:45,060 This is not a hard one to do. 153 00:11:45,060 --> 00:11:51,450 There are four examples, as I say, two are trivial. 154 00:11:51,450 --> 00:11:53,430 It finds a good function. 
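[In code, the two ideas just described-- ReLU, and the r minus 5 classifier for the circle data-- look like this. A minimal sketch; the radius 5 matches the lecture's example, but the sample points are made up for illustration:]

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), the larger of 0 and x, applied componentwise
    return np.maximum(0, x)

def circle_score(point, radius=5.0):
    # r - radius: negative inside the circle (blue), positive outside (orange)
    x, y = point
    return np.hypot(x, y) - radius

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
print(circle_score((1.0, 2.0)))          # negative: inner point, classified blue
print(circle_score((6.0, 3.0)))          # positive: outer point, classified orange
```

[The sign of the score is the predicted class, exactly the less-than-0 or greater-than-0 test described above.]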
155 00:11:53,430 --> 00:11:56,760 Well yeah, I've forgotten, they're so trivial, 156 00:11:56,760 --> 00:12:07,920 they shouldn't be mentioned, and then this is the medium test. 157 00:12:07,920 --> 00:12:10,350 And then the hard test is when you 158 00:12:10,350 --> 00:12:17,370 have a sort of spiral of oranges, 159 00:12:17,370 --> 00:12:20,880 and inside, you have a spiral of blues. 160 00:12:20,880 --> 00:12:24,905 That was cooked up by a fiend. 161 00:12:29,590 --> 00:12:35,020 So the system is trying to find a function 162 00:12:35,020 --> 00:12:39,580 that's positive on one spiral and negative on the other 163 00:12:39,580 --> 00:12:45,670 spiral, and that takes quite a bit of time, many, many epochs. 164 00:12:45,670 --> 00:12:47,320 I learned what an epoch is. 165 00:12:50,140 --> 00:12:51,670 Did you know what an epoch is? 166 00:12:51,670 --> 00:12:53,320 I didn't know whether it was just 167 00:12:53,320 --> 00:12:58,780 a fancy word for counting the steps in gradient descent. 168 00:12:58,780 --> 00:13:03,250 But it counts the steps, all right, 169 00:13:03,250 --> 00:13:07,930 but one epoch is the number of steps that matches 170 00:13:07,930 --> 00:13:11,610 the size of the training data. 171 00:13:11,610 --> 00:13:14,710 So if you have a million samples-- 172 00:13:14,710 --> 00:13:18,760 where ordinary gradient descent you would be doing a million-- 173 00:13:18,760 --> 00:13:23,780 you'd have a million by a million problem per step. 174 00:13:23,780 --> 00:13:27,040 Of course, stochastic gradient descent 175 00:13:27,040 --> 00:13:32,650 just does a mini-batch of 1 or 32 or something, but anyway. 176 00:13:37,810 --> 00:13:41,860 So you have to do it enough mini-batches 177 00:13:41,860 --> 00:13:44,440 so that the total number you've covered 178 00:13:44,440 --> 00:13:54,563 is the equivalent of one full run through the training data, 179 00:13:54,563 --> 00:13:55,980 and that was an interesting point. 
180 00:13:55,980 --> 00:13:57,340 Did you pick up that point? 181 00:13:57,340 --> 00:14:01,690 That in stochastic gradient descent, 182 00:14:01,690 --> 00:14:05,810 you could either do a mini-batch, 183 00:14:05,810 --> 00:14:11,320 and then put them back in the soup, so with replacement. 184 00:14:11,320 --> 00:14:17,760 Or you could just put your data in some order, 185 00:14:17,760 --> 00:14:20,610 from one to a zillion. 186 00:14:20,610 --> 00:14:25,290 So here's a first x and then more and more 187 00:14:25,290 --> 00:14:30,330 x's, and then just randomize the order. 188 00:14:30,330 --> 00:14:34,050 So you'd have to randomize the order for stochastic gradient 189 00:14:34,050 --> 00:14:36,930 descent to be reasonable, and then 190 00:14:36,930 --> 00:14:38,940 take a mini-batch and a mini-batch 191 00:14:38,940 --> 00:14:40,600 and a mini-batch and a mini-batch. 192 00:14:40,600 --> 00:14:44,070 And when you get to the bottom, you've finished one epoch. 193 00:14:44,070 --> 00:14:47,070 And then you'd probably randomize again, maybe, 194 00:14:47,070 --> 00:14:51,510 if you wanted to live right. 195 00:14:51,510 --> 00:14:56,040 And go through the mini-batches again, 196 00:14:56,040 --> 00:15:01,320 and probably do 1,000 times or more. 197 00:15:01,320 --> 00:15:06,060 Anyway, so I haven't said yet what you do, what this F of x 198 00:15:06,060 --> 00:15:09,870 is like, but you can sort of see it on the screen. 199 00:15:09,870 --> 00:15:14,410 Because as it creates this function F, 200 00:15:14,410 --> 00:15:17,150 it kind of plots it. 201 00:15:17,150 --> 00:15:23,680 And what you see on the screen is the 0 set for that function. 202 00:15:23,680 --> 00:15:27,810 So perfect would be for it to go through 0-- 203 00:15:27,810 --> 00:15:29,640 if I had another color. 204 00:15:29,640 --> 00:15:31,290 Oh, I do have another color. 205 00:15:31,290 --> 00:15:36,160 Look, this is the first time the whole semester blue is up here. 
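[The epoch and mini-batch bookkeeping described above can be sketched as follows. The batch size 32 and data sizes are illustrative, and the gradient update itself is left as a stub, since the lecture has not yet defined the function being trained:]

```python
import numpy as np

def run_epochs(X, n_epochs, batch_size=32, seed=0):
    # One epoch = enough mini-batches to pass once through all the training data.
    rng = np.random.default_rng(seed)
    n = len(X)
    steps = 0
    for _ in range(n_epochs):
        order = rng.permutation(n)              # re-randomize the order each epoch
        for start in range(0, n, batch_size):
            batch = X[order[start:start + batch_size]]
            # ...one stochastic gradient step on this mini-batch would go here...
            steps += 1
    return steps

X = np.zeros((1000, 5))   # 1,000 training samples, 5 features each
print(run_epochs(X, 3))   # ceil(1000 / 32) = 32 steps per epoch -> 96 steps
```

[This is the "randomize the order, take mini-batch after mini-batch, finish one epoch, randomize again" recipe, sampling without replacement within each epoch.]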
206 00:15:36,160 --> 00:15:43,310 OK, so if the function was positive there-- 207 00:15:43,310 --> 00:15:45,390 in this part, on the blues-- 208 00:15:45,390 --> 00:15:51,560 and negative outside that region for the oranges, that 209 00:15:51,560 --> 00:15:52,940 would be just what we want. 210 00:15:52,940 --> 00:15:53,840 Right? 211 00:15:53,840 --> 00:15:57,530 That would be what this little Playground site is creating. 212 00:15:57,530 --> 00:15:59,780 And on the screen, you'll see it. 213 00:15:59,780 --> 00:16:05,210 You'll see this curve, where it crosses 0. 214 00:16:05,210 --> 00:16:07,040 So that curve, where it crosses 0, 215 00:16:07,040 --> 00:16:10,130 is supposed to separate the two sets. 216 00:16:10,130 --> 00:16:12,610 One set is positive, one set is negative, 217 00:16:12,610 --> 00:16:15,250 where 0 is in between. 218 00:16:15,250 --> 00:16:19,870 And the point is, it's not a straight line, because we've 219 00:16:19,870 --> 00:16:22,100 got this nonlinear function. 220 00:16:22,100 --> 00:16:33,490 This is nonlinear, and it allows us to have 221 00:16:33,490 --> 00:16:36,520 functions like r minus 5. 222 00:16:36,520 --> 00:16:42,640 And so at 5, that's where the function would be 0, 223 00:16:42,640 --> 00:16:45,290 and you'll see that on the screen. 224 00:16:45,290 --> 00:16:50,110 You might just go to playground.tensorflow.org. 225 00:16:50,110 --> 00:16:53,650 Of course, TensorFlow is a big system. 226 00:16:53,650 --> 00:17:00,025 This is the child's department, but I 227 00:17:00,025 --> 00:17:01,150 thought it was pretty good. 228 00:17:01,150 --> 00:17:04,470 And then on this site, you decide 229 00:17:04,470 --> 00:17:08,800 how many layers there will be, how many neurons in each layer. 230 00:17:08,800 --> 00:17:13,140 So you create the structure that I'm about to draw. 
231 00:17:13,140 --> 00:17:21,180 And you won't be able to get to solve this problem 232 00:17:21,180 --> 00:17:25,290 to find a function F that learns that data 233 00:17:25,290 --> 00:17:31,440 without a number of layers and a number of neurons. 234 00:17:31,440 --> 00:17:34,080 If you don't give it enough, you'll see it struggling. 235 00:17:39,250 --> 00:17:46,120 The 0 set tries to follow this, but it gives up at some point. 236 00:17:46,120 --> 00:17:48,850 This one doesn't take too many layers, 237 00:17:48,850 --> 00:17:54,910 and the two trivial examples, just a few neurons do the job. 238 00:17:54,910 --> 00:17:55,720 OK. 239 00:17:55,720 --> 00:18:02,350 So now, that's a little commented on one website. 240 00:18:02,350 --> 00:18:05,530 If you know other websites that I should know 241 00:18:05,530 --> 00:18:10,430 and should call attention to, could you send me an email? 242 00:18:10,430 --> 00:18:15,000 I'm just not aware of everything that's out there. 243 00:18:15,000 --> 00:18:22,650 Or if you know a good Convolutional Neural Net, CNN, 244 00:18:22,650 --> 00:18:27,450 that is available to practice on, 245 00:18:27,450 --> 00:18:32,760 where you could give it the training set. 246 00:18:32,760 --> 00:18:34,730 That's what I'm talking about here. 247 00:18:38,680 --> 00:18:41,050 I'd be glad to know, because I just 248 00:18:41,050 --> 00:18:43,300 don't know all that I should. 249 00:18:43,300 --> 00:18:43,930 OK. 250 00:18:43,930 --> 00:18:47,430 So what does the function look like? 251 00:18:47,430 --> 00:18:51,230 Well, as I say, linear isn't going to do it, 252 00:18:51,230 --> 00:18:56,840 but linear is a very important part of it, of this function 253 00:18:56,840 --> 00:19:01,220 F. So the function F really has the form-- 254 00:19:01,220 --> 00:19:08,000 well, so we start here with a vector of one, 255 00:19:08,000 --> 00:19:11,840 two, three, four, m is five. 256 00:19:11,840 --> 00:19:19,840 This is the vector x, five components. 
257 00:19:19,840 --> 00:19:23,690 OK, so let me erase that now. 258 00:19:23,690 --> 00:19:36,700 OK, so then we have layer 1 with some number of points. 259 00:19:36,700 --> 00:19:47,670 Let's say, n1 is 6 neurons, and let me make this simple. 260 00:19:47,670 --> 00:19:51,420 I'll just have that one layer, and then I'll have the output. 261 00:19:51,420 --> 00:19:57,600 This will be the output layer, and it's just 262 00:19:57,600 --> 00:19:58,890 going to be one number. 263 00:20:01,490 --> 00:20:08,970 So I'm going to have a matrix, A1, that takes me from this. 264 00:20:08,970 --> 00:20:18,040 A1 will be 6 by 5, because I want 6 outputs and 5 inputs. 265 00:20:18,040 --> 00:20:23,570 6 by 5 matrix, so I have 30 weights to choose there. 266 00:20:23,570 --> 00:20:31,230 And so the y that comes out is going 267 00:20:31,230 --> 00:20:36,730 to be y1 will be A1 times x0. 268 00:20:36,730 --> 00:20:44,140 So x0 is the feature vector with 5 components. 269 00:20:44,140 --> 00:20:47,740 So that's a purely linear thing, but we also 270 00:20:47,740 --> 00:20:54,650 want an offset function, offset vector. 271 00:20:54,650 --> 00:20:56,120 So that's a vector. 272 00:20:56,120 --> 00:21:02,060 Then, this, the y that's coming out, has 6 components. 273 00:21:02,060 --> 00:21:07,940 The A1 is 6 by 5, the x0 was 5 by 1, 274 00:21:07,940 --> 00:21:10,550 and then of course, this is 6 by 1. 275 00:21:10,550 --> 00:21:12,830 So these are the weights. 276 00:21:15,500 --> 00:21:19,460 Yeah, I'll call them all weights, weights to compute. 277 00:21:25,940 --> 00:21:27,020 So these are connected. 278 00:21:27,020 --> 00:21:34,450 The usual picture is to show all these connections. 279 00:21:34,450 --> 00:21:38,270 I'll just put in some of them. 280 00:21:38,270 --> 00:21:49,010 So in here, we have 30 plus 6 parameters, 36 parameters, 281 00:21:49,010 --> 00:21:52,070 and then I'm going to close this. 
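[The dimension count on the board-- A1 is 6 by 5, the offset b1 has 6 components, 36 weights in this layer-- can be checked in a few lines. Random numbers stand in for the weights that training would actually choose:]

```python
import numpy as np

rng = np.random.default_rng(0)
m, n1 = 5, 6                        # 5 input features, 6 neurons in layer 1

A1 = rng.standard_normal((n1, m))   # 6 by 5 matrix: 30 weights
b1 = rng.standard_normal(n1)        # offset (bias) vector: 6 more weights

x0 = rng.standard_normal(m)         # one feature vector, 5 components
y1 = A1 @ x0 + b1                   # affine map: 6 by 5 times 5 by 1, plus 6 by 1

print(y1.shape)                     # (6,): six outputs from five inputs
print(A1.size + b1.size)            # 36 parameters in this layer
```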
282 00:21:52,070 --> 00:22:02,960 It's going to be a very shallow thing, so that will be just 1 283 00:22:02,960 --> 00:22:03,930 by 6. 284 00:22:03,930 --> 00:22:04,910 Yeah. 285 00:22:04,910 --> 00:22:05,410 OK. 286 00:22:08,780 --> 00:22:11,210 Right, so we're just getting one output. 287 00:22:11,210 --> 00:22:18,290 So that's just a vector at this final point, but of course, 288 00:22:18,290 --> 00:22:21,170 that the whole idea of deep neural nets 289 00:22:21,170 --> 00:22:23,810 is that you have many layers. 290 00:22:23,810 --> 00:22:29,420 So 36 more realistically is in the tens of thousands, 291 00:22:29,420 --> 00:22:32,330 and you have it multiple times. 292 00:22:32,330 --> 00:22:42,000 And the idea seems to be that you can separate what 293 00:22:42,000 --> 00:22:49,200 layer one learns about the data and from what layer two learns 294 00:22:49,200 --> 00:22:50,400 about the data. 295 00:22:50,400 --> 00:22:57,130 Layer one-- this A1, apparently by just looking after 296 00:22:57,130 --> 00:22:58,810 the computation-- 297 00:22:58,810 --> 00:23:04,630 this learns some basic facts about the data. 298 00:23:04,630 --> 00:23:13,930 The next, A2 which would go in here, would learn more detail, 299 00:23:13,930 --> 00:23:16,000 and then A3 would learn more details. 300 00:23:16,000 --> 00:23:19,540 So we would have a number of layers, 301 00:23:19,540 --> 00:23:27,150 and it's that construction that has made neural net successful. 302 00:23:27,150 --> 00:23:32,670 But I haven't finished, because right now, it's only linear. 303 00:23:32,670 --> 00:23:36,090 Right now, I just have, I'll call it A2 in here. 304 00:23:36,090 --> 00:23:38,700 Right now, I would just have a matrix multiplication 305 00:23:38,700 --> 00:23:47,850 apply A1 and then apply A2, but in between there is a 1 306 00:23:47,850 --> 00:23:55,930 by 1 action on each by this function. 
307 00:23:58,580 --> 00:24:02,860 So that function acts on that number 308 00:24:02,860 --> 00:24:06,610 to give that number back again or to give 0. 309 00:24:06,610 --> 00:24:09,850 So in there is ReLU. 310 00:24:09,850 --> 00:24:18,340 In this comes ReLU on each, 6 copies of ReLU acting 311 00:24:18,340 --> 00:24:20,820 on each of those 6 numbers. 312 00:24:20,820 --> 00:24:21,630 Right? 313 00:24:21,630 --> 00:24:32,460 So really x1 comes from y1 by applying ReLU to it. 314 00:24:32,460 --> 00:24:34,470 Then, that gives the x. 315 00:24:34,470 --> 00:24:37,830 So here are the y's from the linear part, 316 00:24:37,830 --> 00:24:39,540 and here are the x-- 317 00:24:39,540 --> 00:24:40,480 that's y1. 318 00:24:40,480 --> 00:24:46,920 That's a vector y1 from just the linear plus an affine map. 319 00:24:46,920 --> 00:24:51,120 Linear plus constant, that's affine. 320 00:24:51,120 --> 00:24:55,590 And then the next step is component by component 321 00:24:55,590 --> 00:25:03,240 we apply this function, and we get x1, and then do it 322 00:25:03,240 --> 00:25:05,100 again and again and again. 323 00:25:05,100 --> 00:25:07,660 So do you see the function? 324 00:25:07,660 --> 00:25:10,830 How do I describe now the function F of x? 325 00:25:14,150 --> 00:25:26,390 So the learning function which depends on the weights, 326 00:25:26,390 --> 00:25:28,655 on the A's and b's. 327 00:25:32,150 --> 00:25:39,270 So I start with an x, I apply A1 to it. 328 00:25:39,270 --> 00:25:41,610 Yeah, let me do this. 329 00:25:41,610 --> 00:25:44,600 This is the function F of x. 330 00:25:44,600 --> 00:25:48,710 F of x is going to be F3, let's say, 331 00:25:48,710 --> 00:25:58,520 of F2 of F1 of x, one, two, three, parentheses, right? 332 00:25:58,520 --> 00:26:01,520 OK, so it's a chain, you could say. 333 00:26:01,520 --> 00:26:06,830 F is a-- what's the right word for a chain of functions, 334 00:26:06,830 --> 00:26:09,890 if I take a function of a function? 
335 00:26:09,890 --> 00:26:12,800 The reason I use the word chain is that the chain rule 336 00:26:12,800 --> 00:26:14,360 gives the derivative. 337 00:26:14,360 --> 00:26:19,430 So a function of a function of a function, that's 338 00:26:19,430 --> 00:26:23,570 called composition, composing function. 339 00:26:23,570 --> 00:26:27,720 So this is a composition. 340 00:26:27,720 --> 00:26:30,900 I don't know if there's a standard symbol for starting 341 00:26:30,900 --> 00:26:36,820 with F1 and do some composition and do some composition. 342 00:26:36,820 --> 00:26:38,970 And now what are those separate F's? 343 00:26:43,180 --> 00:26:47,670 So the separate F's are the-- 344 00:26:47,670 --> 00:26:52,410 F1 of a vector would be-- it includes 345 00:26:52,410 --> 00:27:01,900 the ReLU part, the nonlinear part, of A1, x0 plus b1. 346 00:27:01,900 --> 00:27:07,590 So two parts, you do the linear or affine map 347 00:27:07,590 --> 00:27:13,080 on your feature vector, and then component 348 00:27:13,080 --> 00:27:18,560 by component you apply that nonlinear function. 349 00:27:18,560 --> 00:27:22,770 And it took some years before that nonlinear function 350 00:27:22,770 --> 00:27:27,630 became a big favorite. 351 00:27:27,630 --> 00:27:29,670 People imagined that it was better, 352 00:27:29,670 --> 00:27:33,180 it was important, to have a smooth function. 353 00:27:33,180 --> 00:27:42,950 So the original functions were sigmoids, like S curves, 354 00:27:42,950 --> 00:27:46,560 but of course, it turned out that experiments showed 355 00:27:46,560 --> 00:27:48,630 that this worked even better. 356 00:27:48,630 --> 00:27:52,140 Yeah, so that would be F1, and then F2 357 00:27:52,140 --> 00:27:55,780 would have the same form, and F3 would have the same form. 358 00:27:55,780 --> 00:27:59,640 So maybe this had 36 weights, and the next one 359 00:27:59,640 --> 00:28:04,440 would have another number and the next another number. 
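[Putting the pieces together for the shallow example on the board: F1 is ReLU applied to A1 x0 plus b1, and the output layer is the final 1 by 6 affine map. A sketch with 5 inputs, 6 neurons, 1 output; the random weights are placeholders for what training would learn, and a deeper net would simply repeat the affine-then-ReLU pattern:]

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

rng = np.random.default_rng(1)

# Layer 1: affine map from 5 features to 6 neurons, then componentwise ReLU
A1, b1 = rng.standard_normal((6, 5)), rng.standard_normal(6)
# Output layer: a 1 by 6 affine map giving one number whose sign is the class
A2, b2 = rng.standard_normal((1, 6)), rng.standard_normal(1)

def F(x0):
    x1 = relu(A1 @ x0 + b1)   # F1: linear plus constant (affine), then ReLU
    return A2 @ x1 + b2       # final affine map closes the composition

out = F(rng.standard_normal(5))
print(out.shape)              # (1,): one output, classified by its sign
```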
360 00:28:04,440 --> 00:28:08,940 You get quite complicated functions by composition, 361 00:28:08,940 --> 00:28:14,640 by like e to the sine of x, or e to the sine of the logarithm 362 00:28:14,640 --> 00:28:18,510 of x, or things like that. 363 00:28:18,510 --> 00:28:23,190 Pure math has asked, what functions can you get? 364 00:28:23,190 --> 00:28:24,780 Try to think of them all. 365 00:28:24,780 --> 00:28:27,690 Now, what kind of functions do we have here? 366 00:28:27,690 --> 00:28:33,870 What can I say about F of x as a function, as a math person? 367 00:28:33,870 --> 00:28:37,140 What kind of a function is it? 368 00:28:37,140 --> 00:28:42,660 So it's created out of matrices and vectors, 369 00:28:42,660 --> 00:28:49,980 out of a linear or affine map, followed by a nonlinear, 370 00:28:49,980 --> 00:28:54,490 by that particular nonlinear function. 371 00:28:54,490 --> 00:28:56,980 So what kind of a function is it? 372 00:28:56,980 --> 00:29:04,000 Well, I've written those words down up here, and F of x 373 00:29:04,000 --> 00:29:07,960 is going to be a continuous piecewise linear function. 374 00:29:10,630 --> 00:29:15,580 Because every step is continuous, 375 00:29:15,580 --> 00:29:17,800 that's a continuous function. 376 00:29:17,800 --> 00:29:19,900 Linear functions are continuous functions, 377 00:29:19,900 --> 00:29:24,700 so we're taking a composition of continuous functions, 378 00:29:24,700 --> 00:29:26,500 so it's continuous. 379 00:29:26,500 --> 00:29:30,940 And it's piecewise linear, because part of it is linear, 380 00:29:30,940 --> 00:29:32,950 and part of it is piecewise linear. 381 00:29:35,460 --> 00:29:51,480 So this is some continuous, piecewise, linear function 382 00:29:51,480 --> 00:29:58,900 of x, x in m dimensions. 383 00:29:58,900 --> 00:29:59,400 OK. 
384 00:30:02,830 --> 00:30:07,810 So one little math question which I think 385 00:30:07,810 --> 00:30:13,810 helps to understand, to like to swallow 386 00:30:13,810 --> 00:30:19,960 the idea of a chain, of the kind of chain we have here, 387 00:30:19,960 --> 00:30:22,525 of linear followed by ReLU. 388 00:30:27,850 --> 00:30:29,470 So here's my question. 389 00:30:29,470 --> 00:30:31,450 This is the question I'm going to ask. 390 00:30:31,450 --> 00:30:34,660 And by the way, back propagation is certainly 391 00:30:34,660 --> 00:30:37,510 going to come Wednesday rather than today. 392 00:30:37,510 --> 00:30:41,120 That's a major topic in itself. 393 00:30:41,120 --> 00:30:44,450 So let me keep going with this function. 394 00:30:47,180 --> 00:30:50,750 Could you get any function whatsoever this way? 395 00:30:50,750 --> 00:30:53,270 Well, no, you only get continuous, piecewise, 396 00:30:53,270 --> 00:30:56,500 linear functions. 397 00:30:56,500 --> 00:30:59,060 It's an interesting case. 398 00:30:59,060 --> 00:31:01,840 Let me just ask you. 399 00:31:01,840 --> 00:31:04,810 One of the exercises says, if I took 400 00:31:04,810 --> 00:31:09,790 two continuous, piecewise, linear functions-- 401 00:31:09,790 --> 00:31:12,640 the next 20 minutes are an attempt 402 00:31:12,640 --> 00:31:16,780 to give us a picture of the graph of a piecewise, 403 00:31:16,780 --> 00:31:23,605 linear function in say a function of two variables. 404 00:31:23,605 --> 00:31:30,160 So I have m equal to 2, and I draw its graph. 405 00:31:30,160 --> 00:31:32,340 OK, help me to draw this graph. 406 00:31:32,340 --> 00:31:38,330 So this would be a graph of F of x1, x2, 407 00:31:38,330 --> 00:31:41,000 and it's going to be continuous and piecewise linear. 408 00:31:41,000 --> 00:31:43,550 So what does its graph look like? 409 00:31:43,550 --> 00:31:45,680 That's the question. 410 00:31:45,680 --> 00:31:50,540 What's the graph of a piecewise, linear function looks like? 
411 00:31:50,540 --> 00:32:00,220 Well, it's got flat pieces in between the change from-- 412 00:32:00,220 --> 00:32:04,230 I do say piecewise, that means it's got different pieces. 413 00:32:04,230 --> 00:32:12,060 But within a piece, it's linear, and the pieces fit with each other, 414 00:32:12,060 --> 00:32:13,800 because it's continuous. 415 00:32:13,800 --> 00:32:20,050 So I visualize, well, it's like origami. 416 00:32:20,050 --> 00:32:24,760 This is the theory of origami almost. 417 00:32:24,760 --> 00:32:27,320 So right, origami, you take a flat thing, 418 00:32:27,320 --> 00:32:32,580 and you fold it along straight folds. 419 00:32:32,580 --> 00:32:34,330 So what's different from origami? 420 00:32:34,330 --> 00:32:35,300 Maybe not much. 421 00:32:38,136 --> 00:32:43,520 Well, maybe origami allows more than we allow here, 422 00:32:43,520 --> 00:32:46,790 or origami would allow you to fold it up and over. 423 00:32:46,790 --> 00:32:51,050 So origami would give you a multi-valued thing, 424 00:32:51,050 --> 00:32:55,880 because it's got a top and a bottom and other folds. 425 00:32:55,880 --> 00:33:02,980 This is just going out to infinity in flat pieces, 426 00:33:02,980 --> 00:33:06,020 and the question will be, how many pieces? 427 00:33:06,020 --> 00:33:07,710 So let me ask you that question. 428 00:33:07,710 --> 00:33:11,890 How many pieces do I have? 429 00:33:11,890 --> 00:33:15,440 Do you see what I mean by a piece? 430 00:33:15,440 --> 00:33:19,300 So I'm thinking of a graph that has these flat pieces, 431 00:33:19,300 --> 00:33:23,880 and they're connected along straight edges. 432 00:33:23,880 --> 00:33:30,870 And those straight edges come from the ReLU operation. 433 00:33:30,870 --> 00:33:33,520 Well, that's got two pieces. 434 00:33:33,520 --> 00:33:35,290 Actually, we could do it 1D. 435 00:33:35,290 --> 00:33:39,170 In 1D, we could count the number of pieces pretty easily. 436 00:33:39,170 --> 00:33:41,450 So what would be a piecewise linear? 
437 00:33:41,450 --> 00:33:45,260 Let me put it over here on the side and erase it soon. 438 00:33:45,260 --> 00:33:48,140 OK. 439 00:33:48,140 --> 00:33:56,350 So here's m equal 1, a continuous, piecewise 440 00:33:56,350 --> 00:34:00,380 linear F. I'll just draw its graph. 441 00:34:00,380 --> 00:34:08,159 So OK, so it's got straight pieces, 442 00:34:08,159 --> 00:34:11,300 straight pieces like so. 443 00:34:11,300 --> 00:34:12,855 Yeah, you've got the idea. 444 00:34:12,855 --> 00:34:14,540 It's a broken line type. 445 00:34:14,540 --> 00:34:16,310 Sometimes, people say broken line, 446 00:34:16,310 --> 00:34:21,350 but I'm never sure that's a good description of this. 447 00:34:21,350 --> 00:34:24,500 Piecewise linear, continuous, so it's continuous 448 00:34:24,500 --> 00:34:30,070 because the pieces meet, and it's piecewise 449 00:34:30,070 --> 00:34:31,900 linear, obviously. 450 00:34:31,900 --> 00:34:34,909 OK, so that's the kind of picture 451 00:34:34,909 --> 00:34:39,460 I have for a function of one variable. 452 00:34:39,460 --> 00:34:42,650 Now, my question-- 453 00:34:42,650 --> 00:34:47,630 as an aid to try to visualize this function in 2D-- 454 00:34:47,630 --> 00:34:51,500 is to see if we can count the pieces, 455 00:34:51,500 --> 00:34:53,540 see if we can count the pieces. 456 00:34:53,540 --> 00:34:55,639 Yes. 457 00:34:55,639 --> 00:34:58,060 So that's in the notes. 458 00:34:58,060 --> 00:35:06,270 I found it in a paper by five authors for a meeting. 459 00:35:08,946 --> 00:35:16,390 So actually, in the whole world of neural nets, 460 00:35:16,390 --> 00:35:21,150 it's the conferences every couple of years 461 00:35:21,150 --> 00:35:27,150 that everybody prepares for, submitting more than one paper. 462 00:35:27,150 --> 00:35:30,450 So it's kind of a piecewise linear conference, 463 00:35:30,450 --> 00:35:35,460 and those are the big conferences. 464 00:35:35,460 --> 00:35:36,300 OK.
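[Editor's note: the 1D picture just described can be sketched in a few lines of code. This is my own illustration, not part of the lecture, and every number in it is a made-up example: a linear term plus a weighted sum of shifted ReLUs is exactly a continuous, piecewise linear function of one variable, with one fold at each ReLU's kink.]

```python
# My own 1D sketch (not from the lecture): slope, bias, and the fold
# locations/weights below are arbitrary example values.

def relu(x):
    # ReLU: zero to the left of the kink, identity to the right
    return max(x, 0.0)

def f(x, folds=(-1.0, 0.5, 2.0), weights=(1.0, -2.0, 1.5), slope=0.3, bias=0.1):
    # linear part + one ReLU term per fold point c_i:
    # between consecutive folds the function is a straight line,
    # and the slope changes by w_i as x crosses c_i
    return slope * x + bias + sum(w * relu(x - c) for w, c in zip(weights, folds))
```

With N fold points in 1D, the line is cut into N + 1 linear pieces: the 3 folds above give 4 pieces, matching the broken-line graph on the board.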
465 00:35:36,300 --> 00:35:38,550 So this is the back propagation section, 466 00:35:38,550 --> 00:35:42,800 and I want to look at the-- 467 00:35:42,800 --> 00:35:43,300 OK. 468 00:35:46,560 --> 00:35:49,400 So this is a paper by Kleinberg and four others. 469 00:35:49,400 --> 00:35:54,530 Kleinberg, he's a computer science guy at Cornell. 470 00:35:54,530 --> 00:35:57,440 He was a PhD from here in math, and he's 471 00:35:57,440 --> 00:36:04,700 a very cool and significant person, 472 00:36:04,700 --> 00:36:13,930 not so much on neural networks as just this whole part 473 00:36:13,930 --> 00:36:15,130 of computer science. 474 00:36:15,130 --> 00:36:15,820 Right. 475 00:36:15,820 --> 00:36:19,390 So anyway, they and other people too 476 00:36:19,390 --> 00:36:20,740 have asked this same problem. 477 00:36:23,450 --> 00:36:24,910 Suppose I'm in two variables. 478 00:36:28,250 --> 00:36:33,830 So what are you imagining now for the surface, 479 00:36:33,830 --> 00:36:39,200 the graph of F of x and y? 480 00:36:39,200 --> 00:36:43,520 It has these lines, fold lines, right? 481 00:36:43,520 --> 00:36:45,600 I'm thinking it has fold lines. 482 00:36:48,840 --> 00:36:51,890 So I can start with a complete plane, and I fold it 483 00:36:51,890 --> 00:36:53,520 along one line. 484 00:36:53,520 --> 00:36:55,920 So now, it's like ReLU. 485 00:36:55,920 --> 00:37:00,070 It's one half plane there going into a different half plane 486 00:37:00,070 --> 00:37:00,570 there. 487 00:37:00,570 --> 00:37:03,130 Everybody with it? 488 00:37:03,130 --> 00:37:08,170 And now, I take that function, that surface 489 00:37:08,170 --> 00:37:14,020 which just has two parts, and I put in another fold. 490 00:37:14,020 --> 00:37:17,850 OK, how many parts have I got now? 491 00:37:17,850 --> 00:37:20,450 I think four, am I right? 492 00:37:20,450 --> 00:37:26,430 Four parts, yes, because this will be different from this, 493 00:37:26,430 --> 00:37:28,840 because it was folded along that line. 
494 00:37:28,840 --> 00:37:31,380 So these will be four different pieces. 495 00:37:31,380 --> 00:37:34,820 They have the same value at the center there, 496 00:37:34,820 --> 00:37:39,230 and they match along the lines. 497 00:37:39,230 --> 00:37:43,800 So the number of flat pieces is four for this. 498 00:37:43,800 --> 00:37:47,640 So that's with two folds, and now I just want to ask you, 499 00:37:47,640 --> 00:37:51,975 with m folds how many pieces are there? 500 00:37:51,975 --> 00:37:53,655 Can I get up to three folds? 501 00:37:56,330 --> 00:37:59,390 So I'm going to look for the number of folds. 502 00:37:59,390 --> 00:38:04,690 So let me just use a notation, maybe r. 503 00:38:04,690 --> 00:38:22,910 r is the number of flat pieces, and m is the dimension of x. 504 00:38:22,910 --> 00:38:29,530 In my picture, it's two, and N is the number of folds. 505 00:38:33,750 --> 00:38:35,340 So let me say it again. 506 00:38:35,340 --> 00:38:36,310 I'm taking a plane. 507 00:38:40,200 --> 00:38:42,240 I'll fold that plane-- 508 00:38:42,240 --> 00:38:45,060 because the dimension was two-- 509 00:38:45,060 --> 00:38:47,205 I'll fold it N times. 510 00:38:51,260 --> 00:38:52,860 How many pieces? 511 00:38:52,860 --> 00:38:53,930 How many flat pieces? 512 00:39:06,190 --> 00:39:10,150 This would be a central step in understanding 513 00:39:10,150 --> 00:39:13,750 how close the function-- 514 00:39:13,750 --> 00:39:18,730 what freedom you have in the function F. For example, 515 00:39:18,730 --> 00:39:22,850 can you approximate any continuous function 516 00:39:22,850 --> 00:39:27,910 by one of these functions F by taking enough folds? 517 00:39:27,910 --> 00:39:30,910 Seems like the answer should be yes, and it is yes. 518 00:39:34,020 --> 00:39:38,380 For pure math, that's one question. 519 00:39:38,380 --> 00:39:41,830 Is this class of functions universal? 
520 00:39:41,830 --> 00:39:44,230 So the universality theorem would 521 00:39:44,230 --> 00:39:49,060 be to say that any function-- 522 00:39:49,060 --> 00:39:53,680 sine x, whatever-- could be approximated 523 00:39:53,680 --> 00:39:59,380 as close as you like by one of these guys with enough folds. 524 00:39:59,380 --> 00:40:05,010 And over here, we're kind of making it more numerical. 525 00:40:05,010 --> 00:40:07,570 We're going to count the number of pieces 526 00:40:07,570 --> 00:40:10,480 just to see how quickly they grow. 527 00:40:10,480 --> 00:40:12,490 So what happens here? 528 00:40:12,490 --> 00:40:14,760 So I have four folds. 529 00:40:14,760 --> 00:40:18,660 Right now, I have N equal 2. 530 00:40:18,660 --> 00:40:23,080 m is 2 here in this picture. 531 00:40:23,080 --> 00:40:27,030 And I'm trying to draw this surface, and in here I've put in 2. 532 00:40:27,030 --> 00:40:28,050 Did I take N? 533 00:40:28,050 --> 00:40:34,030 Yeah, two folds, and now I'm going to go up to three folds. 534 00:40:34,030 --> 00:40:34,720 OK. 535 00:40:34,720 --> 00:40:37,890 So let me fold it along that line. 536 00:40:37,890 --> 00:40:39,570 How many pieces have I got now? 537 00:40:44,240 --> 00:40:49,600 Let's see, can I count those pieces? 538 00:40:49,600 --> 00:40:52,360 Is it seven? 539 00:40:52,360 --> 00:40:54,340 So what is a formula? 540 00:40:54,340 --> 00:40:55,810 What if I do another fold? 541 00:40:59,890 --> 00:41:01,910 Yeah, let's pretend we do another fold. 542 00:41:01,910 --> 00:41:02,630 Yeah? 543 00:41:02,630 --> 00:41:04,260 AUDIENCE: [INAUDIBLE] 544 00:41:04,260 --> 00:41:06,600 GILBERT STRANG: Uh, yeah. 545 00:41:06,600 --> 00:41:10,050 Well, maybe that's going to be it. 546 00:41:12,580 --> 00:41:15,020 It's a kind of nice question, because it asks 547 00:41:15,020 --> 00:41:16,970 you to visualize this thing. 548 00:41:16,970 --> 00:41:17,470 OK. 549 00:41:20,290 --> 00:41:22,030 So what happened?
550 00:41:22,030 --> 00:41:24,250 How many of those lines will be-- 551 00:41:24,250 --> 00:41:27,190 if I put in a fourth line-- 552 00:41:27,190 --> 00:41:29,040 how many? 553 00:41:29,040 --> 00:41:33,300 Yeah, how many new folds do I create? 554 00:41:33,300 --> 00:41:34,850 That's kind of the question, and I'm 555 00:41:34,850 --> 00:41:37,430 assuming that fourth line doesn't 556 00:41:37,430 --> 00:41:39,110 go through any of these points. 557 00:41:39,110 --> 00:41:40,655 It's sort of in general position. 558 00:41:43,660 --> 00:41:47,650 So I put it in a fourth line, da-da-da-da, there it is. 559 00:41:47,650 --> 00:41:50,620 OK, so what happened here? 560 00:41:50,620 --> 00:41:53,890 How many new ones did it create? 561 00:41:53,890 --> 00:41:55,360 How many new ones did it create? 562 00:41:58,690 --> 00:42:01,390 Let me make that one green, because I'm 563 00:42:01,390 --> 00:42:03,460 distinguishing that's the guy that's 564 00:42:03,460 --> 00:42:06,580 added after the original. 565 00:42:06,580 --> 00:42:07,660 We had seven. 566 00:42:07,660 --> 00:42:14,860 We had seven pieces, and now we've got more. 567 00:42:14,860 --> 00:42:15,580 Was it seven? 568 00:42:15,580 --> 00:42:16,390 It was, wasn't it? 569 00:42:16,390 --> 00:42:20,800 One, two, three, four, five, six, seven, but now how many 570 00:42:20,800 --> 00:42:22,090 pieces have I got? 571 00:42:22,090 --> 00:42:30,160 Or how many pieces did this new line create? 572 00:42:30,160 --> 00:42:32,750 We want to build it up, use a recursion. 573 00:42:32,750 --> 00:42:35,150 How many pieces did this new-- 574 00:42:35,150 --> 00:42:39,920 well, this new line created one new piece there. 575 00:42:39,920 --> 00:42:40,910 Right? 576 00:42:40,910 --> 00:42:43,760 One new piece there, one new piece there, 577 00:42:43,760 --> 00:42:50,050 one new piece there, so there are four new pieces. 578 00:42:50,050 --> 00:42:52,410 OK. 
579 00:42:52,410 --> 00:42:55,490 Yes, so there's some formula that's 580 00:42:55,490 --> 00:42:59,430 going to tell us that, and now what would the next one create? 581 00:42:59,430 --> 00:43:03,870 Well, now I have one, two, three, four lines. 582 00:43:03,870 --> 00:43:06,870 So now, I'm going to put through a fifth line, 583 00:43:06,870 --> 00:43:09,300 and that will create a whole bunch of pieces. 584 00:43:09,300 --> 00:43:14,970 I'm losing the thread of this argument, but you're onto it. 585 00:43:14,970 --> 00:43:16,410 Right? 586 00:43:16,410 --> 00:43:19,800 Yeah, so any suggestions? 587 00:43:19,800 --> 00:43:20,860 Yeah. 588 00:43:20,860 --> 00:43:23,910 AUDIENCE: Yeah, I think you add essentially the number of lines 589 00:43:23,910 --> 00:43:26,840 that you have each time you add a line at most. 590 00:43:26,840 --> 00:43:27,960 GILBERT STRANG: OK. 591 00:43:27,960 --> 00:43:30,230 Yes. 592 00:43:30,230 --> 00:43:30,930 That's right. 593 00:43:30,930 --> 00:43:34,290 So there is a recursion formula that I want to know, 594 00:43:34,290 --> 00:43:36,840 and I learned it from Kleinberg's paper. 595 00:43:39,820 --> 00:43:42,210 And then we have an addition to do, 596 00:43:42,210 --> 00:43:45,970 so the recursion will tell me how much it goes up 597 00:43:45,970 --> 00:43:49,330 with each new function, and then we have to add. 598 00:43:49,330 --> 00:43:49,960 OK. 599 00:43:49,960 --> 00:43:52,210 So the recursion formula, let me write that down. 600 00:43:56,520 --> 00:44:06,030 So this is r of N and m that I'd like to find a formula for. 601 00:44:06,030 --> 00:44:12,260 It's the number of flat pieces with an m dimensional surface-- 602 00:44:12,260 --> 00:44:14,510 well, we're taking m to be 2-- 603 00:44:14,510 --> 00:44:16,115 and N folds. 604 00:44:18,790 --> 00:44:22,340 So N equal 1, 2, 3. 605 00:44:22,340 --> 00:44:25,982 Let's write down the numbers we know. 606 00:44:25,982 --> 00:44:29,630 With one fold, how many pieces? 
607 00:44:29,630 --> 00:44:32,710 Two, good, so far so good. 608 00:44:35,760 --> 00:44:38,940 With one fold, there were two pieces. 609 00:44:38,940 --> 00:44:44,640 So this is the count r, and then with two folds, how many? 610 00:44:44,640 --> 00:44:46,740 Oh, we've gone past that point. 611 00:44:46,740 --> 00:44:50,130 So can we get back to just those two? 612 00:44:50,130 --> 00:44:51,280 Was it four? 613 00:44:51,280 --> 00:44:52,020 AUDIENCE: Yes. 614 00:44:52,020 --> 00:44:55,050 GILBERT STRANG: OK, thanks. 615 00:44:55,050 --> 00:44:59,790 Now, when I put in that third fold, how many did I have 616 00:44:59,790 --> 00:45:02,550 without the green line yet? 617 00:45:02,550 --> 00:45:06,180 Seven, was it seven? 618 00:45:06,180 --> 00:45:12,300 And when the fourth one went in, that green one, how many have I 619 00:45:12,300 --> 00:45:13,320 got in this picture? 620 00:45:15,990 --> 00:45:19,860 So the question is how many new ones did I create, I guess. 621 00:45:19,860 --> 00:45:24,270 So that line got chopped into that piece, that piece, 622 00:45:24,270 --> 00:45:27,930 that piece, that piece, four pieces for the new line. 623 00:45:27,930 --> 00:45:32,400 Four pieces for the new line, and then each of those pieces 624 00:45:32,400 --> 00:45:36,720 like added a flat bit. 625 00:45:36,720 --> 00:45:40,350 Because that piece from here to here 626 00:45:40,350 --> 00:45:43,260 separated these two which were previously 627 00:45:43,260 --> 00:45:45,770 just one piece, one flat piece. 628 00:45:45,770 --> 00:45:47,400 I folded on that line. 629 00:45:47,400 --> 00:45:48,420 I folded on this. 630 00:45:48,420 --> 00:45:49,380 I folded there. 631 00:45:49,380 --> 00:45:51,930 I think it went up by 4 to 11. 632 00:45:55,510 --> 00:45:59,740 So now, we just have to guess a formula that 633 00:45:59,740 --> 00:46:02,950 matches those numbers, and then of course, we really 634 00:46:02,950 --> 00:46:09,490 should guess it for any m and any N. 
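[Editor's note: before guessing the formula, the counts on the board (2, 4, 7, 11) can be checked numerically. This is my own sketch, not part of the lecture: drop N lines in general position, and count the flat pieces as the distinct sign patterns the lines carve out, sampled on a fine grid. The particular lines and the name `count_pieces` are my choices for illustration.]

```python
# Each line is (a, b, c) for a*x + b*y + c = 0. These four are in general
# position: no two parallel, no three through one point (checked by hand).
LINES = [(1.0, 0.0, 0.0),    # x = 0
         (0.0, 1.0, 0.0),    # y = 0
         (1.0, 1.0, -1.0),   # x + y = 1
         (1.0, -1.0, 0.5)]   # x - y = -0.5

def count_pieces(lines, lo=-3.0, hi=3.0, step=0.05):
    # A flat piece is exactly the set of points with one sign pattern,
    # so counting distinct patterns over a fine grid counts the pieces.
    patterns = set()
    x = lo + 0.001          # small offset so no grid point sits on a line
    while x < hi:
        y = lo + 0.001
        while y < hi:
            patterns.add(tuple(a * x + b * y + c > 0 for a, b, c in lines))
            y += step
        x += step
    return len(patterns)
```

Taking the first 1, 2, 3, 4 lines reproduces the counts 2, 4, 7, 11 from the blackboard, before any formula is written down.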
635 00:46:09,490 --> 00:46:13,390 And I'll write down the formula that they found. 636 00:46:16,340 --> 00:46:19,410 It involves binomial numbers. 637 00:46:19,410 --> 00:46:24,450 Everything in the world involves binomial numbers, 638 00:46:24,450 --> 00:46:28,650 because they satisfy every identity you could think of. 639 00:46:34,060 --> 00:46:35,065 So here's their formula. 640 00:46:38,920 --> 00:46:43,180 r with N folds, and we're in m dimensions. 641 00:46:43,180 --> 00:46:46,300 So we've really in our thinking had m equal to 2, 642 00:46:46,300 --> 00:46:53,470 but we should grow up and get m to be five dimensional. 643 00:46:53,470 --> 00:46:55,540 So we have a five dimensional-- 644 00:46:55,540 --> 00:46:57,700 let's not think about that. 645 00:46:57,700 --> 00:47:00,180 OK. 646 00:47:00,180 --> 00:47:03,030 So it turns out it's binomial numbers-- 647 00:47:03,030 --> 00:47:11,840 N 0, N 1, up to N m. 648 00:47:16,260 --> 00:47:28,880 So for m equals 2, which is my picture, it's N 0 plus N 1 649 00:47:28,880 --> 00:47:32,750 plus N 2, and what are these? 650 00:47:32,750 --> 00:47:37,560 What does that N 2 mean, for example? 651 00:47:37,560 --> 00:47:41,040 That's a binomial number. 652 00:47:41,040 --> 00:47:45,640 I don't know if you're keen on binomial numbers. 653 00:47:45,640 --> 00:47:50,170 Some people, their whole lives go into binomial numbers. 654 00:47:50,170 --> 00:47:53,290 So it's something like-- 655 00:47:53,290 --> 00:48:00,610 is it N factorial divided by N minus 2 factorial and 2 656 00:48:00,610 --> 00:48:01,390 factorial? 657 00:48:05,590 --> 00:48:08,070 I think that's what that number means. 658 00:48:08,070 --> 00:48:09,410 That's the binomial number. 659 00:48:12,860 --> 00:48:17,030 So at this point, I'm hoping to get the answer seven, I think. 660 00:48:22,290 --> 00:48:25,200 I'm in m equal to-- 661 00:48:25,200 --> 00:48:27,690 I've gone up to 2. 
662 00:48:27,690 --> 00:48:36,510 Yeah, so I think I've obviously allowed for three cuts, 663 00:48:36,510 --> 00:48:41,430 and the r, when we had just three, was 7. 664 00:48:46,380 --> 00:48:49,110 So now I'm taking N to be 3, 665 00:48:49,110 --> 00:48:56,050 and I'm hoping the answer is 7. 666 00:49:01,370 --> 00:49:03,170 So I add these three things. 667 00:49:03,170 --> 00:49:08,490 So what is 3, the binomial number 3 with 2? 668 00:49:08,490 --> 00:49:10,550 I've forgotten how to say that. 669 00:49:10,550 --> 00:49:11,660 I'm ashamed to admit. 670 00:49:11,660 --> 00:49:13,380 3 choose 2, thanks. 671 00:49:13,380 --> 00:49:14,550 I knew there was a good way. 672 00:49:14,550 --> 00:49:15,890 So what is 3 choose 2? 673 00:49:18,700 --> 00:49:23,450 Well, put in 3, and 2 is in there already, 674 00:49:23,450 --> 00:49:28,520 so that'd be 6 over 1 times 2. 675 00:49:28,520 --> 00:49:29,690 This would be 3. 676 00:49:29,690 --> 00:49:30,530 Would that be 3? 677 00:49:35,520 --> 00:49:39,639 And what is 3 choose 1? 678 00:49:39,639 --> 00:49:41,120 AUDIENCE: 3. 679 00:49:41,120 --> 00:49:42,780 GILBERT STRANG: How do you know that? 680 00:49:42,780 --> 00:49:45,730 You're probably right. 681 00:49:45,730 --> 00:49:48,190 3, I think, yeah. 682 00:49:48,190 --> 00:49:52,800 Oh yeah, probably a theorem that if these add 3. 683 00:49:52,800 --> 00:49:55,340 Yeah, so I'm doing N equals 3 here. 684 00:49:55,340 --> 00:49:55,840 OK. 685 00:49:55,840 --> 00:49:57,270 So yeah, I agree. 686 00:49:57,270 --> 00:50:01,000 That's 3, and what about N choose 0? 687 00:50:01,000 --> 00:50:05,580 There you have to live with 0 factorial, 688 00:50:05,580 --> 00:50:10,000 but 0 factorial is by no means 0. 689 00:50:10,000 --> 00:50:13,200 So what is 0 factorial? 690 00:50:13,200 --> 00:50:17,470 1, yeah. 691 00:50:17,470 --> 00:50:19,510 I remember when I was an undergraduate having 692 00:50:19,510 --> 00:50:20,470 a bet on that.
693 00:50:24,890 --> 00:50:29,030 I won, but he didn't pay off. 694 00:50:29,030 --> 00:50:31,480 Yeah, so it's 3. 695 00:50:31,480 --> 00:50:35,320 This is 3 factorial over 3 factorial times 0 factorial. 696 00:50:35,320 --> 00:50:37,060 So it's 6 over 6 times 1. 697 00:50:37,060 --> 00:50:38,130 So it's 1. 698 00:50:38,130 --> 00:50:41,496 Yeah, 1 and 3 and 3 make 7. 699 00:50:41,496 --> 00:50:44,450 So that proves the formula. 700 00:50:44,450 --> 00:50:46,190 Well, it doesn't quite prove the formula, 701 00:50:46,190 --> 00:50:51,380 but the way to prove it is by an induction. 702 00:50:51,380 --> 00:51:00,910 If you like this stuff, it's the recursion that you use 703 00:51:00,910 --> 00:51:01,850 induction on, 704 00:51:01,850 --> 00:51:04,810 which is just what we did now, what we did here. 705 00:51:04,810 --> 00:51:11,130 Here comes in line number 4, and it cuts through, 706 00:51:11,130 --> 00:51:14,910 and then we just counted the 4 pieces there. 707 00:51:14,910 --> 00:51:23,250 So yeah, so let me just tell you the recursion for r of N and m. 708 00:51:23,250 --> 00:51:26,250 The number we're looking for is the number 709 00:51:26,250 --> 00:51:30,780 that we had with one less cut. 710 00:51:30,780 --> 00:51:35,310 So that's the previous count of flat pieces 711 00:51:35,310 --> 00:51:43,080 plus the number of pieces that the new cut is divided into-- 712 00:51:43,080 --> 00:51:44,310 here that number was 4. 713 00:51:44,310 --> 00:51:52,000 And that's r of N minus 1, m minus 1. 714 00:51:52,000 --> 00:51:57,790 Yeah, and I won't go further. 715 00:51:57,790 --> 00:52:02,800 Time's up, but that rule for recursion 716 00:52:02,800 --> 00:52:08,440 is proved in section 7.1, taken from the paper 717 00:52:08,440 --> 00:52:11,030 by Kleinberg and others. 718 00:52:11,030 --> 00:52:11,530 Yeah. 719 00:52:11,530 --> 00:52:14,320 So OK, I think this is-- 720 00:52:14,320 --> 00:52:16,180 I don't know what you feel.
721 00:52:16,180 --> 00:52:20,380 For me, this, like, gave me a better feeling 722 00:52:20,380 --> 00:52:25,660 that I was understanding what kind of functions we had here. 723 00:52:25,660 --> 00:52:30,430 And so then the question is-- 724 00:52:33,270 --> 00:52:35,220 with this family of functions, we 725 00:52:35,220 --> 00:52:45,880 want to choose the A's and the weights, the A's and b's, 726 00:52:45,880 --> 00:52:50,130 to match the training data. 727 00:52:50,130 --> 00:52:56,190 So then we have a problem of minimizing the total loss, 728 00:52:56,190 --> 00:53:00,480 and we have a gradient descent problem. 729 00:53:00,480 --> 00:53:04,370 So we have to find the gradient, so that's Wednesday's job. 730 00:53:04,370 --> 00:53:09,330 Wednesday's job is to find the gradient of F, 731 00:53:09,330 --> 00:53:11,260 and that's back propagation. 732 00:53:11,260 --> 00:53:11,760 Good. 733 00:53:11,760 --> 00:53:14,040 Thank you very much. 734 00:53:14,040 --> 00:53:16,790 7.1 is done.
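[Editor's note: the closed-form count and the recursion from the board can be put side by side in a few lines of code. This is my own check, not part of the lecture; the function names are mine. The formula from Section 7.1 is r(N, m) = C(N,0) + C(N,1) + ... + C(N,m), and the recursion is r(N, m) = r(N-1, m) + r(N-1, m-1).]

```python
from math import comb

def r_formula(N, m):
    # Closed form: r(N, m) = C(N,0) + C(N,1) + ... + C(N,m).
    # math.comb(N, i) returns 0 when i > N, which is the right convention here.
    return sum(comb(N, i) for i in range(m + 1))

def r_recursive(N, m):
    # The recursion from the board: the old count with one less cut,
    # plus the pieces the new cut is divided into (a count one dimension down).
    # Base cases: no folds means one piece; dimension 0 means one piece.
    if N == 0 or m == 0:
        return 1
    return r_recursive(N - 1, m) + r_recursive(N - 1, m - 1)
```

For m = 2 (folding a plane), N = 1, 2, 3, 4 folds give 2, 4, 7, 11 flat pieces, exactly the counts from the lecture, and the two functions agree for every N and m because Pascal's rule C(N, i) = C(N-1, i) + C(N-1, i-1) is the recursion term by term.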