The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Let's go. So if you want to know the subject of today's class, it's Ax = b. I got started writing down different possibilities for Ax = b, and I got carried away. It just appears all over the place for different sizes, different ranks, different situations -- nearly singular, not nearly singular. And the question is, what do you do in each case?

So can I outline my little two pages of notes here, and then pick on one or two of these topics to develop today, and a little more on Friday about Gram-Schmidt? So I won't do much, if any, of Gram-Schmidt today, but I will do the others. So the problem is Ax = b. That problem has come from somewhere. We have to produce some kind of an answer, x. So I'm going from good to bad, or easy to difficult, in this list.
Well, except for number 0, which is an answer in all cases, using the pseudoinverse that I introduced last time. So that deals with zero eigenvalues and zero singular values by saying their inverse is also zero, which is kind of wild. So we'll come back to the meaning of the pseudoinverse.

But now I want to get real, here, about different situations. So number 1 is the good, normal case, when a person has a square matrix of reasonable size, reasonable condition -- a condition number -- oh, the condition number, I should call it sigma_1 over sigma_n. It's the ratio of the largest to the smallest singular value. And let's say that's within reason, not more than 1,000 or something. Then normal, ordinary elimination is going to work, and MATLAB -- the command that would produce the answer is just backslash. So this is the normal case.

Now, the cases that follow have problems of some kind, and I guess I'm hoping that this is a sort of useful dictionary of what to do, for you and me both. So we have this case here, where we have too many equations.
So that's a pretty normal case, and we'll think mostly of solving by least squares, which leads us to the normal equation. So this is standard -- happens all the time in statistics. And I'm thinking, in the reasonable case, that would be x hat, the solution. This matrix -- A transpose A -- would be invertible and of reasonable size. So backslash would still solve that problem. Backslash doesn't require a square matrix to give you an answer. So that's the good case, where the matrix is not too big, so it's not unreasonable to form A transpose A.

Now, here's the other extreme. What's exciting for us is this is the underdetermined case. I don't have enough equations, so I have to put something more in to get a specific answer. And what makes it exciting for us is that that's typical of deep learning. There are so many weights in a deep neural network that the weights would be the unknowns. Of course, it wouldn't necessarily be linear -- it wouldn't be linear -- but still the idea's the same: we have many solutions, and we have to pick one. Or we have to pick an algorithm, and then it will find one.
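Both cases can be checked numerically. Here is a small sketch in Python with NumPy (standing in for MATLAB's backslash; the matrices are my toy examples, not from the lecture): in the overdetermined case, the least squares answer agrees with the normal equation A transpose A x hat = A transpose b, and in the underdetermined case, NumPy's `lstsq` picks out the minimum-norm solution among the many exact solutions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined: m > n, full column rank -- least squares.
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
# Normal equation: A^T A x_hat = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
assert np.allclose(x_lstsq, x_hat)

# Underdetermined: m < n, many exact solutions -- lstsq returns
# the minimum-norm one.
A = rng.standard_normal((3, 8))
b = rng.standard_normal(3)
x_min = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(A @ x_min, b)        # it solves Ax = b exactly

# Adding any null-space vector z gives another solution, but a longer one.
z = np.linalg.svd(A)[2][-1]             # a unit vector with A z ~ 0
assert np.allclose(A @ z, 0)
assert np.linalg.norm(x_min) < np.linalg.norm(x_min + z)
```

The last assertion is the point: the minimum-norm solution is orthogonal to the null space, so every other solution is strictly longer.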
So we could pick the minimum norm solution, the shortest solution. That would be an L2 answer. Or we could go to L1. And the big question that, I think, might be settled in 2018 is: does deep learning, and the iteration from stochastic gradient descent that we'll see pretty soon -- does it go to the minimum L1? Does it pick out an L1 solution? That's really an exciting math question. For a long time, it was standard to say that these deep learning AI codes are fantastic, but what are they doing? We don't know all the interior, but we -- when I say we, I don't mean I. Other people are getting there, and I'm going to tell you as much as I can about it when we get there.

So those are pretty standard cases: m = n, m greater than n, m less than n, but not crazy. Now, the second board will have more difficult problems. Usually, because they're nearly singular in some way, the columns are nearly dependent. So that would be the columns in bad condition. You just picked a terrible basis, or nature did, or somehow you got a matrix A whose columns are virtually dependent -- almost linearly dependent.
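That symptom is easy to see numerically. A tiny sketch (my example, not the lecture's): two columns that are almost parallel give a condition number sigma_1 / sigma_n far beyond the "reasonable" range of about 1,000.

```python
import numpy as np

eps = 1e-8
# Second column is almost a copy of the first: nearly dependent columns.
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + eps]])

sigma = np.linalg.svd(A, compute_uv=False)
cond = sigma[0] / sigma[-1]       # condition number: largest / smallest
assert cond > 1e7                  # wildly ill-conditioned
assert np.isclose(cond, np.linalg.cond(A))
```

The determinant here is eps, so the small singular value is on the order of eps and the condition number blows up like 1/eps.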
The inverse matrix is really big, but it exists. Then that's when you go in and you fix the columns. You orthogonalize the columns. Instead of accepting the columns A1, A2, up to An of the given matrix, you go in and you find orthonormal vectors in that column space, an orthonormal basis Q1 to Qn. And the two are connected by Gram-Schmidt. And the famous matrix statement of Gram-Schmidt is A = QR: here are the columns of A, here are the columns of Q, and there's a triangular matrix R that connects the two. So that is the central topic of Gram-Schmidt, that idea of orthogonalizing. It just appears everywhere. It appears all over Course 6 in many, many situations with different names.

So that I'm sort of saving a little bit until next time, and let me tell you why. Because just the organization of Gram-Schmidt is interesting. So Gram-Schmidt, you could do the normal way. So that's what I teach in 18.06. Just take every column as it comes. Subtract off projections onto the previous stuff. Get it orthogonal to the previous guys. Normalize it to be a unit vector. Then you've got that column.
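Here is a minimal sketch of that column-at-a-time (classical) Gram-Schmidt, assuming A has independent columns: each column has its projections onto the previous q's subtracted off, then gets normalized, and the coefficients land in an upper triangular R with A = QR.

```python
import numpy as np

def gram_schmidt(A):
    """Classical Gram-Schmidt: columns of A -> orthonormal columns of Q,
    with upper triangular R so that A = Q R."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):                  # subtract projections on previous q's
            R[i, j] = Q[:, i] @ A[:, j]
            v -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)         # normalize to a unit vector
        Q[:, j] = v / R[j, j]
    return Q, R

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
Q, R = gram_schmidt(A)
assert np.allclose(Q.T @ Q, np.eye(3))      # orthonormal columns
assert np.allclose(Q @ R, A)                # A = QR
assert np.allclose(R, np.triu(R))           # R is upper triangular
```

This is the "normal way" the lecture describes; the reordered, column-pivoting version is the topic promised for next time.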
Go on. So I say that again, and then I'll say it again two days from now. So Gram-Schmidt, the idea is you take the columns -- you say the second orthogonal vector, Q2, will be some combination of columns 1 and 2, orthogonal to the first. Lots to do.

And there's another order, which is really the better order to do Gram-Schmidt, and it allows you to do column pivoting. So this is my topic for next time, to see Gram-Schmidt more carefully. Column pivoting means the columns might not come in a good order, so you allow yourself to reorder them. We know that you have to do that for elimination. In elimination, it would be rows. So in elimination, we would have the matrix A, and we take the first row as the first pivot row, and then the second row, and then the third row. But if the pivot is too small, then reorder the rows. So it's row ordering that comes up in elimination. And MATLAB just systematically says, OK, that's the pivot that's coming up. The third pivot comes up out of the third row.
But MATLAB says: look down that whole third column for a better pivot, a bigger pivot. Switch to it with a row exchange. So there are lots of permutations then. You end up with something there that permutes the rows, and then that gets factored into LU. So I'm saying something about elimination that's just sort of a side comment: you would never do elimination without considering the possibility of row exchanges. And then this is Gram-Schmidt orthogonalization. So this is the LU world. Here is the QR world, and here it happens to be columns that you're permuting. So that's coming.

This is section 2.2, now. But there's more. 2.2 has quite a bit in it, including number 0, the pseudoinverse, and including some of these things. Actually, this will also be in 2.2. And maybe this is what I'm saying more about today. So I'll put a little star for today, here. What do you do? So this is a case where the matrix is nearly singular. You're in danger. Its inverse is going to be big -- unreasonably big.
And I wrote "inverse problems" there, because an inverse problem is a type of problem, an application, that you often need to solve, that engineering and science have to solve. So I'll just say a little more about that, but that's a typical application in which you're nearly singular. Your matrix isn't good enough to invert. Well, of course, you could always say, well, I'll just use the pseudoinverse, but numerically, that's like cheating. You've got to get in there and do something about it. So inverse problems would be examples.

Actually, as I write that, I think that would be a topic that I should add to the list of potential topics for a three-week project. Look up a book on inverse problems. So what do I mean by an inverse problem? I'll just finish this thought. What's an inverse problem? Typically, you know about a system -- say a network, an RLC network -- and you give it a voltage or current. You give it an input, and you find the output. You find out what current flows, what the voltages are. But inverse problems are: suppose you know the response to different voltages.
What was the network? You see the problem? Let me say it again. Discover what the network is from its outputs. So that turns out, typically, to be a problem that gives nearly singular matrices. That's a difficult problem. A lot of nearby networks would give virtually the same output. So you have a matrix that's nearly singular. It's got singular values very close to 0. What do you do then?

Well, the world of inverse problems thinks of adding a penalty term, some kind of a penalty term. When I minimize this thing just by itself, in the usual way, A transpose A has a giant inverse. The matrix A is badly conditioned. It takes vectors almost to 0. So that A transpose A has got a giant inverse, and you're at risk of losing everything to roundoff. So this is the solution. You could call it a cheap solution, but everybody uses it. So I won't put that word on the videotape. But that sort of resolves the problem -- well, it shifts the problem, anyway, to: what number? What should be the penalty? How much should you penalize it?
You see, by adding that, you're going to make it invertible. And if you make this bigger, and bigger, and bigger, it's more and more well-conditioned. It resolves the trouble, here. And today I'm going to do more with that. So with that, I'll stop there and pick it up after saying something about 6 and 7.

I hope this is helpful. It was helpful to me, certainly, to see all these possibilities and to write down what the symptom is. It's like being a linear equation doctor. You look for the symptoms, and then you propose something at CVS that works or doesn't work. But you do something about it.

So when the problem is too big -- up to now, the problems have not been giant, out of core. But now, when it's too big -- maybe it's still in core, but really big -- then this is in 2.1. So that's to come back to. The word I could have written in here, if I was just going to write one word, would be iteration. Iterative methods, meaning you take a step, like -- the conjugate gradient method is the hero of iterative methods.
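A bare-bones sketch of that hero, for a symmetric positive definite system (my toy matrix, not the lecture's): each step takes a new search direction, and the iterates close in on the solution of Ax = b without ever factoring A.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve Ax = b for symmetric positive definite A by conjugate gradients."""
    n = len(b)
    if max_iter is None:
        max_iter = 5 * n        # a few times n, to be safe in floating point
    x = np.zeros(n)
    r = b - A @ x               # residual
    p = r.copy()                # search direction
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x

rng = np.random.default_rng(2)
M = rng.standard_normal((20, 20))
A = M.T @ M + 20 * np.eye(20)   # symmetric positive definite, well conditioned
b = rng.standard_normal(20)
x = conjugate_gradient(A, b)
assert np.allclose(A @ x, b)
```

In exact arithmetic this finishes in at most n steps; in practice, on a well-conditioned matrix like this one, it gets "pretty close, pretty fast," which is exactly the selling point the lecture mentions.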
And then that name I erased is Krylov, and there are other names associated with iterative methods. So that's the section that we passed over just to get rolling, but we'll come back to it. So with that one, you never get the exact answer, but you get closer and closer. If the iterative method is successful, like conjugate gradients, you get pretty close, pretty fast. And then you say, OK, I'll take it.

And then finally, way too big -- like, nowhere. You're not in core. Your matrix -- you just have a giant, giant problem, which, of course, is happening these days. And then, one way to do it: you can't even look at the matrix A, much less A transpose. A transpose would be unthinkable. You couldn't do it in a year. So randomized linear algebra has popped up, and the idea there, which we'll see, is to use probability -- to sample the matrix and work with your samples. So if the matrix is way too big, but not too crazy, so to speak, then you could sample the columns and the rows, and get an answer from the sample.
See, if I sample the columns of a matrix, I'm getting -- so what does sampling mean? Let me just complete this -- say, add a little to this thought. Sample a matrix. So I have a giant matrix A. It might be sparse, of course. I didn't distinguish, over there, the sparse ones. That would be another thing. So if I just take random x's -- more than one, but not the full n dimensions -- those products Ax will give me random guys in the column space. And if the matrix is reasonable, it won't take too many to have a pretty reasonable idea of what that column space is like, along with the right-hand side.

So this world of randomized linear algebra has grown because it had to. And of course, any statement can never say for sure you're going to get the right answer, but using the inequalities of probability, you can often say that the chance of being way off is less than 1 in 2 to the 20th, or something. So the answer is, in reality, you get a good answer. That is the end of this chapter, 2.4. So this is all chapter 2, really. The iterative methods are in 2.1. Most of this is in 2.2.
Big is 2.3, and then really big is randomized, in 2.4. So now, where are we? You were going to let me know -- or not -- if this is useful to see. But you sort of see what real-life problems are. And of course, we're highly, especially interested in getting to the deep learning examples, which are underdetermined. When you're underdetermined, you've got many solutions, and the question is, which one is a good one? And in deep learning -- I just can't resist saying another word.

So there are many solutions. What to do? Well, you pick some algorithm, like steepest descent, which is going to find a solution. So you hope it's a good one. And what does a good one mean, versus a not-good one? They're all solutions. A good one means that when you apply it to the test data that you haven't yet seen, it gives good results on the test data. The solution has learned something from the training data, and it works on the test data. So that's the big question in deep learning.
How does it happen that, by doing gradient descent or whatever algorithm -- how does that algorithm bias the solution? It's called implicit bias. How does that algorithm bias the solution toward a solution that generalizes, that works on test data? And you can think of algorithms which would approach a solution that did not work on test data. So that's what you want to stay away from. You want the ones that work. So there are very deep math questions there, which are kind of new. They didn't arise until they did. And we'll try to say some of what's being understood.

Can I focus now, for probably the rest of today, on this case, when the matrix is nearly singular? So you could apply elimination, but it would give a poor result. So one solution is the SVD. I haven't even mentioned the SVD here as an algorithm, but of course, it is. The SVD gives you an answer. Boy, where should that have gone? Well, the space over here -- the SVD. So that produces -- you have A = U Sigma V transpose, and then A inverse is V Sigma inverse U transpose. So we're in the case, here.
We're talking about number 5: nearly singular, where Sigma has some very small singular values. Then Sigma inverse has some very big singular values. So you're really in wild territory here, with very big inverses. So that would be one way to do it. But this is a way to regularize the problem. So let's just pay attention to that.

So suppose I minimize the sum of A x minus b squared and delta squared times the size of x squared. And I'm going to use the L2 norm. It's going to be least squares with a penalty, so of course it's the L2 norm here, too.
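As a numerical sketch of that penalized minimization (my toy data; this is the standard ridge, or Tikhonov, setup): stacking the penalty into an ordinary least squares problem gives the same minimizer as the penalized normal equation, and at the minimizer the gradient of the objective vanishes.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))
b = rng.standard_normal(6)
delta = 0.1

# Augmented least squares: stack A on top of delta*I, and b on top of 0.
A_star = np.vstack([A, delta * np.eye(4)])
b_star = np.concatenate([b, np.zeros(4)])
x = np.linalg.lstsq(A_star, b_star, rcond=None)[0]

# Same x from the penalized normal equation (A^T A + delta^2 I) x = A^T b.
x_ridge = np.linalg.solve(A.T @ A + delta**2 * np.eye(4), A.T @ b)
assert np.allclose(x, x_ridge)

# First-order optimality: the gradient of ||Ax - b||^2 + delta^2 ||x||^2,
# which is 2 A^T (Ax - b) + 2 delta^2 x, vanishes at the minimizer.
grad = A.T @ (A @ x - b) + delta**2 * x
assert np.allclose(grad, 0)
```

With delta > 0 the matrix A transpose A plus delta squared I is positive definite, so the solve never fails, however close to singular A is.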
Suppose I solve that for a delta -- for some delta. I have to choose a positive delta. And when I choose a positive delta, then I have a solvable problem. Even if this goes to 0, or A does crazy things, this is going to keep me away from singular. In fact, what equation does that lead to? So that's a least squares problem with an extra penalty term.

So it would come -- I suppose, let's see -- if I write the equations [A; delta I] x = [b; 0], maybe that gives the least squares equation -- the usual normal equation -- for this augmented system. Because what's the error here? This is the new big A -- A star, let's say -- times x equals the new b. So if I apply least squares to that, what do I do? I minimize the sum of squares. So least squares would minimize A x minus b squared -- that would be from the first components -- and delta squared x squared from the last components, which is exactly what we said we were doing. So in a way, this is the equation that the penalty method is solving.

And one question, naturally, is: what should delta be? Well, that question's beyond us today. It's a balance of what you can believe, and how much noise is in the system, and everything. That choice of delta -- what we could ask is a math question: what happens as delta goes to 0? So suppose I solve this problem. Let's see, I could write it differently. What would be the equation, here?
This part would give us the A transpose A, and then this part would give us just delta squared times the identity: (A transpose A + delta squared I) x = A transpose b, I think. Wouldn't that be it? So really, what I've written here is A star transpose A star. Least squares on the augmented system gives that equation. So all of those are equivalent. All of those would be equivalent statements of the penalized problem that you're solving.

And then the question is: as delta goes to 0, what happens? Of course, something. When delta goes to 0, you're falling off the cliff. Something quite different is suddenly going to happen there. Maybe we could even understand this question with a 1 by 1 matrix. I think this section starts with a 1 by 1. Suppose A is just a number. Maybe I'll just put that on this board, here. Suppose A is just a number. So what am I going to call that number? Just 1 by 1. Let me call it sigma, because it's certainly the leading singular value. So what's my equation that I'm solving?
441 00:30:10,130 --> 00:30:15,500 A transpose A would be sigma squared plus delta squared, 1 442 00:30:15,500 --> 00:30:18,350 by 1, x-- 443 00:30:18,350 --> 00:30:20,690 should I give some subscript here? 444 00:30:20,690 --> 00:30:23,960 I should, really, to do it right. 445 00:30:23,960 --> 00:30:26,750 This is the solution for a given delta. 446 00:30:32,150 --> 00:30:33,700 So that solution will exist. 447 00:30:33,700 --> 00:30:34,390 Fine. 448 00:30:34,390 --> 00:30:36,670 This matrix is certainly invertible. 449 00:30:36,670 --> 00:30:40,460 That's positive semidefinite, at least. 450 00:30:40,460 --> 00:30:42,320 That's positive semidefinite, and then what 451 00:30:42,320 --> 00:30:45,530 about delta squared I? 452 00:30:45,530 --> 00:30:49,160 It is positive definite, of course. 453 00:30:49,160 --> 00:30:52,680 It's just the identity with a factor. 454 00:30:52,680 --> 00:30:55,370 So this is a positive definite matrix. 455 00:30:55,370 --> 00:30:57,470 I certainly have a solution. 456 00:30:57,470 --> 00:31:01,500 And let me keep going on this 1 by 1 case. 457 00:31:01,500 --> 00:31:03,180 This would be A transpose. 458 00:31:03,180 --> 00:31:04,700 A is just a sigma. 459 00:31:04,700 --> 00:31:06,890 I think it's just sigma b. 460 00:31:11,710 --> 00:31:17,890 So A is 1 by 1, and there are two cases, here-- 461 00:31:17,890 --> 00:31:25,230 Sigma bigger than 0, or sigma equals 0. 462 00:31:25,230 --> 00:31:28,290 And in either case, I just want to know what's the limit. 463 00:31:28,290 --> 00:31:32,310 So the answer x-- 464 00:31:32,310 --> 00:31:34,490 let me just take the right hand side. 465 00:31:34,490 --> 00:31:35,370 Well, that's fine. 466 00:31:39,140 --> 00:31:42,810 Am I computing OK? 
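The equivalence just set up — ordinary least squares on the augmented "A star" system versus the normal equation with the delta squared penalty — can be checked numerically. A minimal sketch in NumPy; the random A, b, and the value of delta are made up for illustration, not from the lecture:

```python
import numpy as np

# Penalized least squares two ways, on a made-up example.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))      # a small tall matrix
b = rng.standard_normal(6)
delta = 0.5
n = A.shape[1]

# Route 1: plain least squares on the augmented system
#   [A; delta*I] x = [b; 0]   (the "A star", "b star" system).
A_star = np.vstack([A, delta * np.eye(n)])
b_star = np.concatenate([b, np.zeros(n)])
x_aug, *_ = np.linalg.lstsq(A_star, b_star, rcond=None)

# Route 2: the normal equation (A^T A + delta^2 I) x = A^T b.
x_pen = np.linalg.solve(A.T @ A + delta**2 * np.eye(n), A.T @ b)

print(np.allclose(x_aug, x_pen))     # the two routes agree
```

Minimizing A x minus b squared plus delta squared x squared is exactly least squares on the stacked system, which is why the two computations match.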
467 00:31:42,810 --> 00:31:47,580 Using the penalized thing on a 1 by 1 problem, which you could 468 00:31:47,580 --> 00:31:50,910 say is a little bit small-- 469 00:31:50,910 --> 00:32:00,620 so solving this equation or equivalently minimizing this, 470 00:32:00,620 --> 00:32:03,151 so here, I'm finding the minimum of-- 471 00:32:07,590 --> 00:32:14,030 A was sigma, so it's sigma x minus b squared plus delta squared x squared. 472 00:32:18,890 --> 00:32:20,620 You see it's just 1 by 1? 473 00:32:20,620 --> 00:32:21,400 Just a number. 474 00:32:21,400 --> 00:32:25,480 And I'm hoping that calculus will agree with linear algebra 475 00:32:25,480 --> 00:32:29,060 here, that if I find the minimum of this-- 476 00:32:29,060 --> 00:32:31,030 so let me write it out. 477 00:32:31,030 --> 00:32:36,170 Sigma squared x squared and delta squared x squared, 478 00:32:36,170 --> 00:32:42,820 and then minus 2 sigma xb, and then plus b squared. 479 00:32:42,820 --> 00:32:46,060 And now, I'm going to find the minimum, which means 480 00:32:46,060 --> 00:32:48,490 I'd set the derivative to 0. 481 00:32:48,490 --> 00:32:51,430 So I get 2 sigma squared and 2 delta squared. 482 00:32:51,430 --> 00:32:55,780 I get a 2 here, and this gives me 483 00:32:55,780 --> 00:32:57,850 the x derivative as 2 sigma b. 484 00:32:57,850 --> 00:33:00,490 So I get a 2 there, and I'm OK. 485 00:33:00,490 --> 00:33:06,610 I just cancel both 2s, and that's the equation. 486 00:33:06,610 --> 00:33:09,840 So I can solve that equation. 487 00:33:09,840 --> 00:33:19,110 x is sigma over sigma squared plus delta squared, times b. 488 00:33:19,110 --> 00:33:22,260 So it's really that quantity. 489 00:33:22,260 --> 00:33:25,230 I want to let delta go to 0. 490 00:33:28,850 --> 00:33:31,960 So again, what am I doing here? 491 00:33:31,960 --> 00:33:34,710 I'm taking a 1 by 1 example just to see 492 00:33:34,710 --> 00:33:42,840 what happens in the limit as delta goes to 0. 493 00:33:42,840 --> 00:33:45,600 What happens? 
494 00:33:45,600 --> 00:33:48,300 So I just have to look at that. 495 00:33:48,300 --> 00:33:54,130 What is the limit of that thing in a circle, as delta 496 00:33:54,130 --> 00:33:55,150 goes to 0? 497 00:33:55,150 --> 00:33:58,090 So I'm finding out for a 1 by 1 problem what 498 00:33:58,090 --> 00:34:04,390 a penalized least squares problem, ridge regression, 499 00:34:04,390 --> 00:34:05,860 all over the place-- 500 00:34:05,860 --> 00:34:07,630 what happens? 501 00:34:07,630 --> 00:34:12,690 So what happens to that number as delta goes to 0? 502 00:34:15,400 --> 00:34:17,659 1 over sigma. 503 00:34:17,659 --> 00:34:21,670 So now, let delta go to 0. 504 00:34:21,670 --> 00:34:27,159 So that approaches 1 over sigma, because delta disappears. 505 00:34:27,159 --> 00:34:29,570 Sigma over sigma squared, 1 over sigma. 506 00:34:29,570 --> 00:34:34,590 So it approaches the inverse, but what's 507 00:34:34,590 --> 00:34:37,100 the other possibility, here? 508 00:34:37,100 --> 00:34:41,380 The other possibility is that sigma is 0. 509 00:34:41,380 --> 00:34:44,719 I didn't say whether this matrix, this 1 by 1 matrix, 510 00:34:44,719 --> 00:34:46,909 was invertible or not. 511 00:34:46,909 --> 00:34:53,500 If sigma is not 0, then I go to 1 over sigma. 512 00:34:53,500 --> 00:34:57,330 If sigma is really small, it will take a while. 513 00:34:57,330 --> 00:35:00,930 Delta will have to get small, small, small, even compared 514 00:35:00,930 --> 00:35:04,230 to sigma, until finally, that term goes away, 515 00:35:04,230 --> 00:35:06,000 and I just have 1 over sigma. 516 00:35:06,000 --> 00:35:09,390 But what if sigma is 0? 517 00:35:09,390 --> 00:35:14,410 Sorry to get excited about 0. 518 00:35:14,410 --> 00:35:16,970 Who would get excited about 0? 519 00:35:16,970 --> 00:35:20,840 So this is the case when this is 1 over sigma, 520 00:35:20,840 --> 00:35:23,000 if sigma is positive. 521 00:35:23,000 --> 00:35:25,325 And what does it approach if sigma is 0? 
522 00:35:28,080 --> 00:35:29,770 0! 523 00:35:29,770 --> 00:35:32,400 Because this is 0, the whole problem 524 00:35:32,400 --> 00:35:34,810 just disappeared, here. 525 00:35:34,810 --> 00:35:37,400 The sigma was 0. 526 00:35:37,400 --> 00:35:39,570 Here is a sigma. 527 00:35:39,570 --> 00:35:48,430 So anyway, if sigma is 0, then I'm getting 0 all the time. 528 00:35:48,430 --> 00:35:50,400 But I have a decent problem, because the delta 529 00:35:50,400 --> 00:35:51,940 squared is there. 530 00:35:51,940 --> 00:35:53,920 I have a decent problem until the last minute. 531 00:35:53,920 --> 00:35:55,090 My problem falls apart. 532 00:35:55,090 --> 00:35:58,660 Delta goes to 0, and I have a 0 equals 0 problem. 533 00:35:58,660 --> 00:35:59,440 I'm lost. 534 00:35:59,440 --> 00:36:03,370 But the point is the penalty kept me positive. 535 00:36:03,370 --> 00:36:07,300 It kept me with this delta squared term 536 00:36:07,300 --> 00:36:10,890 until the last critical moment. 537 00:36:10,890 --> 00:36:14,260 It kept me positive even if that was 0. 538 00:36:14,260 --> 00:36:19,600 If that is 0, and this is 0, I still have something here. 539 00:36:19,600 --> 00:36:22,030 I still have a problem to solve. 540 00:36:22,030 --> 00:36:24,430 And what's the limit then? 541 00:36:24,430 --> 00:36:29,050 So 1 over sigma if sigma is positive. 542 00:36:29,050 --> 00:36:32,720 And what's the answer if sigma is not positive? 543 00:36:32,720 --> 00:36:34,690 It's 0. 544 00:36:34,690 --> 00:36:36,700 Just tell me. 545 00:36:36,700 --> 00:36:38,260 I'm getting 0. 546 00:36:38,260 --> 00:36:40,750 I get 0 all the way, and I get 0 in the limit. 547 00:36:47,080 --> 00:36:53,450 And now, let me just ask, what have I got here? 548 00:36:53,450 --> 00:36:59,870 What is this sudden bifurcation? 549 00:36:59,870 --> 00:37:02,000 Do I recognize this? 
550 00:37:02,000 --> 00:37:06,710 The inverse in the limit as delta goes to 0 551 00:37:06,710 --> 00:37:11,120 is either 1 over sigma, if that makes sense, 552 00:37:11,120 --> 00:37:14,090 or it's 0, which is not like 1 over sigma. 553 00:37:14,090 --> 00:37:16,970 1 over sigma-- as sigma goes to 0, 554 00:37:16,970 --> 00:37:19,130 this thing is getting bigger and bigger. 555 00:37:19,130 --> 00:37:22,850 But at sigma equals 0, it's 0. 556 00:37:22,850 --> 00:37:27,230 You see, that's a really strange kind of a limit. 557 00:37:30,560 --> 00:37:32,810 Now, it would be over there. 558 00:37:32,810 --> 00:37:38,910 What have I found here, in this limit? 559 00:37:38,910 --> 00:37:40,950 Say it again, because that was exactly right. 560 00:37:40,950 --> 00:37:43,230 The pseudo inverse. 561 00:37:43,230 --> 00:37:49,290 So this system-- choose delta greater than 0, 562 00:37:49,290 --> 00:37:51,810 then delta going to 0. 563 00:37:51,810 --> 00:37:55,710 The solution goes to the pseudo inverse. 564 00:38:00,360 --> 00:38:01,680 That's the key fact. 565 00:38:05,240 --> 00:38:07,980 When delta is really, really small, 566 00:38:07,980 --> 00:38:12,440 then this behaves in a pretty crazy way. 567 00:38:12,440 --> 00:38:18,770 If delta is really, really small, then sigma is bigger, 568 00:38:18,770 --> 00:38:20,300 or it's 0. 569 00:38:20,300 --> 00:38:22,140 If it's bigger, you go this way. 570 00:38:22,140 --> 00:38:23,560 If it's 0, you go that way. 571 00:38:27,850 --> 00:38:32,972 So that's the message, and this is penalized 572 00:38:39,240 --> 00:38:45,550 least squares. As the penalty gets smaller and smaller, 573 00:38:45,550 --> 00:38:50,070 it approaches the correct answer, the always correct answer, 574 00:38:50,070 --> 00:38:54,600 with that sudden split between 0 and not 0 575 00:38:54,600 --> 00:39:01,020 that we associate with the pseudo inverse. 
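The sudden split in that limit is easy to watch numerically. A 1 by 1 sketch only; the numbers sigma = 3, b = 2, and the shrinking deltas are made up for illustration:

```python
# x(delta) = sigma*b / (sigma^2 + delta^2): watch the limit as delta -> 0.
b = 2.0

sigma = 3.0                          # case 1: sigma > 0
for delta in [1e-1, 1e-4, 1e-8]:
    x = sigma * b / (sigma**2 + delta**2)
print(abs(x - b / sigma) < 1e-12)    # True: x approaches (1/sigma) * b

sigma = 0.0                          # case 2: sigma = 0
x = sigma * b / (sigma**2 + 1e-8**2)
print(x == 0.0)                      # True: x is 0 for every delta, so the limit is 0
```

For positive sigma the answer tends to 1 over sigma times b; for sigma equal to 0 it is 0 all the way, which is exactly the bifurcation described above.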
576 00:39:01,020 --> 00:39:04,340 Of course, in a practical case, you're 577 00:39:04,340 --> 00:39:09,860 trying to find the resistances and inductances in a circuit 578 00:39:09,860 --> 00:39:16,460 by trying the circuit, and looking at the output b, 579 00:39:16,460 --> 00:39:18,890 and figuring out what input. 580 00:39:21,810 --> 00:39:29,100 So the unknown x is the unknown system parameters. 581 00:39:29,100 --> 00:39:33,780 Not the voltage and current, but the resistance, and inductance, 582 00:39:33,780 --> 00:39:34,953 and capacitance. 583 00:39:42,251 --> 00:39:46,200 I've only proved that in the 1 by 1 case. 584 00:39:46,200 --> 00:39:49,790 You may say that's not much of a proof. 585 00:39:49,790 --> 00:39:56,840 In the 1 by 1 case, we can see it happen in front of our eyes. 586 00:39:56,840 --> 00:40:01,820 So really, a step I haven't taken here 587 00:40:01,820 --> 00:40:07,320 is to complete that to any matrix A. 588 00:40:07,320 --> 00:40:10,080 So that's the statement, then. 589 00:40:10,080 --> 00:40:11,342 That's the statement. 590 00:40:20,500 --> 00:40:21,610 So that's the statement. 591 00:40:21,610 --> 00:40:29,820 For any matrix A, this matrix, A transpose A plus delta 592 00:40:29,820 --> 00:40:35,010 squared I, inverse, times A transpose-- 593 00:40:35,010 --> 00:40:37,455 that's the solution matrix to our problem. 594 00:40:40,700 --> 00:40:42,440 That's what I wrote down up there. 595 00:40:42,440 --> 00:40:45,560 I take the inverse and pop it over there. 596 00:40:45,560 --> 00:40:54,020 That approaches A plus, the pseudo inverse. 597 00:40:59,480 --> 00:41:02,380 And that's what we just checked for 1 by 1. 598 00:41:02,380 --> 00:41:07,220 For 1 by 1, this was sigma over sigma 599 00:41:07,220 --> 00:41:09,200 squared plus delta squared. 600 00:41:09,200 --> 00:41:18,100 And it went either to 1 over sigma or to 0. 601 00:41:18,100 --> 00:41:20,370 It split in the limit. 602 00:41:20,370 --> 00:41:23,460 It shows that limits can be delicate. 
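For a general matrix the statement can at least be tested, even without the proof. A sketch with a deliberately rank-deficient A (made up here), where A transpose A by itself is singular but the penalized version is invertible:

```python
import numpy as np

# (A^T A + delta^2 I)^{-1} A^T  should approach  A^+  as delta -> 0.
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])           # rank 1, so A^T A is singular

n = A.shape[1]
delta = 1e-3                          # small, but not yet at the cliff edge
approx = np.linalg.solve(A.T @ A + delta**2 * np.eye(n), A.T)

# Compare with the pseudo inverse, which NumPy computes from the SVD.
print(np.allclose(approx, np.linalg.pinv(A), atol=1e-6))   # True
```

With delta exactly 0 the solve would fail; the delta squared term keeps the problem positive definite until the limit is taken.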
603 00:41:23,460 --> 00:41:26,550 The limit-- as delta goes to 0, this thing 604 00:41:26,550 --> 00:41:28,920 is suddenly discontinuous. 605 00:41:28,920 --> 00:41:31,200 It's this number that is growing, 606 00:41:31,200 --> 00:41:35,100 and then suddenly, at 0, it falls back to 0. 607 00:41:35,100 --> 00:41:38,460 Anyway, that would be the statement. 608 00:41:38,460 --> 00:41:42,810 Actually, statisticians discovered the pseudo inverse 609 00:41:42,810 --> 00:41:49,620 independently of the linear algebra history of it, 610 00:41:49,620 --> 00:41:54,330 because statisticians did exactly that. 611 00:41:54,330 --> 00:41:58,620 To regularize the problem, they introduced a penalty 612 00:41:58,620 --> 00:42:01,390 and worked with this matrix. 613 00:42:01,390 --> 00:42:07,980 So statisticians were the first to think 614 00:42:07,980 --> 00:42:13,642 of that as a natural thing to do in a practical case-- 615 00:42:13,642 --> 00:42:14,225 add a penalty. 616 00:42:19,020 --> 00:42:23,130 So this is adding a penalty, but remember 617 00:42:23,130 --> 00:42:30,730 that we stayed with L2 norms, staying with L2, least squares. 618 00:42:38,000 --> 00:42:41,090 We could ask, what happens? 619 00:42:41,090 --> 00:42:45,050 Suppose the penalty is the L1 norm. 620 00:42:48,200 --> 00:42:50,840 I'm not up to do this today. 621 00:42:50,840 --> 00:42:52,850 Suppose I minimize that. 622 00:42:52,850 --> 00:43:01,666 Maybe I'll do L2, but I'll do the penalty guy in the L1 norm. 623 00:43:07,850 --> 00:43:11,000 I'm certainly not an expert on that. 624 00:43:11,000 --> 00:43:15,290 Or you could even think just that power. 625 00:43:15,290 --> 00:43:18,680 So that would have a name. 626 00:43:18,680 --> 00:43:21,590 A statistician invented this. 627 00:43:21,590 --> 00:43:27,100 It's called the Lasso in the L1 norm, and it's a big deal. 628 00:43:27,100 --> 00:43:36,600 Statisticians like the L1 norm, because it 629 00:43:36,600 --> 00:43:38,010 gives sparse solutions. 
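The 1 by 1 version of the lasso can be written down in closed form, and it shows where the sparsity comes from. This soft-thresholding formula is the standard textbook solution, sketched here as an aside rather than taken from the lecture; the numbers are made up:

```python
import numpy as np

# Minimize (sigma*x - b)^2 + delta*|x| for a single unknown x.
# Setting the derivative to 0 on each side of x = 0 gives a
# soft threshold: shrink toward 0, and stop at 0 exactly.
def lasso_1by1(sigma, b, delta):
    t = sigma * b                             # unpenalized right-hand side
    return np.sign(t) * max(abs(t) - delta / 2.0, 0.0) / sigma**2

print(lasso_1by1(1.0, 2.0, 0.5))   # 1.75 -- shrunk a little, still nonzero
print(lasso_1by1(1.0, 0.2, 0.5))   # 0.0  -- small coefficients go exactly to 0
```

The delta squared x squared penalty shrinks every component a little; the |x| penalty pushes small components exactly to zero, which is the sparsity statisticians like.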
630 00:43:38,010 --> 00:43:41,700 It gives more genuine solutions without a whole lot 631 00:43:41,700 --> 00:43:46,320 of little components in the answer. 632 00:43:46,320 --> 00:43:48,660 So this was an important step. 633 00:43:52,780 --> 00:43:56,695 Let me just say again where we are in that big list. 634 00:44:00,910 --> 00:44:04,970 The two important ones that I haven't done yet 635 00:44:04,970 --> 00:44:08,900 are these iterative methods in 2.1. 636 00:44:08,900 --> 00:44:12,590 So that's like conventional linear algebra, 637 00:44:12,590 --> 00:44:15,530 just how to deal with a big matrix, 638 00:44:15,530 --> 00:44:17,390 maybe with some special structure. 639 00:44:17,390 --> 00:44:21,740 That's what numerical linear algebra is all about. 640 00:44:21,740 --> 00:44:27,790 And then Gram-Schmidt with or without pivoting, 641 00:44:27,790 --> 00:44:32,410 which is a workhorse of numerical computing, 642 00:44:32,410 --> 00:44:37,350 and I think I better save that for next time. 643 00:44:37,350 --> 00:44:42,870 So this is the one I picked for this time. 644 00:44:42,870 --> 00:44:47,010 And we saw what happened in L2. 645 00:44:47,010 --> 00:44:49,290 Well, we saw it for 1 by 1. 646 00:44:49,290 --> 00:44:56,740 Would you want to extend to prove this for any A, 647 00:44:56,740 --> 00:45:00,210 going beyond 1 by 1? 648 00:45:00,210 --> 00:45:05,735 How would you prove such a thing for any A? 649 00:45:05,735 --> 00:45:10,710 I guess I'm not going to do it. 650 00:45:10,710 --> 00:45:17,930 It's too painful, but how would you do it? 651 00:45:17,930 --> 00:45:20,330 You would use the SVD. 652 00:45:20,330 --> 00:45:23,780 If you want to prove something about matrices, about 653 00:45:23,780 --> 00:45:28,230 any matrix, the SVD is the best thing 654 00:45:28,230 --> 00:45:30,210 you could have-- the best tool you could have. 655 00:45:30,210 --> 00:45:34,810 I can write this in terms of the SVD. 
656 00:45:34,810 --> 00:45:40,830 I just plug in A equals whatever the SVD tells 657 00:45:40,830 --> 00:45:41,850 me to put in there. 658 00:45:41,850 --> 00:45:46,600 U sigma V transpose. 659 00:45:46,600 --> 00:45:50,800 Plug it in there, simplify it using the fact 660 00:45:50,800 --> 00:45:54,370 that these are orthogonal. 661 00:45:54,370 --> 00:45:58,000 If I have any good luck, I'll get an identity 662 00:45:58,000 --> 00:46:01,110 somewhere from there and an identity somewhere from there. 663 00:46:04,400 --> 00:46:05,900 And it will all simplify. 664 00:46:05,900 --> 00:46:09,480 It will all diagonalize. 665 00:46:09,480 --> 00:46:13,710 What the SVD really does is turn my messy problem 666 00:46:13,710 --> 00:46:17,310 into a problem about the diagonal matrix, sigma, 667 00:46:17,310 --> 00:46:18,180 in the middle. 668 00:46:18,180 --> 00:46:20,330 So I might as well put sigma in the middle. 669 00:46:20,330 --> 00:46:21,420 Yeah, why not? 670 00:46:21,420 --> 00:46:23,627 Before we give up on it-- 671 00:46:26,970 --> 00:46:32,340 a special case of that, but really, the genuine case 672 00:46:32,340 --> 00:46:34,350 would be when A is sigma. 673 00:46:34,350 --> 00:46:41,580 Sigma transpose sigma plus delta squared I inverse times 674 00:46:41,580 --> 00:46:49,820 sigma transpose approaches the pseudo inverse, sigma plus. 675 00:46:49,820 --> 00:46:52,540 And the point is the matrix sigma here is diagonal. 676 00:46:55,840 --> 00:46:59,920 Oh, I'm practically there, actually. 677 00:46:59,920 --> 00:47:06,390 Why am I close to being able to read this off? 678 00:47:06,390 --> 00:47:08,580 Well, everything is diagonal here. 679 00:47:08,580 --> 00:47:10,320 Diagonal, diagonal, diagonal. 680 00:47:13,340 --> 00:47:16,085 And what's happening on those diagonal entries? 
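That separation can be stated as an identity and checked directly: plugging A = U Sigma V transpose into the penalized solution matrix moves U and V transpose to the outside, leaving the diagonal problem in the middle. A sketch with a made-up A:

```python
import numpy as np

# Identity behind the proof sketch:
#   (A^T A + d^2 I)^{-1} A^T  =  V (S^T S + d^2 I)^{-1} S^T U^T,
# so the penalized inverse acts one singular value at a time.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
delta = 0.3
m, n = A.shape

U, s, Vt = np.linalg.svd(A)          # full SVD: U is 5x5, Vt is 3x3
S = np.zeros((m, n))
S[:n, :n] = np.diag(s)               # rectangular diagonal Sigma

lhs = np.linalg.solve(A.T @ A + delta**2 * np.eye(n), A.T)
rhs = Vt.T @ np.linalg.solve(S.T @ S + delta**2 * np.eye(n), S.T) @ U.T
print(np.allclose(lhs, rhs))         # True: U and V carry no real work
```

Each diagonal entry of the middle factor is sigma over sigma squared plus delta squared, which is exactly the 1 by 1 case.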
681 00:47:20,690 --> 00:47:25,330 So you had to take my word that when I plugged in the SVD, 682 00:47:25,330 --> 00:47:30,050 the U and the V got separated out to the far left 683 00:47:30,050 --> 00:47:31,220 and the far right. 684 00:47:31,220 --> 00:47:35,940 And it was that that stayed in the middle. 685 00:47:35,940 --> 00:47:38,920 So really, this is the heart of it. 686 00:47:38,920 --> 00:47:46,570 And say, well, that's a diagonal matrix. 687 00:47:46,570 --> 00:47:52,010 So I'm just looking at what happens on each diagonal entry, 688 00:47:52,010 --> 00:47:55,650 and which problem is that? 689 00:47:55,650 --> 00:47:59,520 The question of what's happening on a typical diagonal entry 690 00:47:59,520 --> 00:48:05,020 of this thing is what question? 691 00:48:05,020 --> 00:48:07,880 The 1 by 1 case! 692 00:48:07,880 --> 00:48:11,660 The 1 by 1, because each entry in the diagonal 693 00:48:11,660 --> 00:48:15,980 is not even noticing the others. 694 00:48:15,980 --> 00:48:19,940 So that's the logic, and it would be in the notes. 695 00:48:19,940 --> 00:48:27,450 Prove it first for 1 by 1, then secondly for a diagonal sigma, 696 00:48:27,450 --> 00:48:33,750 and finally for any A, using the SVD, with the U 697 00:48:33,750 --> 00:48:37,560 and V transpose getting out of the way 698 00:48:37,560 --> 00:48:39,720 and bringing us back to here. 699 00:48:39,720 --> 00:48:45,750 So that's the theory, but really, I 700 00:48:45,750 --> 00:48:50,910 guess I'm thinking that by far the most important message 701 00:48:50,910 --> 00:48:57,840 in today's lecture is in this list of different types 702 00:48:57,840 --> 00:49:01,830 of problems that appear and different ways 703 00:49:01,830 --> 00:49:03,630 to work with them. 704 00:49:03,630 --> 00:49:08,250 And we haven't done Gram-Schmidt, 705 00:49:08,250 --> 00:49:10,920 and we haven't done iteration. 
706 00:49:10,920 --> 00:49:16,470 So this chapter is a survey of-- 707 00:49:16,470 --> 00:49:20,160 well, more than a survey of what numerical linear algebra 708 00:49:20,160 --> 00:49:20,760 is about. 709 00:49:20,760 --> 00:49:22,370 And I haven't done random, yet. 710 00:49:22,370 --> 00:49:23,640 Sorry, that's coming, too. 711 00:49:26,240 --> 00:49:29,120 So three pieces are still to come, 712 00:49:29,120 --> 00:49:35,252 but let's take the last two minutes off and call it a day.