ANNOUNCER: The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JULIAN SHUN: All right. So today we're going to talk about cache-oblivious algorithms. Who remembers what a cache-oblivious algorithm is? What is a cache-oblivious algorithm oblivious to? Cache size. So a cache-oblivious algorithm is an algorithm that automatically tunes for the cache size on your machine, so it achieves good cache efficiency, and the code doesn't need to have any knowledge of the cache parameters of your machine. In contrast, a cache-aware algorithm would actually know the parameters of the cache sizes on your machine, and the code would have the size of the cache hard-coded inside.

So today we're going to talk a lot more about cache-oblivious algorithms. Last time we talked about one cache-oblivious algorithm, for matrix multiplication, and today we're going to talk about some other ones.

The first example I want to talk about is simulation of heat diffusion. Here's a famous equation known as the heat equation, and this equation is in two dimensions. We want to simulate this function u that has three parameters: t, x, and y. t is the time step, and x and y are the coordinates of the 2D space. We want to know the temperature at each (x, y) coordinate for any point in time t.

The 2D heat equation can be modeled using a differential equation. How many of you have seen differential equations before? OK, good, most of you. Here I'm showing the equation in two dimensions, but you can get similar equations for higher dimensions. It says the partial derivative of u with respect to t is equal to alpha, a constant called the thermal diffusivity, times a sum of second partial derivatives, written out below.
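In standard notation, the 2D heat equation being described is:

$$\frac{\partial u}{\partial t} = \alpha \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right)$$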
So alpha is a constant, and it multiplies the sum of the second partial derivative of u with respect to x and the second partial derivative of u with respect to y. This is a pretty famous equation, and you can see that we have a single derivative on the left side and a double derivative on the right side. So how do we actually write code to simulate this 2D heat process? Oftentimes in scientific computing, people will come up with these differential equations to describe physical processes, and then they want to come up with efficient code to actually simulate the physical process.

OK. So here's an example of 2D heat diffusion. Let's say we started out with the configuration on the left, where the color corresponds to temperature. A brighter color means it's hotter; yellow is the hottest and blue is the coldest. On the left we just have 6172, which is the course number. If you didn't know that, you're probably in the wrong class. Then we're going to run it for a couple of time steps, and the heat is going to diffuse to the neighboring regions of the 2D space. So after you run it for a couple of steps, you might get the configuration on the right, where the heat is more spread out. And oftentimes, you want to run this simulation for a number of time steps until the distribution of heat converges, so that it becomes stable and doesn't change by much anymore, and then we stop the simulation.

So this is the 1D heat equation. I showed you the 2D one earlier, but we're actually going to generate code for the 1D heat equation since it's simpler, and all the ideas generalize to higher dimensions. Here's the range of colors corresponding to temperature, with the hottest colors on the left and the coldest colors on the right. And if you had a heat source on the left-hand side of this bar, then this might possibly be a stable distribution.
So if you keep running the simulation, you might get a stable distribution of heat that looks like this.

OK. So how do we actually write code to simulate this differential equation? One commonly used method is known as finite difference approximation. We're going to approximate the partial derivative of u with respect to each of its coordinates. The partial derivative of u with respect to t is approximately equal to u(t + delta t, x), where delta t is some small value, minus u(t, x), all divided by delta t. How many of you have seen this approximation method before, from your calculus class? OK, good. So as you bring the value of delta t down to 0, the thing on the right-hand side approaches the true partial derivative.

That's the partial derivative with respect to t. We also need the partial derivative with respect to x. Here I'm saying the partial derivative with respect to x is approximately equal to u(t, x + delta x over 2) minus u(t, x - delta x over 2), all divided by delta x. Notice that instead of adding delta x in the first term and not adding anything in the second term, I'm adding delta x over 2 in the first term and subtracting delta x over 2 in the second term. It turns out that I can do this with the approximation method, and it's still valid as long as the difference between the two arguments I'm plugging in is delta x. Here the difference is delta x, and I can decide how to split up this delta x among the two things in the numerator. The reason I chose delta x over 2 here is that the math is going to work out nicely and give us cleaner code.

This is just the first partial derivative with respect to x. We actually need the second partial derivative, since the right-hand side of this equation has the second partial derivative.
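For reference, the two first-order approximations just described are:

$$\frac{\partial u}{\partial t}(t, x) \approx \frac{u(t + \Delta t, x) - u(t, x)}{\Delta t}, \qquad \frac{\partial u}{\partial x}(t, x) \approx \frac{u\!\left(t, x + \frac{\Delta x}{2}\right) - u\!\left(t, x - \frac{\Delta x}{2}\right)}{\Delta x}$$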
So this is what the second partial derivative looks like. I just take the partial derivative with respect to x of each of the terms in the numerator of the equation above. And now I can plug in the value of this partial derivative by applying the equation above with the arguments t and x + delta x over 2, and similarly for the second term.

For the first term, when I plug it into the equation for the partial derivative with respect to x, I just get u(t, x + delta x) minus u(t, x). And for the second term, I get u(t, x - delta x), and then I subtract another factor of u(t, x). That's why I'm subtracting 2 times u(t, x) in the numerator here. And the partial derivative of each of the things in the numerator also has to be divided by this delta x term, so in the denominator I get delta x squared.

So now I have the second partial derivative with respect to x, and I also have the first partial derivative with respect to t, so I can just plug them into the equation above. On the left-hand side I have this term here, multiplied by the alpha constant, and this term just comes from here. So this is what the 1D heat equation reduces to using the finite difference approximation method. Any questions on this?

So how do we actually write code to simulate this equation? We're going to use what's called a stencil computation. Here I'm going to set delta x and delta t equal to 1, just for simplicity, but in general you can set them to whatever you want; you can make them smaller to have a more fine-grained simulation. If I set delta x and delta t equal to 1, then the denominators of these two terms just become 1 and I don't need to worry about them.

Then I'm going to represent my space using a 2D matrix, where the horizontal axis represents values of x and the vertical axis represents values of t. And I want to fill in all the entries that have a black dot in them.
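To recap the derivation just completed: the second-derivative approximation is

$$\frac{\partial^2 u}{\partial x^2}(t, x) \approx \frac{u(t, x + \Delta x) - 2\,u(t, x) + u(t, x - \Delta x)}{(\Delta x)^2},$$

and substituting both approximations into the heat equation gives the discretized 1D form:

$$\frac{u(t + \Delta t, x) - u(t, x)}{\Delta t} = \alpha \cdot \frac{u(t, x + \Delta x) - 2\,u(t, x) + u(t, x - \Delta x)}{(\Delta x)^2}$$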
The entries with the orange dots are my boundaries. Those are actually fixed throughout the computation, so I'm not going to do any computation on those; they're just given to me as input. They could be heat sources if we're doing the heat simulation.

And now I can actually write code to simulate this equation. If I want to compute u(t + 1, x), I can just go up here, and I see that it's equal to this thing over here. Then I bring the negative u(t, x) term to the right, so I get u(t, x) plus alpha times u(t, x + 1) minus 2 times u(t, x) plus u(t, x - 1). As I said before, we just want to keep iterating this until the temperatures become stable.

So I'm going to proceed in time, which is going up here. To compute one of these points, say u(t + 1, x), I need to know the value of u(t, x), which is just the thing below me in the matrix. I also need to know u(t, x + 1) and u(t, x - 1), and those are just the things below me and diagonal to either the left or the right side. So each value here depends on three other values, and this is called a three-point stencil; that's the pattern this equation is representing. In general, a stencil computation updates each point in an array using a fixed pattern, called a stencil.

I'm going to do the same thing for all of the other points. Here, I compute all the values of x for a given time step, and then I move on to the next time step. I keep doing this until the distribution of temperatures becomes stable, and then I'm done.

OK. So these stencil computations are widely used in scientific computing. They're used for weather simulations, stock market simulations, fluid dynamics, image processing, probability, and so on. They're used all over the place in science, so this is a very important concept to know.

So let's say I just ran the code as I showed you in the animation.
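In symbols, with delta t = delta x = 1, the three-point stencil update applied at each interior grid point is:

$$u(t + 1, x) = u(t, x) + \alpha \left( u(t, x + 1) - 2\,u(t, x) + u(t, x - 1) \right)$$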
So I completed one row at a time before I moved on to the next row. How would this code perform with respect to caching? Yes?

AUDIENCE: I think if x is less than a third of the cache size, [INAUDIBLE]

JULIAN SHUN: Yeah. So if x is small, this would do pretty well. But what if x is much larger than the size of your cache?

AUDIENCE: [INAUDIBLE]

JULIAN SHUN: Yeah, it would do badly. And why is that?

AUDIENCE: Because the whole [INAUDIBLE] second row, [INAUDIBLE]

JULIAN SHUN: Yeah. So it turns out that there could be some reuse here, because when I compute the second row, I'm actually using some values that were computed in the first row. But if the row is much larger than the cache size, by the time I get to the second row, the values that I loaded into cache from the first row will have been evicted. And therefore, I'm going to suffer a cache miss again for the values I need in the second row, even though they could have been reused if we made our code have good locality.

Another question: if we only cared about the values of x at the most recent time step, do we actually have to keep around this whole 2D matrix, or can we get by with less storage? So how many rows would I have to keep around if I only cared about the most recent time step? Yeah?

AUDIENCE: Two.

JULIAN SHUN: Two. And why is that?

AUDIENCE: One for the previous time step, one for the current time step. [INAUDIBLE]

JULIAN SHUN: Right. So I need to keep around two rows, because I need the row from the previous time step in order to compute the values in the current time step. And after the current time step, I can just swap the roles of the two rows that I'm keeping around and reuse the previous row for the next one. Yes?
AUDIENCE: Would you only need one, plus a constant amount of extra space? Like, if you had three extra things, you could probably do it with one.

JULIAN SHUN: So when I'm computing the second row, I need to keep around all of the values I computed in the first row, because these values get fed to one of the computations in the second row. So I need to actually keep all of them around.

AUDIENCE: I think if you iterate to the right, then you have three that are this one and the next one. Just three. Then you can--

JULIAN SHUN: Oh, I see what you're saying. Yeah. So that's actually a good observation. You only need to keep a constant amount more storage, because you'll just be overwriting the values as you go through the row. So if you keep around one row, some of the values would be from the current time step and some of them would be from the previous time step. That's a good observation, yes.

OK. So that code, as we saw, wasn't very cache efficient. You could make it cache efficient using tiling, but we're going to go straight to the cache-oblivious algorithm because it's much cleaner.

So let's recall the ideal-cache model. We talked about this in the previous lecture. Here we have a two-level hierarchy: we have a cache, and then we have the main memory. The cache has a size of M bytes, and a cache line is B bytes, so you can keep M/B cache lines in your cache. If something is in cache and you operate on it, it doesn't incur any cache misses. But if you have to go to main memory to load the cache line in, you incur one cache miss. The ideal-cache model assumes that the cache is fully associative, so any cache line can go anywhere in the cache. And it also assumes either an optimal omniscient replacement policy or the LRU policy.
The optimal omniscient replacement policy knows the sequence of all future requests to memory, and when it needs to evict something, it picks the thing whose eviction leads to the fewest cache misses overall. The LRU policy just evicts the thing that was least recently used. But we saw in the previous lecture that, in terms of asymptotic cost, these two replacement policies give you cache miss counts within a constant factor of each other, so you can use either one, depending on what's convenient.

Two performance measures that we care about when we're analyzing an algorithm in the ideal-cache model are the work and the number of cache misses. The work is just the total number of operations that the algorithm incurs; serially, this is just the ordinary running time. And the number of cache misses is the number of cache lines you have to transfer between the cache and main memory.

So let's assume that we're analyzing an algorithm in the ideal-cache model and it runs serially. What kinds of cache misses does the ideal-cache model not capture? Remember, we talked about several types of cache misses, and there's one type that this model doesn't capture when we're running serially. So let's assume we're running serially, without any parallelism; sharing misses only come about when you have parallelism. Yes?

AUDIENCE: Conflict misses?

JULIAN SHUN: Yes, the answer is conflict misses. And why is that? Why does this model not capture them?

AUDIENCE: There's not a specific set that could get replaced, since it's fully associative.

JULIAN SHUN: Yes. So this is a fully associative cache; any cache line can go anywhere in the cache. You can only get conflict misses in set-associative schemes, where each cache line can only be mapped to a particular set.
And if you have too many cache lines that map to that particular set, they're going to keep evicting each other even though the rest of the cache could have space; that's what's called a conflict miss. The ideal-cache model does capture capacity misses, so it is still a very good model to use at a high level when you're designing efficient algorithms, because it encourages you to optimize for spatial and temporal locality. And once you have a good algorithm in the ideal-cache model, you can start dealing with conflict misses using some of the strategies we talked about last time, such as padding or using temporary memory. Any questions on this?

OK. So this is the code that does the heat simulation that we saw earlier. It's just a nested for loop. In the outer loop, we're looping over the time dimension; in the inner loop, we're looping over the space dimension. So we're computing all the values of x before we move on to the next time step. And we're storing two rows here, using what's called the even-odd trick. Here's how it works: to access the next row, the one we want to compute, we just do (t + 1) mod 2, and to access the current row, it's just t mod 2. So this implicitly swaps the roles of the two rows that we're keeping around as we progress through time.

Then we set u[(t + 1) mod 2][x] equal to kernel of a pointer to u[t mod 2][x]. The kernel function is defined up here. Recall that when we pass a pointer to this kernel function, we can treat the pointer as the beginning of an array, so we're using array notation up here inside the kernel function. The array w is passed as input, and we return w[0], the element at the pointer we passed to kernel, plus alpha times the quantity w[-1] minus 2 times w[0] plus w[1].
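A minimal sketch of the looping code just described, reconstructed from the narration; the values of N, T, and ALPHA, the boundary handling, and the main function are illustrative assumptions, since the slide itself isn't reproduced in the transcript:

```c
#include <stdio.h>

#define N 10000     /* number of spatial points (assumed) */
#define T 1000      /* number of time steps (assumed) */
#define ALPHA 0.1   /* thermal diffusivity constant (assumed) */

double u[2][N];     /* even-odd trick: only two rows are kept */

/* Three-point stencil kernel; w points at the current cell, so
   w[-1] and w[1] are its left and right neighbors. */
double kernel(const double *w) {
  return w[0] + ALPHA * (w[-1] - 2.0 * w[0] + w[1]);
}

int main(void) {
  for (int t = 0; t < T; ++t) {          /* outer loop: time */
    for (int x = 1; x < N - 1; ++x) {    /* inner loop: space; boundaries stay fixed */
      u[(t + 1) % 2][x] = kernel(&u[t % 2][x]);
    }
  }
  printf("%f\n", u[T % 2][N / 2]);       /* print one value so the work isn't dead code */
  return 0;
}
```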
Here w[-1] is the element right before the one the pointer points to, and w[1] is the element right after it.

OK. So let's look at the caching behavior of this code. We're going to analyze the cache complexity, and we're going to assume the LRU replacement policy, because we can. As we said before, we loop through one entire row at a time before we go on to the next row. So the number of cache misses, assuming that N is greater than M so that the row size is greater than the cache size, is theta of NT over B. How do I get this cache complexity? How many cache misses do I incur for each row of this 2D space that I'm computing? Yes?

AUDIENCE: N over B.

JULIAN SHUN: Right. So I need theta of N over B cache misses for each row. This is because I can load in B bytes at a time, so I benefit from spatial locality there, and I have N elements to compute, so it's theta of N over B per row. And as we said before, when we get to the next row, the stuff we need from the previous row has already been evicted from cache, so I incur theta of N over B cache misses for every row. The number of rows I compute is T, so in total it's theta of NT over B. Any questions on this analysis?

So how many of you think we can do better than this? OK, so one person. Two, three. OK. It turns out that we can do better than this. You can actually do better with tiling, but I'm not going to do the tiling version; I want to do the cache-oblivious version.

And the cache-oblivious version is going to work on trapezoidal regions in the 2D space. Recall that a trapezoid has a top base and a bottom base. Here the top base is at t1, the bottom base is at t0, and the height is just t1 minus t0. And the width of the trapezoid is its width at the midpoint between t1 and t0.
So we measure the width at t = (t0 + t1) / 2. We're going to compute all of the points inside this trapezoid that satisfy these inequalities: t has to be greater than or equal to t0 and less than t1, and x has to be greater than or equal to x0 + dx0 * (t - t0) and less than x1 + dx1 * (t - t0). So dx0 is actually the inverse slope on the left side, and dx1 is the inverse slope on the other side. And dx0 and dx1 each have to be either -1, 0, or 1. Negative 1 corresponds to an inverse slope of -1, which is also a slope of -1; if it's 1, then it's a slope and inverse slope of 1; and if it's 0, we just have a vertical line.

OK. So the nice property of this trapezoid is that we can compute everything inside the trapezoid without looking outside it. We can compute everything here independently of any other trapezoids we might be generating. And we're going to come up with a divide-and-conquer approach to execute this code.

So the divide-and-conquer algorithm has a base case. Our base case is when the height of the trapezoid is 1. When the height is 1, we just compute all of the values using a simple loop, and any order of computation inside this loop is valid: we have all the values in the base of the trapezoid, and the values in the top of the trapezoid don't depend on each other, so we can compute them in whatever order. So that's the base case. Any questions so far?

So here's one of the recursive cases. It turns out that we're going to have two different types of cuts. The first cut is called a space cut. I'm going to do a space cut if the width of the trapezoid is greater than or equal to twice the height. This means that the trapezoid is too wide, and I'm going to cut it vertically.
More specifically, I'm going to cut it with a line of slope -1 going through the center of the trapezoid. Then I'm going to traverse the trapezoid on the left side first, and after I'm done with that, traverse the trapezoid on the right side.

So can I actually switch the order of this? Can I compute the stuff on the right side before the stuff on the left side? No. Why is that?

AUDIENCE: [INAUDIBLE]

JULIAN SHUN: Yeah. There are some points in the right trapezoid that depend on values from the left trapezoid. For the left trapezoid, every point we want to compute already has all of the points it depends on, assuming we're given all the values along the base. But for the right-hand side, some of the values depend on values in the left trapezoid, so we can't execute the right trapezoid until we're done with the left trapezoid. And this is the reason I cut this trapezoid with a slope of -1 instead of using a vertical cut: if I did a vertical cut, then both of the trapezoids would contain points that depend on the other trapezoid.

So that's one of the two cuts; this is called a space cut, and it happens when the trapezoid is too wide. The other cut is the time cut, where I cut with respect to the time dimension. This happens when the trapezoid is too tall, so when the width is less than twice the height of the trapezoid. Then what I'm going to do is just cut it with a horizontal line through the center. I traverse the bottom trapezoid first, and after I'm done with that, I can traverse the top trapezoid. And again, the top trapezoid depends on some points from the bottom trapezoid, so I can't switch the order of those. Any questions?

OK. So let's now look at the code that implements this recursive divide-and-conquer algorithm.
So here's the C code. It takes as input t0 and t1, the time coordinates of the bottom and top of the trapezoid. x0 is the left end of the bottom base of the trapezoid and dx0 is the inverse slope on the left side; x1 is the right end of the bottom base and dx1 is the inverse slope on the right side.

We first compute the height of the trapezoid, lt, which is just t1 minus t0. If the height is 1, we just use a for loop over all the points in that height-1 trapezoid, calling the kernel function we defined before. Otherwise, the height is greater than 1, and we check whether we should do a space cut or a time cut. We do a space cut if the trapezoid is too wide, and the condition inside the if clause checks exactly that. If so, we make two recursive calls to trapezoid, cutting it in the middle with a slope of -1, so you see the negative ones in the recursive calls. Otherwise, we do a time cut, and for the time cut we just cut in the middle, at the value of t equal to t0 plus lt divided by 2, and again we have two recursive calls to trapezoid.

OK. So even though I'm only generating slopes of -1 in the recursive calls, this code works even if I have slopes of 1 and 0, because I could start out with slopes of 1 and 0. For example, if I had a rectangular region, then the slopes are just going to be 0, and this code is still going to work. But most of the slopes I'm dealing with are going to be -1, because those are the new slopes I'm generating. Any questions?

So this code is very concise. It turns out that the even-odd trick still works here.
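Here is a sketch of that recursive code, reconstructed from the description above; it reuses the u array and kernel function from the earlier looping sketch, and the exact arithmetic in the cut conditions follows the standard cache-oblivious stencil formulation, which is an assumption about what's on the slide:

```c
#include <stdint.h>
/* Assumes u[2][N] and kernel() from the looping sketch above. */

void trapezoid(int64_t t0, int64_t t1,
               int64_t x0, int64_t dx0,
               int64_t x1, int64_t dx1) {
  int64_t lt = t1 - t0;                          /* height of the trapezoid */
  if (lt == 1) {
    /* Base case: a single time step; any order over x is valid here. */
    for (int64_t x = x0; x < x1; ++x)
      u[(t0 + 1) % 2][x] = kernel(&u[t0 % 2][x]);
  } else if (lt > 1) {
    if (2 * (x1 - x0) + (dx1 - dx0) * lt >= 4 * lt) {
      /* Space cut: width >= 2 * height at the midpoint, so cut with a
         slope -1 line through the center of the trapezoid. */
      int64_t xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * lt) / 4;
      trapezoid(t0, t1, x0, dx0, xm, -1);        /* left trapezoid first */
      trapezoid(t0, t1, xm, -1, x1, dx1);        /* then the right one */
    } else {
      /* Time cut: too tall, so cut horizontally through the center. */
      int64_t halflt = lt / 2;
      trapezoid(t0, t0 + halflt, x0, dx0, x1, dx1);           /* bottom first */
      trapezoid(t0 + halflt, t1, x0 + dx0 * halflt, dx0,
                x1 + dx1 * halflt, dx1);                      /* then top */
    }
  }
}
```

A call like trapezoid(0, T, 1, 0, N - 1, 0) would sweep the same rectangular region as the looping version, since inverse slopes of 0 give vertical sides.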
You can still keep around just two rows, because you're guaranteed not to overwrite any value until all the things that depend on it have been computed. But when you're using just two rows, the values in a particular row might not all correspond to the same time step, because we're not finishing an entire row before we move on to the next one; we're partially computing rows.

OK. So let's analyze the cache complexity of this algorithm. Again, we're going to use the recursion tree approach that we talked about last time. The code splits itself into two subproblems at every level until it gets to a leaf. Each leaf represents theta of hw points, where h is the height and w is the width, and h is theta of w. We have the property that h equals theta of w because of the nature of the algorithm: when the trapezoid becomes too wide, we make it less wide by doing a space cut, and when it becomes too tall, we do a horizontal cut. So we're guaranteed that the height and the width are within a constant factor of each other when we get to the base case.

And each leaf in the base case incurs theta of w over B misses, because we have to load the base of the trapezoid in from main memory; once that's in cache, we can compute all of the other points in the trapezoid without incurring any more cache misses. So the number of cache misses per leaf is theta of w over B. And we're going to set w equal to theta of M in the analysis, because that's the point at which the trapezoid fits into cache. The algorithm doesn't actually have any knowledge of this M parameter, so it's still going to keep dividing and conquering until it gets to base cases of size 1; but just for the analysis, we use a base case where w is theta of M.

So the number of leaves is theta of NT divided by hw, because each leaf covers theta of hw points.
The number of internal nodes is the number of leaves minus 1, because we have a tree here. But the internal nodes don't contribute substantially to the cache complexity, because each of them just incurs a constant number of cache misses to set up the two recursive calls, and it doesn't do anything more expensive than that. So we just need to count the cache misses at the leaves. We have theta of NT over hw leaves, each of which incurs theta of w over B cache misses, and we multiply that out: the w terms cancel, leaving NT over hB, and setting h and w to theta of M, we're left with theta of NT over MB as our cache complexity. Yes?

AUDIENCE: Can you explain why hB [INAUDIBLE]?

JULIAN SHUN: Sure. Each leaf only incurs theta of w over B cache misses because we need to load the values at the base of the trapezoid, and that incurs theta of w over B cache misses since the base is w wide. Everything else fits into cache, so when we compute the remaining points, we already have all of the previous values we need in cache, and that's why we don't incur any more cache misses. Does that make sense?

AUDIENCE: Yeah, I forgot that it was [INAUDIBLE].

JULIAN SHUN: Yeah. OK. So this was the analysis for just one dimension, but it actually generalizes to more than one dimension. In general, if we have d dimensions, then the cache complexity is theta of NT divided by M to the 1 over d, times B. If d is 1, that reduces to NT over MB; if d is 2, it's NT over B times square root of M, and so on. And it turns out that this bound is also optimal.

Any questions? So compared to the looping code, this code has much better cache complexity; it's saving by a factor of M. The looping code had a cache complexity of theta of NT over B, without an M in the denominator.
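Putting the analysis in one line, with h, w = Theta(M):

$$Q(N, T) = \underbrace{\Theta\!\left(\frac{NT}{hw}\right)}_{\text{number of leaves}} \cdot \underbrace{\Theta\!\left(\frac{w}{B}\right)}_{\text{misses per leaf}} = \Theta\!\left(\frac{NT}{hB}\right) = \Theta\!\left(\frac{NT}{MB}\right),$$

which in d dimensions generalizes to $\Theta\!\left(\frac{NT}{M^{1/d} B}\right)$, versus $\Theta\!\left(\frac{NT}{B}\right)$ for the looping code.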
OK. So we're actually going to do a simulation now, comparing how the looping code and the trapezoidal code run. In this simulation, the green points correspond to cache hits and the purple points correspond to cache misses. We're going to assume a fully associative cache using the LRU replacement policy, where the cache line size is 4 points and the cache size is 32 points. And we're going to set the cache hit latency to one cycle and the cache miss latency to 10 cycles, so an order of magnitude slower for cache misses. We're doing this for a rectangular region where N is 95, running for 87 time steps. So let me pull up the simulation now.

OK. On the left-hand side, we have the looping code; on the right-hand side, we have the trapezoidal code. So let's start this. You can see that the looping code is going over one row at a time, whereas the trapezoidal code is not doing that; it's partially computing one row and then moving on to the next row.

I can also show the cuts that are generated by the trapezoidal algorithm. I have to remember how to do this. So there are the cuts that are generated by the trapezoidal algorithm. And I can speed this up. You can see that the trapezoidal algorithm is incurring many fewer cache misses than the looping code. Let me make this go faster. The trapezoidal code is going to finish while the looping code is still slowly making its way up to the top.

OK. So it's finally done. Any questions on this? Yeah?

AUDIENCE: Why doesn't the trapezoid [INAUDIBLE]?

JULIAN SHUN: In which of the regions? So it's loading-- you get a cache miss for the purple dots here. The cache line size is 4, so you get a cache miss for the first point and then you hit on the next three points.
Then you get another cache miss for the fifth point, and you hit on points 6, 7, and 8.

AUDIENCE: I was indicating the one above it.

JULIAN SHUN: For the one above it, we're assuming that the two arrays fit in cache already, so we don't actually need to load them from memory. The row above just depends on the values that we have already computed, and those fit in cache; they're already in cache. The base of the trapezoid is already in cache, and for the row right above it, we just need to look at those values. Does that make sense?

AUDIENCE: OK. Because of the even-odd?

JULIAN SHUN: Yeah. Any other questions on this? Does anyone want to see this again? Yeah? So I could let this run for the rest of the lecture, but I have more interesting material that I want to talk about, so let's just stop after this finishes. And as you see again, the looping code is slowly making its way to the top; it's much slower than the trapezoidal code.

OK. So that was only for one dimension. Now let's look at what happens in two dimensions. Here I'm going to show another demo, and in this demo I'm actually going to run the code for real. The previous demo was just a simulation, so this is going to happen in real time. I'm going to simulate heat in a 2D space, and recall that the colors correspond to temperature, so a brighter color means it's hotter. Let me pull up the other demo.

OK. So here, my mouse cursor is the source of heat. You see that it's making the points around my mouse cursor hot, and then the heat slowly diffuses to the other points. I can move the cursor so that my heat source changes, and then the heat I generated earlier slowly goes away.

OK. So in the lower left-hand corner, I'm showing the number of iterations per second of the code.
And we can see that the looping code is doing about 1,560 iterations per second. Let's see what happens when we switch to the trapezoid code.

So the trapezoid code is doing about 1,830 iterations per second. So it's a little bit faster, but not by too much.

Does anyone have an idea why we're seeing this behavior? So we said that the trapezoid code incurs many fewer cache misses than the looping code, so we would expect it to be significantly faster. But here it's only a little bit faster. Yeah?

AUDIENCE: [INAUDIBLE].

JULIAN SHUN: Right. So that's a good point. So in 2D you're only saving a factor of square root of M instead of M. But square root of M is still pretty big compared to the speedup we're getting here. So any other ideas? Yeah. So there is a constant factor in the trapezoidal code. But even after accounting for the constant factor, you should still see a better speedup than this. So even accounting for the constant factors, what other problem might be going on here? Yeah?

AUDIENCE: Is it that we don't actually have an ideal cache?

JULIAN SHUN: Yeah. So that's another good observation. But the caches that we're using should still behave pretty well. I mean, they should still have cache complexity that's pretty close to the ideal cache model. You might be off by a small constant factor, but not by too much. Yeah?

AUDIENCE: Maybe because [INAUDIBLE]

JULIAN SHUN: Sorry. Could you repeat that?

AUDIENCE: There are [INAUDIBLE] this time like [INAUDIBLE]

JULIAN SHUN: Yeah. So OK. So if I move the cursor, it's probably going to go slower by a little bit. But that doesn't really affect the performance. I can just leave my cursor there, and this is what the iterations per second is. Yes?
AUDIENCE: Maybe there's like, a lot of similar programs doing this [INAUDIBLE].

JULIAN SHUN: Yeah. So there is some other factor dominating. Does anyone have an idea what that factor might be?

AUDIENCE: Rendering?

JULIAN SHUN: No. It's not the rendering. I ran the code without showing the graphics, and the performance was similar. Yes?

AUDIENCE: Maybe similar to what she said, there could be other things using the cache [INAUDIBLE].

JULIAN SHUN: Yes. Yeah. So there could be other things using the cache. But that would be true for both of the programs. And I don't actually have anything intensive running, except for PowerPoint. I don't think that uses much of my cache.

All right. So let's look at why this is the case. It turns out that the hardware is actually helping the looping code. So the question is how come the cache oblivious trapezoidal code can have so many fewer cache misses, but the advantage gained over the looping version is so marginal? It turns out that for the looping code, the hardware is actually helping it by doing hardware prefetching. And hardware prefetching for the looping code is actually pretty good, because the access pattern is very regular. It's just going one row at a time. So the hardware can predict the memory access pattern of the looping code, and therefore it can bring in the cache lines that the looping code would need before it actually gets to that part of the computation. So prefetching is helping the looping code. And it's not helping the trapezoid code that much, because the access pattern is less regular there. And prefetching does use memory bandwidth. But when you're using just a single core, you have more than enough bandwidth to take advantage of the hardware prefetching capabilities of the machine.
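As an aside, the software prefetching that comes up in the next question is exposed through compiler builtins. Here is a minimal sketch using GCC and Clang's __builtin_prefetch (the builtin is real, but the loop, array, and prefetch distance here are hypothetical, not code from the lecture):

    // Hedged sketch of software prefetching while streaming over an array.
    // __builtin_prefetch(addr, rw, locality) is a GCC/Clang builtin:
    // rw = 1 hints that we will write the location, locality = 3 hints
    // high temporal locality. The prefetch distance is a per-machine tuning
    // knob, exactly the kind of voodoo parameter that hardware prefetching
    // spares us from on regular access patterns.
    #define PREFETCH_DIST 16
    void scale(double *a, long n, double c) {
      for (long i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
          __builtin_prefetch(&a[i + PREFETCH_DIST], 1, 3);
        a[i] *= c;  // work on the current element while later lines load
      }
    }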
But later on, we'll see that when we're running in parallel, the memory bandwidth does become more of an issue, when you have multiple processors all using the memory. Yeah? Question?

AUDIENCE: Is there a way of touching a cache [INAUDIBLE] or touching a piece of memory before you need it so that you don't need [INAUDIBLE]

JULIAN SHUN: You can do software prefetching. There are instructions to do software prefetching. Hardware prefetching is usually more efficient, but it's less flexible than software prefetching. But here we're not actually doing that. We're just taking advantage of hardware prefetching.

AUDIENCE: [INAUDIBLE]?

JULIAN SHUN: Yeah. So we didn't actually try that. It could benefit a little bit if we used a little bit of software prefetching. Although, I think it would probably benefit the looping code as well if we did that. Yes?

AUDIENCE: Is hardware prefetching [INAUDIBLE]?

JULIAN SHUN: Sorry? Sorry?

AUDIENCE: Is hardware prefetching always enabled?

JULIAN SHUN: Yeah. So hardware prefetching is enabled. It's always done by the hardware on the machines that we're using today.

This was a pretty surprising result. But we'll actually see the effect of the memory bandwidth later on when we look at the parallel code. Any other questions before I continue?

OK. So let's now look at the interplay between caching and parallelism. This was a theorem that we proved in the previous lecture, so let's recall what it says. It says let Q sub p be the number of cache misses in a deterministic Cilk computation when run on p processors, each with a private cache, and let S sub p be the number of successful steals during the computation.
In an ideal cache model with a cache size of M and a block size of B, the number of cache misses on p processors is equal to the number of cache misses on one processor, plus order the number of successful steals times M over B. And last time we also said that the number of successful steals is upper bounded by the span of the computation times the number of processors. So if you minimize the span of your computation, then you can also minimize the number of successful steals. And then for low span algorithms, the first term, the Q sub 1 term, is usually going to dominate. I'm not going to go over the proof; we did that last time. The moral of the story is that minimizing cache misses in the serial elision essentially minimizes them in the parallel execution, assuming that you have a low span algorithm.

So let's see whether our trapezoidal algorithm works in parallel. Does the space cut work in parallel? Recall that for the space cut, I'm cutting with a slope of negative 1 through the center, because the trapezoid is too wide. So can I execute the two trapezoids in parallel here?

No. The reason is that the right trapezoid depends on the result of the left trapezoid, so I can't execute them at the same time. But there is a way that I can execute trapezoids in parallel. Instead of just doing the cut through the center, I'm actually going to do a V-cut. So now I have three trapezoids. The two trapezoids in black-- I can actually execute those in parallel, because they're independent. Everything in those two trapezoids just depends on the base of that trapezoid. And after I'm done with the two trapezoids labeled 1, then I can compute the trapezoid labeled 2. This is known as a parallel space cut. It produces two black trapezoids as well as a gray trapezoid: the two black trapezoids are executed in parallel, and afterwards, the gray trapezoid executes. And this is done recursively as well.

Any questions? Yeah? No.
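In code, the parallel space cut might look something like the following hedged Cilk sketch. The names trap_t, trapezoid, and space_cut are hypothetical stand-ins; the real routine carries more geometry arguments than shown here:

    #include <cilk/cilk.h>

    // Hedged sketch of the parallel space cut. left and right are the two
    // black trapezoids produced by the V-cut; middle is the gray trapezoid
    // that depends on both of their results.
    void space_cut(trap_t left, trap_t right, trap_t middle) {
      cilk_spawn trapezoid(left);   // black trapezoid labeled 1
      trapezoid(right);             // the other black trapezoid, in parallel
      cilk_sync;                    // both must finish before the gray one
      trapezoid(middle);            // gray trapezoid labeled 2
    }

For the inverted V-cut described next, the order just flips: the middle trapezoid runs first, and then the two side trapezoids can run in parallel.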
OK. We also have the time cut. Oh, sorry. So if the trapezoid is inverted, then we're going to do this upside down V-cut. And in this case, we're going to execute the middle trapezoid before we execute the two trapezoids on the sides.

For the time cut, it turns out that we're just going to use the same cut as before, and we're just going to execute the two trapezoids serially. So we do get a little bit of parallelism here from the parallel space cut. Let's look at how the parallel codes perform now.

So, OK. So this was the serial looping code. Here's the parallel looping code. So we had 1,450 before, and about 3,700 now. So a little over twice the speedup. And this is on a four core machine. It's just on my laptop.

AUDIENCE: [INAUDIBLE]?

JULIAN SHUN: Sorry?

AUDIENCE: [INAUDIBLE]?

JULIAN SHUN: Oh yeah, sure. Yeah, it's slowing down a little bit, but not by too much.

OK. Let's look at the trapezoidal code now. So as we saw before, the trapezoid code does about 1,840 iterations per second. And we can parallelize this. So now it's doing about 5,350 iterations per second. So it's getting about a factor of three speedup.

I can move it around a little bit more if you want to see it. So serial trapezoid and parallel trapezoid. Is everyone happy?

OK. Because I had to do this in real time, the input size wasn't actually that big. So I ran the experiment offline, without the rendering, on a much larger input. This input is a 3,000 by 3,000 grid, and I did this for 1,000 time steps using four processor cores. And my cache size is 8 megabytes; that's the last level cache size. So the input size here is much larger than my last level cache size.

And here are the times that I got. So the serial looping code took about 129 seconds.
And when we did it in parallel, it was about 66 seconds. So it got about a factor of two speedup, which is consistent with what we saw. The trapezoidal code actually got a better speedup when we increased the input size. So we got about a factor of four speedup. And this is because for a larger input size, the cache efficiency plays a much larger role, because the cache is so small compared to our input size.

So here we see that the parallel looping code achieves less than half of its potential speedup, even though the parallel looping code has much more potential parallelism than the trapezoidal code. The trapezoidal code only has a little bit of parallelism, only from the space cuts, yet it's actually getting pretty good speedup. It's near linear speedup, since I'm using four cores and it's getting a 3.96x speedup.

So what could be going on here? Another thing to look at is to compare the serial trapezoid code with the serial looping code, as well as the parallel trapezoid code with the parallel looping code. If you look at the serial trapezoid code, you see that it's about twice as fast as the serial looping code. But the parallel trapezoid code is about four times faster than the parallel looping code.

And the reason here is that the hardware prefetching can't help the parallel looping code that much. Because when you're running in parallel, all of the cores are using memory, and there's a memory bandwidth bottleneck here. And prefetching actually needs to use memory bandwidth. So in the serial case, we had plenty of memory bandwidth we could use for prefetching, but in the parallel case, we don't actually have much memory bandwidth we can use here. So that's why in the parallel case, the trapezoid code gets a better speedup over the parallel looping code, compared to the serial case.
And the trapezoid code also gets a better speedup because it does things more locally, so it needs fewer memory operations. There's a scalability bottleneck at the memory interconnect. But because the trapezoidal code is cache oblivious, it does a lot of work in cache, whereas the looping code does more accesses to the main memory. Any questions on this?

So when we have a parallel speedup bottleneck, how do we know what's causing it? There are several main things that we should look at. We should see whether our algorithm has insufficient parallelism, whether the scheduling overhead is dominating, whether we have a lack of memory bandwidth, or whether there is contention going on. And contention can refer to either locking or true and false sharing, which we talked about in the last lecture.

The first two are usually quite easy to diagnose. You can compute the work and the span of your algorithm, and from that you can get the parallelism. You can also use Cilkscale to help you diagnose the first two problems, because Cilkscale can tell you how much parallelism you're getting in your code, and it can also tell you the burdened parallelism, which includes the scheduler overhead.

What about for memory bandwidth? How can we diagnose that? Does anyone have any ideas?

So I can tell you one way to do it. I can open up my hardware and take out all of my memory chips except for one of them, and run my serial code. And if it slows down, then that means it was probably memory bandwidth bound when I ran it in parallel. But that's a pretty heavy handed way to diagnose this problem. Is there anything we can do with just software? Yes?

AUDIENCE: Can we simulate it, like Valgrind would do, and count how many memory accesses [INAUDIBLE]

JULIAN SHUN: Yeah, so you could use a tool like Valgrind to count the number of memory accesses. Yes?
AUDIENCE: There's like taskset or something where you can make sure that only one processor is being used for this.

JULIAN SHUN: So you can make sure only one processor is being used, but it might be using more memory than just the memory from one chip.

There's actually a simpler way to do this. The idea is that we'll just run p identical copies of the serial code, and then they will all be executing in parallel, as sketched below. And if the execution slows down, then that means they were probably contending for memory bandwidth. Does that make sense?

One caveat of this is you can only do this if you have enough physical memory, because when you're running p identical copies, you have to use more DRAM than if you just ran one copy. So you have to have enough physical memory. But oftentimes, you can isolate some part of the code that you think has a performance bottleneck, and just execute that part of the program with p copies in parallel. And hopefully that will take less memory.

There are also hardware counters you can check, if you have root access to your machine, that can measure how much memory bandwidth your program is using. But this is a pretty simple way to do this. So there are ways to diagnose a lack of memory bandwidth.

It turns out that contention is much harder to diagnose. There are tools that detect lock contention in an execution, but usually they only detect contention when the contention actually happens, and the contention doesn't have to happen every time you run your code. So these tools don't detect the potential for lock contention. And the potential for true and false sharing is even harder to detect, especially false sharing, because if you're using a bunch of variables in your code, you don't know which of those map to the same cache line. So this is much harder to detect.
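Coming back to the bandwidth test, here is a hedged C sketch of the run-p-copies diagnostic. kernel() is a hypothetical stand-in for the serial code under test, and the sizes are made up:

    // Hedged sketch: time one copy of a serial kernel, then p identical
    // copies running at once. With ample bandwidth the times should be
    // close; a much larger parallel time suggests the copies are
    // contending for memory bandwidth.
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    #define N (1L << 24)  // about 128 MB of doubles per copy (made up)

    void kernel(void) {
      double *a = calloc(N, sizeof(double));  // stream over a big array
      for (int t = 0; t < 10; t++)
        for (long i = 0; i < N; i++)
          a[i] = 2.0 * a[i] + 1.0;
      free(a);
    }

    double run_copies(int p) {
      struct timespec s, e;
      clock_gettime(CLOCK_MONOTONIC, &s);
      for (int i = 0; i < p; i++)
        if (fork() == 0) { kernel(); _exit(0); }  // launch identical copy
      while (wait(NULL) > 0) ;                    // wait for all copies
      clock_gettime(CLOCK_MONOTONIC, &e);
      return (e.tv_sec - s.tv_sec) + 1e-9 * (e.tv_nsec - s.tv_nsec);
    }

    int main(void) {
      double t1 = run_copies(1), tp = run_copies(4);
      printf("1 copy: %.2fs, 4 copies: %.2fs, ratio %.2f\n", t1, tp, tp / t1);
      return 0;
    }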
Usually when you're trying to debug a speedup bottleneck in your code, you would first look at the first three things here. And then once you've eliminated those, you can start looking at whether contention is causing the problem. Any questions?

OK. So I talked about stencil computation. I now want to talk about another problem, sorting. And we want to do this cache efficiently.

OK. So let's first analyze the cache complexity of a standard merge sort. We first need to analyze the complexity of merging, because it's used as a subroutine in merge sort. As you recall, in merging, we're given two sorted input arrays, and we want to generate an output array that's also sorted, containing all the elements from the two input arrays. The algorithm maintains a pointer to the head of each of the input arrays. It compares the two head elements, takes the smaller one and puts it into the output, and then increments the pointer for that array. And we keep doing this, taking the smaller of the two elements, until the two input arrays become empty, at which point we're done with the algorithm. We have one sorted output array.

OK. So to merge n elements, the time to do this is just theta of n. Here n is the sum of the sizes of my two input arrays. And this is because I'm only doing a constant amount of work for each of my input elements.

What about the number of cache misses? How many cache misses will I incur when merging n elements? Yes?

AUDIENCE: [INAUDIBLE].

JULIAN SHUN: Yeah. So I'm going to incur theta of n over B cache misses, because my two input arrays are stored contiguously in memory, so I can read them B bytes at a time with just one cache miss. And my output array is also stored contiguously, so I can write things out B bytes at a time with just one cache miss.
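That sequential access pattern is easy to see in code. Here is a minimal C sketch of the two-way merge just described (standard textbook code, not necessarily the lecture's exact implementation):

    // Hedged sketch of the two-way merge described above. All three arrays
    // are scanned front to back, so each one costs about one cache miss
    // per B bytes.
    void merge(const int *a, long na, const int *b, long nb, int *out) {
      long i = 0, j = 0, k = 0;
      while (i < na && j < nb)              // take the smaller head element
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
      while (i < na) out[k++] = a[i++];     // drain whichever array remains
      while (j < nb) out[k++] = b[j++];
    }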
I might waste a cache line at the beginning and end of each of my three arrays, but that only affects the bound by a constant factor. So it's theta of n over B.

So now let's look at merge sort. Recall that merge sort has two recursive calls to itself on inputs of half the size, and then it does a merge at the end to merge the two sorted outputs of its recursive calls.

So if you look at how the recursion proceeds, it's first going to divide the input array into two halves. It's going to divide into two halves again and again, until you get to the base case of just one element, at which point you return. And then we start merging pairs of these elements together. So now I have these arrays of size 2 in sorted order, and I merge pairs of those arrays, and I get subarrays of size 4. And then finally, I do this one more time to get my sorted output.

OK. So let's review the work of merge sort. What's the recurrence for merge sort if we're computing the work? Yes?

AUDIENCE: [INAUDIBLE].

JULIAN SHUN: Yeah. So that's correct. I have two subproblems of size n over 2, and then I need to do theta of n work to do the merge.

And this is case two of the master theorem. So I'm computing log base b of a, which is log base 2 of 2, and that's just 1. And that's the same as the exponent in the term that I'm adding in. So since they're the same, I add in an additional log factor, and my overall work is just theta of n log n.

OK. So now I'm going to solve this recurrence again using the recursion tree method. I'm still going to get theta of n log n, but I'm doing this because it's going to be useful when we analyze the cache complexity.

So at the top level I have a problem of size n, and I'm going to branch into two problems of size n over 2. And when I'm done with them, I have to do a merge, which takes theta of n work.
And I'm just putting n here; I'm ignoring the constants. And I'm going to branch again, and each one of these is going to do n over 2 work to merge. And I'm going to get all the way down to my base case after log base 2 of n levels.

The top level is doing n work. The second level is also doing n work. And it's going to be the same for all levels down to the leaves. The leaves are also doing a linear amount of work. So the overall work is just the work per level times the number of levels, so it's just theta of n log n.

OK. So now let's analyze this with caching. We said earlier that the cache complexity of the merging subroutine is theta of n over B. And here's the recurrence for the cache complexity of merge sort. My base case here is when n is less than or equal to cM, for some sufficiently small constant c. And this is because at this point, my problem size fits into cache, and everything else I do for that problem is still going to be in cache. And to load the base case into cache, I need to incur theta of n over B cache misses. Otherwise, I'm going to have two recursive calls of size n over 2, and then I need theta of n over B cache misses to do the merge of my two results.

So here, my base case is larger than what I used for the work. The algorithm actually still recurses down to a constant size base case, but just for the analysis, I'm stopping the recursion when n is less than or equal to cM.

So let's analyze this. Again, I'm going to have a problem of size n at the beginning, and then I'm going to split into two problems of size n over 2, and then I'm going to have to pay n over B cache misses to merge the results together. Similarly for the next level, now I'm paying n over 2B cache misses for each of my two problems here to do the merge.
And I keep going down until I get to a subproblem of size theta of cM. At that point, it's going to fit in cache, and I don't need to recurse anymore in my analysis.

So the number of levels of this recursion tree is just log base 2 of n over cM. I'm basically chopping off the bottom of this recursion tree. The number of levels I had below this is log base 2 of cM, so I'm taking log base 2 of n and subtracting log base 2 of cM, and that's equivalent to log base 2 of n divided by cM.

The number of leaves I have is n over cM, since each leaf is theta of cM large. And the number of cache misses I need to incur to process a leaf is just theta of M over B, because I just need to load the input for that subproblem into cache, and then everything else fits in cache.

So for the top level, I'm incurring n over B cache misses. At the next level, I'm also incurring n over B cache misses. Same with the third level. And then for the leaves, I'm incurring M over B times n over cM cache misses. The n over cM is the number of leaves I have, and theta of M over B is the number of cache misses per leaf. And that also equals theta of n over B.

So overall, I multiply theta of n over B by the number of levels I have. The number of levels is log base 2 of n over cM, and I just got rid of the constant here, since it doesn't affect the asymptotic bound. So the number of cache misses is theta of n over B times log base 2 of-- or any base for the log-- of n over M.

So any questions on this analysis?

So I am saving a factor of B here in the first term, and that gives me a better cache complexity than just the work bound. But the M term is actually inside the log, in the denominator of its argument, and that doesn't actually help me that much. So let's look at how much we actually save. When n is much greater than M,
then log base 2 of n over M is approximately equal to log base 2 of n. So the M term doesn't contribute much to the asymptotic cost, and therefore, compared to the work, we're only saving a factor of B. When n is approximately equal to M, then log base 2 of n over M is constant, and we're saving a factor of B log n.

So we save more when the problem size is small. But for large enough problem sizes, we're only saving a factor of B. So does anyone think that we can do better than this? I've asked this question several times before, and the answer is always the same. Yes? It's a good answer.

So we're going to do this using multiway merging. Instead of just merging two sorted subarrays, we're going to merge together R sorted subarrays. So we're going to have R subarrays, each of size n over R, and these are sorted. And I'm going to merge them together using what's called a tournament tree.

Here's how the tournament tree works. I'm going to compare the heads of each pair of these subarrays and store the winner in a node of the tournament tree. And then after I do that, I compare those two nodes and get the minimum of the two. Eventually, after I do all of the comparisons, I'm just left with the minimum element at the root of the tournament tree, and then I can place that into my output.

So the first time I want to fill this tournament tree, it's going to take theta of R work, because there are R nodes in my tournament tree. So when I compare these two elements, the smaller one is 6. For these two, the smaller one is 2. And then I compare 2 and 6 and take the smaller one. And then on the other side, I have a 7 here, so I compare 2 and 7. 2 is smaller, so it appears at the root. And then I know that that's going to be my minimum element among the heads of all of the R subarrays that I'm passing in.

So the first time to generate this tournament tree takes theta of R work, because I have to fill in R nodes.
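Pausing on the mechanics, here is a hedged C sketch of such a tournament (winner) tree merge. The code is hypothetical: it assumes R is a power of two and that each input array ends in an INT_MAX sentinel; the refill loop at the bottom is the subject of the next point:

    // Hedged sketch of an R-way merge with a tournament tree. heads[i]
    // points at the current head of sorted array i; tree[j] records which
    // array currently wins subtree j. Leaves live at tree[R..2R-1], the
    // overall winner at tree[1].
    #include <limits.h>

    void multiway_merge(int **heads, int R, int *out, long n_total) {
      int tree[2 * R];  // C99 variable-length array; tree[0] unused
      // initial fill: theta(R) comparisons, built bottom-up
      for (int i = 0; i < R; i++) tree[R + i] = i;
      for (int i = R - 1; i >= 1; i--) {
        int l = tree[2 * i], r = tree[2 * i + 1];
        tree[i] = (*heads[l] <= *heads[r]) ? l : r;
      }
      for (long k = 0; k < n_total; k++) {
        int w = tree[1];          // overall winner
        out[k] = *heads[w]++;     // emit it and advance that array's head
        // replay only the path from leaf w to the root: theta(log R)
        for (int i = (R + w) / 2; i >= 1; i /= 2) {
          int l = tree[2 * i], r = tree[2 * i + 1];
          tree[i] = (*heads[l] <= *heads[r]) ? l : r;
        }
      }
    }

Since each tree node stores which array wins that subtree, extracting an element only recomputes the comparisons along one root-to-leaf path.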
But once I've generated this tournament tree, for all subsequent rounds, I only need to fill in the path from the element that won to the root of the tournament tree, because those are the only values that would have changed. So now I'm going to fill in this path here. And this only takes theta of log R work, because the height of the tournament tree is log base 2 of R.

So I'm going to fill this in. Now 14 goes here. 6 is the smaller of the two, and then 6 is the smaller of the two again. So my next element is 6. And I keep doing this until all of the elements from my R subarrays get put into the output.

The total work for merging is going to be theta of R for the first round, plus n log R for all the remaining rounds. And that's just equal to theta of n log R, because we're assuming that n is bigger than R here. Any questions on how the multiway merge works?

No? OK. So let's analyze the work of this multiway merge when used inside merge sort. The recurrence is going to be as follows. If n is 1, then we just do a constant amount of work. Otherwise, we have R subproblems of size n over R, and then we pay theta of n log R to do the multiway merge.

So here's the recursion tree. At the top level, we have a problem of size n, and then we're going to split into R subproblems of size n over R. And then we have to pay n log R work to merge the results of the recursive calls together. And then we keep doing this. It turns out that the work at each level sums up to n log base 2 of R. And the number of levels we have here is log base R of n, because each time we're branching by a factor of R. For the leaves, we have n leaves, and we just pay linear work for those, because we don't have to pay for a merge; we're not doing any more recursive calls. So the overall work is going to be theta of n log R times log base R of n, plus n for the leaves, but that's just a lower order term.
And if you work out the math, some of these terms are going to cancel out, and you just get theta of n log n for the work. So the work is the same as for the binary merge sort.

Let's now analyze the cache complexity. So let's assume that we have R less than cM over B, for a sufficiently small constant c less than or equal to 1. We're going to consider the R-way merging of contiguous arrays of total size n. If R is less than cM over B, then we can fit the entire tournament tree into cache, and we can also fit one block from each of the R subarrays into cache. And in that case, the total number of cache misses to do the multiway merge is just theta of n over B, because we just have to go over the n elements in our input arrays.

So the recurrence for the R-way merge sort is as follows. If n is less than cM, then it fits in cache, so the number of cache misses we pay is just theta of n over B. Otherwise, we have R subproblems of size n over R, and then we add theta of n over B to do the merge of the results of the subproblems. Any questions on the recurrence here?

Yes?

AUDIENCE: So how do we pick the value of R? Does it make it cache oblivious?

JULIAN SHUN: Good question. So we didn't pick the value of R, so this is not a cache oblivious algorithm. And we'll see what to choose for R in a couple of slides.

So let's analyze the cache complexity of this algorithm, again using the recursion tree analysis. At the top level, we're going to have R subproblems of size n over R, and we have to pay n over B cache misses to merge them. And it turns out that at each level, the number of cache misses we have to pay is n over B, if you work out the math. And the number of levels of this recursion tree is going to be log base R of n over cM, because we're going to stop recursing when our problem size is cM, and on every level of recursion, we're branching by a factor of R.
So our leaf size is cM, and therefore the number of leaves we have is n over cM. For each leaf, it's going to take theta of M over B cache misses to load it into cache, and afterwards, we can do everything in cache. Multiplying the number of leaves by the cost per leaf, we get theta of n over B cache misses. And therefore, the number of cache misses is n over B times the number of levels, and the number of levels is log base R of n over M.

So compared to the binary merge sort algorithm, here we actually have a factor of R in the base of the log. Before, the base of the log was just 2.

So now the question is what we're going to set R to be. So again, we have a voodoo parameter. This is not a cache oblivious algorithm. And we said that R has to be less than or equal to cM over B in order for the analysis to work out. So let's just make it as large as possible: let's set R equal to theta of M over B. And now we can see what the complexity works out to be.

So we have the tall cache assumption. We also have the fact that log base M of n over M is equal to theta of log n over log M. So the cache complexity is theta of n over B times log base M over B of n over M. But with the tall cache assumption, log base M over B is the same as log base M asymptotically. So that's how we get to the second line. And then you can rearrange some terms and cancel some terms out, and we'll end up with theta of n log n divided by B log M.

So we're saving a factor of theta of B log M compared to the work of the algorithm, whereas for the binary version of merge sort, we were only saving a factor of B for large enough inputs. So here we get another factor of log M in our savings.

So as I said, the binary one incurs n log n over B cache misses, whereas the multiway one incurs n log n over B log M. And as long as n is much greater than M, we're actually saving much more than with the binary version. So we're saving a factor of log M in cache misses. And let's just ignore the constants here and look at what log M can be in practice.
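Written out as one chain, the derivation just described is (a summary under the tall cache assumption, with n much greater than M):

$$ Q(n) \;=\; \Theta\Big(\frac{n}{B}\,\log_{M/B}\frac{n}{M}\Big) \;=\; \Theta\Big(\frac{n}{B}\,\log_{M}\frac{n}{M}\Big) \;=\; \Theta\Big(\frac{n}{B}\cdot\frac{\log n}{\log M}\Big) \;=\; \Theta\Big(\frac{n\log n}{B\log M}\Big). $$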
So here are some typical cache sizes. The L1 cache size is 32 kilobytes, so that's 2 to the 15th, and log base 2 of that is 15. So we get a 15x savings. And for the larger cache sizes, we get even larger savings.

So any questions on this?

The problem with this algorithm is that it's not cache oblivious. We have to tune the value of R for a particular machine. And even when we're running on the same machine, there could be other jobs running that contend for the cache.

It turns out that there are several cache oblivious sorting algorithms. The first one that was developed was from a paper coauthored by Charles Leiserson, and it's called funnel sort. The idea here is to recursively sort n to the 1/3 groups of n to the 2/3 elements each, and then merge the sorted groups with an n to the 1/3 funnel. More generally, this funnel is called a k-funnel, and it merges together k cubed elements from k sorted lists. The cost for doing this merge is shown here. And if you plug this into the recurrence, you'll get that the asymptotic number of cache misses is the same as that of the multiway merge sort algorithm, while being cache oblivious. And this bound is actually optimal.

So I'm not going to have time to talk about the details of the funnel sort algorithm, but I do have time to just show you a pretty picture of what the funnel looks like. So this is what a k-funnel looks like. It's recursively constructed. We have a bunch of square root of k funnels inside, and they're all connected to some buffers. They feed elements into the buffers, and then the buffers feed elements to the output square root of k funnel, which produces the output for the whole k-funnel. So this whole blue thing is the k-funnel, and the small green triangles are square root of k funnels. And the number of cache misses incurred by doing this merge is shown here, and it uses the tall cache assumption for the analysis.
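The costs referred to as "shown here" live on the slides. For completeness, the sorting bound known from the cache oblivious algorithms literature, which is what funnel sort achieves, is

$$ Q(n) \;=\; \Theta\Big(\frac{n}{B}\,\log_{M/B}\frac{n}{B}\Big), $$

which matches the multiway merge sort bound above under the tall cache assumption, and is optimal for sorting in this model.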
So it's a pretty cool algorithm, and we've posted a paper online that describes it. If you're interested in the details, I encourage you to read that paper.

There are also many other cache oblivious algorithms out there. There have been hundreds of papers on cache oblivious algorithms, and here are some of them. Some of these are also described in the paper; in fact, I think all of these are described in the paper we posted. There are also some cool cache oblivious data structures that have been developed, such as for B-trees, ordered-file maintenance, and priority queues. So there's a lot of literature on cache oblivious algorithms, and there's also a lot of material online if you're interested in learning more.