Today starts a two-lecture sequence on the topic of hashing, which is a really great technique that shows up in a lot of places. We're going to introduce it through a problem that comes up often in compilers, called the symbol table problem. The idea is that we have a table S holding n records. To be a little more explicit: each record x is usually a pointer to the actual data, so when we talk about the record x, we usually mean some pointer to the data. Within the record there is a key, key[x]; in some languages that's denoted x.key or x->key. And there's usually some additional data, called satellite data, which is carried around with the key. This is also true in sorting: usually you're sorting records, not individual keys. So the idea is that we have a bunch of operations that we would like to do on this table. We want to be able to insert an item x into the table, which essentially means that we update the table by adding the element x.
We want to be able to delete an item x from the table, removing it from the set. And we want to be able to search for a given key k: search returns the record x such that key[x] equals k, or nil if there's no such x. So we can insert items, delete them, and look up whether there's an item with a particular key. Notice that delete doesn't take a key; delete takes a record. So if you want to delete something with a particular key and you don't happen to have a pointer to it, you have to say, let me search for it, and then delete it. Whenever you have set operations that change the set, like insert and delete, we call it a dynamic set: these two operations make the set change over time. Sometimes you want to build a fixed data structure, a static set, where all you're going to do is look things up and so forth. But most often, it turns out that in programming we want the set to be dynamic. We want to be able to add elements to it, delete elements from it, and so forth. And there may be other operations that modify membership in the set.
The simplest implementation for this is actually often overlooked. I'm actually surprised how often people use more complicated data structures when this simple data structure will work. It's called a direct-access table. It doesn't always work; I'll give the conditions where it does. It works when the keys are drawn from a small universe. So suppose the keys are drawn from a universe U = {0, 1, ..., m-1} of m elements, and we're going to assume the keys are distinct. The way a direct-access table works is that you set up an array T[0 .. m-1] to represent the dynamic set S, such that T[k] is equal to x if x is in the set and its key is k, and nil otherwise. You just simply have an array, and if you have a record whose key is some value k, say the key is 15, then slot 15 holds the element if it's there, and nil if it's not in the set. Very simple data structure. For insertion, just go to that location and store the inserted record there. For deletion, just remove it from there. And to look something up, you just index into the array and see what's in that slot. Very simple data structure.
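The direct-access table just described can be sketched in a few lines. This is an illustrative sketch, not code from the lecture; the Record type and its field names are assumptions made for the example.

```python
from collections import namedtuple

# A record is a key plus its satellite data (hypothetical field names).
Record = namedtuple("Record", ["key", "data"])

class DirectAccessTable:
    """Represents a dynamic set S whose distinct keys are drawn from
    the universe U = {0, 1, ..., m-1}."""

    def __init__(self, m):
        self.T = [None] * m        # T[k] holds the record with key k, or None

    def insert(self, x):
        self.T[x.key] = x          # store x in slot key[x]

    def delete(self, x):
        self.T[x.key] = None       # note: delete takes a record, not a key

    def search(self, k):
        return self.T[k]           # the record with key k, or None
```

Each operation is a single array access, which is what makes this so fast; the price is the Theta(m) space for the array even when the set is tiny.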
All these operations, therefore, take constant time in the worst case. But as a practical matter, the places you can use this strategy are pretty limited. What's the limitation here? Yes: that's a limitation, surely, but there's actually a more severe one. m minus one could be a huge number. For example, suppose I want my set drawn from 64-bit values, so the things I'm storing in my table are 64-bit numbers. Maybe it's a small set; maybe we only have a few thousand of these elements. But they're drawn from 64-bit values. Then this strategy requires me to have an array that goes from zero to 2^64 minus one. How big is 2^64 minus one? It's big: about 18 quintillion. I mean, it's zillions, literally, because it's beyond the illions we normally use. Not a billion or a trillion; 18 quintillion. So that's a really big number. Or even worse, suppose the keys were drawn from character strings, people's names or something. This would be an awful way to have to represent them.
Because most of the table would be empty for any reasonable set of values you would want to keep. So the idea is that we want to keep the table small while still preserving some of these properties, and that's where hashing comes in. With hashing, we use a hash function h which maps the keys "randomly" into the slots of a table T. I'm putting "randomly" in quotes because it's not quite random. We call each of the array indices a slot, so you can think of it as a big table with slots where you're storing your values. So we may have a big universe of keys, call it U, and over here we have our table with m slots. And then we have the set S that we're actually going to try to represent, which is presumably a very small piece of the universe. What we'll do is take an element from the universe and apply the hash function to it, and the hash function gives us a particular slot; here's one element that might go up here.
We might have another one over here that goes down to there. And so we get the hash function to distribute the elements over the table. So what's the problem that's going to occur as we do this? So far, I've been a little bit lucky. What's the problem potentially going to be? Yes: two elements of S may get assigned the same value. I may have an element here that gets mapped to the same slot that somebody else has already been mapped to, and when this happens, we call it a collision. We're trying to map these things down into a small set, but we could get unlucky in our mapping; in particular, if we map enough of these elements, they're not going to fit. So when a record to be inserted maps to an already occupied slot, a collision occurs. OK, so it looks like this method's no good. But no, there's a pretty simple thing we can do. What should we do when two things map to the same slot? We want to represent the whole set, so we can't lose any data; we can't treat it like a cache.
A cache does use a hashing scheme, but in a cache you just kick the old item out, because you don't care about representing a set precisely. In the hash tables you use in programming, you often want to make sure that the values you have are exactly the values in the set, so you can tell whether something belongs to the set or not. So what's a good strategy here? Yes: create a list for each slot and put all the elements that hash to the same slot into that list. That's called resolving collisions by chaining. The idea is to link records in the same slot into a list. So, for example, imagine this is my hash table and this, say, is slot i. Several things that are elements of S may have been inserted into this table, and what I'll do is just link them together into a list, with a nil pointer at the end; each node holds the key and its satellite data. So if records with keys 49, 86, and 52 are all linked together in slot i, then the hash function applied to 49 has to equal the hash function applied to 86, which equals the hash function applied to 52, which equals what? There's only one thing I haven't mentioned.
i. Good. Even if you don't understand it, your quizmanship should tell you: he didn't mention i, so it's equal to i. The point is that when I hash 49, the hash of 49 produces some index in the table, say i, and every record that hashes to that same location is linked together into a list. Any questions about the mechanics of this? I hope that most of you have seen basic hashing in 6.001. They used to teach it there. OK, some people are saying maybe. Good. So let's analyze this strategy. We'll first do the worst case. What happens in the worst case with hashing? Raise your hand so that I can call on you. Yes: all the keys in S hash to the same slot. I happen to pick a set S where my hash function happens to map them all to the same value. That would be bad. So every key hashes to the same slot, and if that happens, then what I've essentially built is a fancy linked list for keeping this data structure.
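The chaining scheme can be sketched roughly as follows. This is an illustrative sketch, not the lecture's code: records are represented as (key, data) tuples, and the default hash function, k mod m, is just a stand-in. In the worst case just described, every record ends up in one long chain.

```python
class ChainedHashTable:
    """Each of the m slots holds a list (a chain) of the records
    that hash to that slot."""

    def __init__(self, m, h=None):
        self.m = m
        self.h = h or (lambda k: k % m)   # placeholder hash function
        self.slots = [[] for _ in range(m)]

    def insert(self, x):                  # x is a (key, data) record
        self.slots[self.h(x[0])].append(x)

    def search(self, k):
        for x in self.slots[self.h(k)]:   # walk the chain in slot h(k)
            if x[0] == k:
                return x
        return None                       # unsuccessful search

    def delete(self, x):                  # delete takes a record, not a key
        self.slots[self.h(x[0])].remove(x)
```

If keys 49, 86, and 52 all hash to the same slot i, the three records simply end up appended to slot i's chain, and search walks that chain front to back.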
All this stuff with the tables, the hashing, and so on is irrelevant; all that matters is that I have a long linked list. And then how long does an access take? How long does it take me to insert something or, more importantly, to search for something, to find out whether something's in there, in the worst case? Yes, it takes order n time, because we just have a linked list. So access takes Theta(n) time if, as we assume, the size of S is n. From a worst-case point of view, this doesn't look so attractive. And we will see data structures that in the worst case do very well on this problem; but they don't do as well as the average case of hashing. So let's analyze the average case. In order to analyze the average case, whenever you have averages, whenever you have probability, you have to state your assumptions. You have to say what the assumption is about the behavior of the system, and that's very hard to do here because you don't necessarily know what the hash function is. Well, let's imagine an ideal hash function. What should an ideal hash function do?
Yes: map the keys essentially at random to a slot. It should really distribute them randomly. So we call this the assumption of simple uniform hashing. What it means is that each key k in S is equally likely to be hashed to any slot in T, and we actually have to make an independence assumption as well: independent of where the other keys are hashed. So we're going to make this assumption, and it includes an independence assumption. Under it, if I have two keys, what are the odds that they're hashed to the same place, if I have, say, m slots? One over m. What are the odds that one key is hashed to slot 15? Also one over m, because the keys are being distributed uniformly. But in particular, the odds that two keys are hashed to the same slot is one over m. So let's define the load factor of a hash table with n keys and m slots to be alpha, which is equal to n over m; if you think about it, that's also just the average number of keys per slot. So alpha is the average number of keys per slot, and we call it the load factor of the table.
So on average, how many keys do I have per slot? Alpha. We'll look first at the expected unsuccessful search time. By an unsuccessful search, I mean I'm looking for something that's actually not in the table: I look for a key that's not there, so the search is going to return nil. What's the cost going to be? Well, I have to do a certain amount of work just to compute the hash function and so forth, so it's going to be at least order one, plus the cost of searching the list. On average, how much of the list do I have to search? Since the key isn't in the table, whichever slot I land in, I've got to search to the end of its list, right? So what's the average cost over all the slots in the table? Alpha. Right? Alpha is the average length of a list. So the expected unsuccessful search time is one plus alpha: the one is essentially the cost of doing the hash and accessing the slot, and the alpha is the cost of searching the list. So the expected unsuccessful search time is proportional essentially to alpha, and if alpha is bigger than one, it's order alpha.
If alpha is less than one, it's constant. So when is the expected search time order one? Simple questions, by the way; I only ask simple questions. Some guys ask hard questions. We'll get there in two steps. In terms of alpha, it's when alpha is O(1). Alpha doesn't have to be constant; it could be less than constant. Or equivalently, which is what you said, when n is O(m), which is to say, when the number of elements in the table is upper bounded by a constant times the number of slots m. Then the search cost is constant. So a lot of people will tell you, oh, a hash table runs in constant search time. That's actually wrong: it depends upon the load factor of the hash table. And people have made programming errors based on that misunderstanding of hash tables, because they had a hash table that was too small for the number of elements they were putting into it. Since the cost is one plus n over m, it actually grows with n.
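The one-plus-alpha behavior is easy to check empirically. The sketch below is not from the lecture: it simulates simple uniform hashing with a seeded random number generator (an idealization, since no concrete hash function is fixed here) and measures how many records an unsuccessful search must scan.

```python
import random

def avg_unsuccessful_scan(n, m, probes=10000, seed=0):
    """Hash n keys uniformly into m slots, then return the average
    chain length an unsuccessful search would have to scan."""
    rng = random.Random(seed)
    chain_len = [0] * m
    for _ in range(n):
        chain_len[rng.randrange(m)] += 1       # simple uniform hashing
    # An unsuccessful search lands in a uniformly random slot and must
    # scan that entire chain; average this cost over many probes.
    return sum(chain_len[rng.randrange(m)] for _ in range(probes)) / probes
```

With n = 10000 and m = 1000 the measured scan cost comes out close to alpha = 10, and shrinking m while holding n fixed drives it up proportionally, which is exactly the "table too small" mistake described above.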
So unless you make sure that m keeps up with n, this doesn't stay constant. Now, it turns out that for a successful search, the expected time is also one plus alpha. For that you need to do a little bit more mathematics, because you now have to condition on searching for the items that are in the table; but it turns out it's also one plus alpha, and you can read about that in the book. There's also a more rigorous proof of this: I've sort of glossed over the expectation arguments here and given a more intuitive proof, so both of those things you should look for in the book. So this is one reason why hashing is such a popular method: it basically lets you represent a dynamic set with order-one cost per operation, constant cost for inserting, deleting, and so forth, as long as the table you're keeping is not much smaller than the number of items you're putting into it. But it depends strongly upon this assumption of simple uniform hashing. And no matter what hash function you pick, I can always find a set of elements that that hash function is going to hash badly.
I could just generate a whole bunch of keys, look to see where the hash function takes them, and in the end pick a whole bunch that hash to the same place. We're actually going to see a way of countering that, but in practice, most programs that use hashing aren't really reverse-engineering the hash function, and so there are some very simple hash functions that seem to work fairly well in practice. So in choosing a hash function, we would like it to distribute keys uniformly into slots, and we would also like regularity in the key distribution not to affect uniformity. For example, a regularity that you often see is that all the keys being inserted are even numbers: somebody just happens to have that property in his data. In fact, on many machines, since they use byte pointers, if you're storing things that are, for example, indexes into arrays or something like that, the keys are typically numbers divisible by four, or by eight. So you don't want regularity in the key distribution to affect the fact that you're distributing keys uniformly over the slots.
Probably the most popular method used for a quick hash function is what's called the division method. The idea here is that you simply let h(k) = k mod m, where m is the number of slots in your table. This works reasonably well in practice, but you want to be careful about your choice of modulus; it turns out it doesn't work well for every possible table size you might pick. Fortunately, when you're building hash tables, you don't usually care about the specific size of the table: if you pick something around the size you want, that's probably fine, because it's not going to affect performance, so there's no need to pick a specific value. In particular, you don't want to pick m with a small divisor d. Let me illustrate why that's a bad idea for this particular hash function. For example, if d is two, in other words if m is an even number, and it turns out we're in the situation I just mentioned, where all the keys are even, what happens to my usage of the hash table?
So I have an even number of slots, and all the keys that the user of the hash table chooses to pick happen to be even numbers; what's going to happen in terms of my use of the hash table? Well, in the worst case, they could all land in the same slot no matter what hash function I pick. But here, let's say that, in fact, my hash function otherwise does a pretty good job of distributing, and I just have this property. What's going to hold no matter what set of keys I pick, as long as they're all even? I have an even number mod an even number. What does that say about the hash value? It's even, right? So what's going to happen to my use of the table? Yes: you're never going to hash anything to an odd-numbered slot. You've wasted half your slots. It doesn't matter what the key distribution is: as long as the keys are all even, the odd slots are never used. That's an extreme example. Here's another example: imagine that m is equal to 2^r, so that all its factors are small divisors.
In that case, if I think about taking k mod m, the hash doesn't even depend on all the bits of k. For example, suppose I have some binary number k, and r equals six, so m is two to the sixth. I take this binary number mod 2^6; what's the hash value? If I take something mod a power of two, what does it do? What's this number mod two? Zero, right: its last bit. What's it mod four? One-zero, its last two bits. What is it mod 2^6? Yes, it's just the last six bits; that's h(k). When you take something mod a power of two, all you're doing is taking its low-order bits: mod 2^r, you are taking its r low-order bits. So the hash function doesn't even depend on what's up in the high-order bits. That's a pretty bad situation, because a very common regularity you'll see in data is that all the low-order bits are the same and all the high-order bits differ, or vice versa. So this particular choice of m is not a very good one.
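Both pitfalls are easy to demonstrate. The sketch below is illustrative, not from the lecture: it checks that with an even m no even key ever reaches an odd slot, and that taking k mod a power of two just masks off the low-order bits. The particular key written in binary is an arbitrary example, not the number from the board.

```python
# Pitfall 1: even m plus all-even keys wastes every odd-numbered slot.
m = 64                                   # even m (worse still: a power of two)
even_keys = range(0, 1000, 2)            # a key set with a common regularity
used = {k % m for k in even_keys}
assert all(slot % 2 == 0 for slot in used)   # odd slots are never used

# Pitfall 2: k mod 2**r keeps only the r low-order bits of k,
# i.e. k % 2**r == k & (2**r - 1); the high-order bits never matter.
k = 0b1011001110110110                   # an arbitrary key, in binary
r = 6
assert k % (1 << r) == k & ((1 << r) - 1) == 0b110110
```

Flipping any bit of k above the low six leaves k mod 2^6 unchanged, so keys that differ only in their high-order bits all collide.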
413 00:37:20,000 --> 00:37:25,000 So, a good heuristic for this is to pick m to be a prime, 414 00:37:25,000 --> 00:37:31,000 not too close to a power of two or ten because those are the two 415 00:37:31,000 --> 00:37:36,000 common bases that you see regularity in the world. 416 00:37:36,000 --> 00:37:39,000 A prime is sometimes inconvenient, 417 00:37:39,000 --> 00:37:41,000 however. But generally, 418 00:37:41,000 --> 00:37:44,000 it's fairly easy to find primes. 419 00:37:44,000 --> 00:37:49,000 And there's a lot of nice theorems about primes. 420 00:37:49,000 --> 00:37:54,000 So, generally what you do, if you're just coding up 421 00:37:54,000 --> 00:38:00,000 something and you know what it is, you can pick a prime out of 422 00:38:00,000 --> 00:38:06,000 a textbook or look it up on the web or write a little program, 423 00:38:06,000 --> 00:38:11,000 or whatever, and pick a prime. 424 00:38:11,000 --> 00:38:15,000 Not too close to a power of two or ten, and it will probably 425 00:38:15,000 --> 00:38:18,000 work pretty well. 426 00:38:18,000 --> 00:38:20,000 So, this is a very popular 427 00:38:20,000 --> 00:38:24,000 method, the division method. OK, but the next method we are 428 00:38:24,000 --> 00:38:27,000 going to see is actually usually superior. 429 00:38:27,000 --> 00:38:32,000 The reason people do this is because they can write it in-line 430 00:38:32,000 --> 00:38:36,000 in their code. OK, but it's not usually the 431 00:38:36,000 --> 00:38:39,000 best method. And one of the reasons is because 432 00:38:39,000 --> 00:38:44,000 division tends to take a lot of 433 00:38:44,000 --> 00:38:48,000 cycles to compute on most computers compared with 434 00:38:48,000 --> 00:38:51,000 multiplication or addition. OK, in fact, 435 00:38:51,000 --> 00:38:55,000 it's usually done by taking several multiplications.
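As a quick sketch of the division method: the modulus 701 below is my own example of a prime not too close to a power of two or ten, not a value from the lecture, and the second half demonstrates the even-m pitfall from above.

```python
# Division method: h(k) = k mod m, with m a prime not too close to
# a power of two or ten. 701 is an illustrative choice, not prescribed.
def hash_division(k, m=701):
    return k % m

# The even-m pitfall: with m even and all keys even, the odd-numbered
# slots are never used, so half the table is wasted.
even_m = 100
slots_used = {k % even_m for k in range(0, 10_000, 2)}  # all-even keys
assert all(s % 2 == 0 for s in slots_used)
```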
436 00:38:55,000 --> 00:38:59,000 So, the next method is actually generally better, 437 00:38:59,000 --> 00:39:03,000 but none of the hash function methods that we are talking 438 00:39:03,000 --> 00:39:06,000 about today are, in some sense, 439 00:39:06,000 --> 00:39:12,000 provably good hash functions. OK, so for the multiplication 440 00:39:12,000 --> 00:39:18,000 method, the nice thing about it is that it essentially just 441 00:39:18,000 --> 00:39:22,000 requires a multiplication to do. And, for this, 442 00:39:22,000 --> 00:39:28,000 also, we are going to assume that the number of slots is a 443 00:39:28,000 --> 00:39:32,000 power of two, which is also often very convenient. 444 00:39:32,000 --> 00:39:37,000 OK, and for this, we're going to assume that the 445 00:39:37,000 --> 00:39:44,000 computer has w bit words. So, it would be convenient on a 446 00:39:44,000 --> 00:39:50,000 computer with 32 bits, or 64 bits, for example. 447 00:39:50,000 --> 00:39:54,000 OK, this would be very convenient. 448 00:39:54,000 --> 00:39:59,000 So, the hash function is the following. 449 00:39:59,000 --> 00:40:04,000 h of k is equal to A times k, mod two to the w, 450 00:40:04,000 --> 00:40:12,000 right shifted by w minus r. OK, so the key part of this is 451 00:40:12,000 --> 00:40:20,000 A, which is chosen to be an odd integer in the range between two 452 00:40:20,000 --> 00:40:24,000 to the w minus one and two to the w. 453 00:40:24,000 --> 00:40:31,000 OK, so it's an odd integer that is the full width of the computer 454 00:40:31,000 --> 00:40:36,000 word. OK, and what you do is multiply 455 00:40:36,000 --> 00:40:42,000 it by whatever your key is, by this funny integer. 456 00:40:42,000 --> 00:40:47,000 And, then take it mod two to the w. 457 00:40:47,000 --> 00:40:54,000 And then, you take the result and right shift it by this fixed 458 00:40:54,000 --> 00:41:00,000 amount, w minus r. So, this is a bit wise right 459 00:41:00,000 --> 00:41:06,000 shift.
OK, so let's look at what this 460 00:41:06,000 --> 00:41:12,000 does. But first, let me just give you 461 00:41:12,000 --> 00:41:21,000 a couple of tips on how you pick, or what you don't pick for 462 00:41:21,000 --> 00:41:27,000 A. So, you don't pick A too close 463 00:41:27,000 --> 00:41:34,000 to a power of two. And, it's generally a pretty 464 00:41:34,000 --> 00:41:42,000 fast method because multiplication mod two to the w 465 00:41:42,000 --> 00:41:49,000 is faster than division. And the other thing is that a 466 00:41:49,000 --> 00:41:52,000 right shift is fast, especially because this is a 467 00:41:52,000 --> 00:41:55,000 known shift. OK, you know it before you are 468 00:41:55,000 --> 00:41:59,000 computing the hash function. Both w and r are known in 469 00:41:59,000 --> 00:42:02,000 advance. So, the compiler can often do 470 00:42:02,000 --> 00:42:06,000 tricks there to make it go even faster. 471 00:42:06,000 --> 00:42:11,000 So, let's do an example to understand how this hash 472 00:42:11,000 --> 00:42:14,000 function works. So, we will have, 473 00:42:14,000 --> 00:42:18,000 in this case, the number of slots will be 474 00:42:18,000 --> 00:42:22,000 eight, which is two to the three. 475 00:42:22,000 --> 00:42:26,000 And, we'll have a bizarre word size of seven bits. 476 00:42:26,000 --> 00:42:33,000 Anybody know any seven bit computers out there? 477 00:42:33,000 --> 00:42:39,000 OK, well, here's one. So, A is our fixed value that's 478 00:42:39,000 --> 00:42:45,000 used for hashing all our keys. And, in this case, 479 00:42:45,000 --> 00:42:50,000 let's say it's 1011001. So, that's A. 480 00:42:50,000 --> 00:42:57,000 And, I take in some value for k that I'm going to multiply. 481 00:42:57,000 --> 00:43:04,000 So, k is going to be 1101011. So, that's my k. 482 00:43:04,000 --> 00:43:07,000 And, I multiply them. When I multiply the two, 483 00:43:07,000 --> 00:43:10,000 each of these is the full word width.
484 00:43:10,000 --> 00:43:14,000 You can view it as the full word width of the machine, 485 00:43:14,000 --> 00:43:16,000 in this case, seven bits. 486 00:43:16,000 --> 00:43:20,000 So, in general, this would be like a 32 bit 487 00:43:20,000 --> 00:43:24,000 number, and my key, I'd be multiplying two 32 bit 488 00:43:24,000 --> 00:43:28,000 numbers, for example. OK, and so, when I multiply 489 00:43:28,000 --> 00:43:33,000 that out, I get a 2w bit answer. So, when you multiply two w bit 490 00:43:33,000 --> 00:43:38,000 numbers, you get a 2w bit answer. 491 00:43:38,000 --> 00:43:44,000 In this case, it happens to be that number, 492 00:43:44,000 --> 00:43:49,000 OK? So, that's the product part, 493 00:43:49,000 --> 00:43:54,000 OK? And then we take it mod two to 494 00:43:54,000 --> 00:43:59,000 the w. Well, what mod two to the w 495 00:43:59,000 --> 00:44:09,000 says is that I'm just taking, ignoring the high order bits of 496 00:44:09,000 --> 00:44:16,000 this product. So, all of these are ignored, 497 00:44:16,000 --> 00:44:22,000 because, remember that if I take something, 498 00:44:22,000 --> 00:44:30,000 mod, a power of two, that's just the low order bits. 499 00:44:30,000 --> 00:44:33,000 So, I just get these low order bits as being the mod. 500 00:44:33,000 --> 00:44:38,000 And then, the right shift operation, and that's good also, 501 00:44:38,000 --> 00:44:42,000 by the way, because a lot of machines, when I multiply two 32 502 00:44:42,000 --> 00:44:46,000 bit numbers, they'll have an instruction that gives you just 503 00:44:46,000 --> 00:44:49,000 the 32 lower bits. And, it's usually an 504 00:44:49,000 --> 00:44:54,000 instruction that's faster than the instruction that gives you 505 00:44:54,000 --> 00:44:58,000 the full 64 bit answer. OK, so, that's very convenient. 
506 00:44:58,000 --> 00:45:01,000 And, the second thing is, then, that I want just the, 507 00:45:01,000 --> 00:45:04,000 in this case, three bits that are the high 508 00:45:04,000 --> 00:45:11,000 order bits of this word. So, this ends up being my H of 509 00:45:11,000 --> 00:45:13,000 k. And these end up getting 510 00:45:13,000 --> 00:45:18,000 removed by right shifting this word over. 511 00:45:18,000 --> 00:45:23,000 So, you just right shift that in, zeros come in, 512 00:45:23,000 --> 00:45:28,000 in a high order bit, and you end up getting that 513 00:45:28,000 --> 00:45:32,000 value of H of k. OK, so to understand what's 514 00:45:32,000 --> 00:45:36,000 going on here, why this is a pretty good 515 00:45:36,000 --> 00:45:43,000 method, or what's happening with it, you can imagine that one way 516 00:45:43,000 --> 00:45:52,000 to think about it is to think of A as being a binary fraction. 517 00:45:52,000 --> 00:45:55,000 So, imagine that the decimal point is here, 518 00:45:55,000 --> 00:46:00,000 sorry, the binary point, OK, the radix point is here. 519 00:46:00,000 --> 00:46:03,000 Then when I multiply things, I'm just taking, 520 00:46:03,000 --> 00:46:06,000 the binary point ends up being there. 521 00:46:06,000 --> 00:46:09,000 OK, so if you just imagine that conceptually, 522 00:46:09,000 --> 00:46:14,000 we don't have to actually put this into the hardware because 523 00:46:14,000 --> 00:46:16,000 we just do what the hardware does. 524 00:46:16,000 --> 00:46:20,000 But, I can imagine that it's there, and that it's here. 525 00:46:20,000 --> 00:46:25,000 And so, what I'm really taking is the fractional part of this 526 00:46:25,000 --> 00:46:29,000 product if I treat A as a fraction of a number. 527 00:46:29,000 --> 00:46:35,000 So, we can certainly look at that as sort of a modular wheel. 
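Pulling the whole example together, here is the multiplication method in Python using the lecture's toy values: m = 8 = 2^3 slots, a seven-bit word, A = 1011001 in binary, and k = 1101011 in binary. The final slot number is my own arithmetic from those values, not something stated on the board.

```python
# Multiplication method: h(k) = ((A * k) mod 2^w) >> (w - r),
# where m = 2^r is the number of slots and w is the machine word size.
w, r = 7, 3            # the "bizarre" 7-bit word; m = 2^3 = 8 slots
A = 0b1011001          # odd integer, the full word width

def hash_mul(k, A=A, w=w, r=r):
    low_word = (A * k) % (1 << w)   # keep the w low-order bits of the product
    return low_word >> (w - r)      # then take the r high-order bits of that word

k = 0b1101011
h = hash_mul(k)
assert 0 <= h < (1 << r)   # an r-bit slot number
print(h)                   # these particular values work out to slot 3
```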
528 00:46:35,000 --> 00:46:39,000 So, here I have a wheel where this is going to be, 529 00:46:39,000 --> 00:46:43,000 that I'm going to divide into eight parts, OK, 530 00:46:43,000 --> 00:46:48,000 where this point is zero. And then, I go around, 531 00:46:48,000 --> 00:46:52,000 and this point is then one. And, I go around, 532 00:46:52,000 --> 00:46:55,000 and this point is two, and so forth, 533 00:46:55,000 --> 00:47:01,000 so that all the integers, if I wrap it around this unit 534 00:47:01,000 --> 00:47:06,000 wheel, all the integers lined up at the zero point here, 535 00:47:06,000 --> 00:47:10,000 OK? And then, we can divide this 536 00:47:10,000 --> 00:47:14,000 into the fractional pieces. So, that's essentially the zero 537 00:47:14,000 --> 00:47:17,000 point. This is the one eighth, 538 00:47:17,000 --> 00:47:20,000 because we are dividing into eight, two, three, 539 00:47:20,000 --> 00:47:23,000 four, five, six, seven. 540 00:47:23,000 --> 00:47:28,000 So, if I have one times A, in this case, 541 00:47:28,000 --> 00:47:33,000 I'm basically saying, well, one times A, 542 00:47:33,000 --> 00:47:39,000 if I multiply, is basically going around to 543 00:47:39,000 --> 00:47:45,000 about there, five and a half I think, right, 544 00:47:45,000 --> 00:47:51,000 because one times A is about five and a half, 545 00:47:51,000 --> 00:47:59,000 OK, or 5.5 eighths, essentially. 546 00:47:59,000 --> 00:48:04,000 So, it takes me about to there. That's A. 547 00:48:04,000 --> 00:48:09,000 And, if I do two times A, that continues around, 548 00:48:09,000 --> 00:48:12,000 and takes me up to about, where? 549 00:48:12,000 --> 00:48:18,000 About, a little past three, about to there. 550 00:48:18,000 --> 00:48:22,000 So, that's two times A. OK, and three times A takes me, 551 00:48:22,000 --> 00:48:28,000 then, around to somewhere like about there.
552 00:48:28,000 --> 00:48:35,000 So, each time I add another A, it's taking me another A's 553 00:48:35,000 --> 00:48:41,000 distance around. And, the idea is that if A is, 554 00:48:41,000 --> 00:48:44,000 for example, odd, and it's not too close to 555 00:48:44,000 --> 00:48:48,000 a power of two, then what's happening is it's sort 556 00:48:48,000 --> 00:48:52,000 of throwing it into a different slot each time around. 557 00:48:52,000 --> 00:48:57,000 So, if I now go around, if I have k being very big, 558 00:48:57,000 --> 00:49:01,000 then k times A is going around k times. 559 00:49:01,000 --> 00:49:04,000 Where does it end up? It's like spinning a wheel of 560 00:49:04,000 --> 00:49:06,000 fortune or something. OK, it ends somewhere. 561 00:49:06,000 --> 00:49:09,000 OK, and so that's basically the notion, 562 00:49:09,000 --> 00:49:12,000 that it's going to end up in 563 00:49:12,000 --> 00:49:15,000 some place. So, you're basically looking 564 00:49:15,000 --> 00:49:18,000 at, where does kA end up? Well, it sort of whirls around, 565 00:49:18,000 --> 00:49:22,000 and ends up at some point. OK, and so that's why that 566 00:49:22,000 --> 00:49:26,000 tends to be a fairly good one. But, these are only heuristic 567 00:49:26,000 --> 00:49:29,000 methods for hashing, because for any hash function, 568 00:49:29,000 --> 00:49:32,000 you can always find a set of keys that's going to make it 569 00:49:32,000 --> 00:49:38,000 operate badly. So, the question is, 570 00:49:38,000 --> 00:49:44,000 well, what do you use in practice? 571 00:49:44,000 --> 00:49:52,000 OK, the second topic that I want to get to. So, 572 00:49:52,000 --> 00:50:03,000 we talked about resolving collisions by chaining. 573 00:50:03,000 --> 00:50:11,000 OK, there's another way of resolving collisions, 574 00:50:11,000 --> 00:50:19,000 which is often useful, which is resolving collisions 575 00:50:19,000 --> 00:50:25,000 by what's called open addressing.
576 00:50:25,000 --> 00:50:31,000 OK, and the idea, in this method, 577 00:50:31,000 --> 00:50:38,000 is we have no storage for links. 578 00:50:38,000 --> 00:50:43,000 So, when I resolve by chaining, I'd need an extra link field 579 00:50:43,000 --> 00:50:47,000 in each record in order to be able to do that. 580 00:50:47,000 --> 00:50:51,000 Now, that's not necessarily a big overhead, 581 00:50:51,000 --> 00:50:57,000 but for some applications, I don't want to have to touch 582 00:50:57,000 --> 00:51:00,000 those records at all. OK, and for those, 583 00:51:00,000 --> 00:51:07,000 open addressing is a useful way to resolve collisions. 584 00:51:07,000 --> 00:51:10,000 So, the idea, with open addressing, 585 00:51:10,000 --> 00:51:15,000 is if I hash to a given slot, and the slot is full, 586 00:51:15,000 --> 00:51:21,000 OK, what I do is I just hash again with a different hash 587 00:51:21,000 --> 00:51:25,000 function, with my second hash function. 588 00:51:25,000 --> 00:51:29,000 I check that slot. OK, if that slot is full, 589 00:51:29,000 --> 00:51:34,000 OK, then I hash again. And, I keep this probe 590 00:51:34,000 --> 00:51:39,000 sequence, which hopefully is a permutation so that I'm not 591 00:51:39,000 --> 00:51:43,000 going back and checking things that I've already checked until 592 00:51:43,000 --> 00:51:47,000 I find a place to put it. And, if I've got a good probe 593 00:51:47,000 --> 00:51:52,000 sequence, I will hopefully, then, find a place to put it 594 00:51:52,000 --> 00:51:55,000 fairly quickly. OK, and then to search, 595 00:51:55,000 --> 00:51:59,000 I just follow the same probe sequence. 596 00:51:59,000 --> 00:52:05,000 So, the idea, here, is we probe the table 597 00:52:05,000 --> 00:52:12,000 systematically until an empty slot is found, 598 00:52:12,000 --> 00:52:17,000 OK?
And so, we can extend that by 599 00:52:17,000 --> 00:52:25,000 looking as if the sequence of hash functions were, 600 00:52:25,000 --> 00:52:32,000 in fact, a hash function that took two arguments: 601 00:52:32,000 --> 00:52:40,000 a key and a probe step. In other words, 602 00:52:40,000 --> 00:52:44,000 is it the zeroth one, the first one, 603 00:52:44,000 --> 00:52:48,000 the second one, etc. 604 00:52:48,000 --> 00:52:55,000 So, it takes two arguments. So, H is then going to map our 605 00:52:55,000 --> 00:53:04,000 universe of keys cross our probe number into a slot. 606 00:53:04,000 --> 00:53:10,000 So, this is the universe of keys. 607 00:53:10,000 --> 00:53:20,000 This is the probe number. And, this is going to be the 608 00:53:20,000 --> 00:53:25,000 slot. Now, as I mentioned, 609 00:53:25,000 --> 00:53:34,000 the probe sequence should be a permutation. 610 00:53:34,000 --> 00:53:38,000 In other words, it should just be the numbers 611 00:53:38,000 --> 00:53:44,000 from zero to m minus one in some fairly random order. 612 00:53:44,000 --> 00:53:48,000 OK, it should just be rearranged. 613 00:53:48,000 --> 00:53:54,000 And the other thing about open addressing, which you don't 614 00:53:54,000 --> 00:54:01,000 have to worry about with chaining, is that the table may actually 615 00:54:01,000 --> 00:54:05,000 fill up. So, you have to have that the 616 00:54:05,000 --> 00:54:10,000 number of elements in the table is less than or equal to the 617 00:54:10,000 --> 00:54:16,000 table size, the number of slots, because the table may fill up. 618 00:54:16,000 --> 00:54:19,000 And, if it's full, you're going to probe 619 00:54:19,000 --> 00:54:23,000 everywhere. You are never going to get a 620 00:54:23,000 --> 00:54:27,000 place to put it. And, the final thing is that in 621 00:54:27,000 --> 00:54:32,000 this type of scheme, deletion is difficult. 622 00:54:32,000 --> 00:54:34,000 It's not impossible.
There are schemes for doing 623 00:54:34,000 --> 00:54:36,000 deletion. But, it's basically hard 624 00:54:36,000 --> 00:54:40,000 because the danger is that you remove a key out of the table, 625 00:54:40,000 --> 00:54:44,000 and now, somebody who's doing a probe sequence who would have 626 00:54:44,000 --> 00:54:47,000 hit that key and gone to find his element now finds that it's 627 00:54:47,000 --> 00:54:49,000 an empty slot. And he says, 628 00:54:49,000 --> 00:54:52,000 oh, the key I am looking for probably isn't there. 629 00:54:52,000 --> 00:54:54,000 OK, so you have that issue to deal with. 630 00:54:54,000 --> 00:54:57,000 So, you can delete things but keep them marked, 631 00:54:57,000 --> 00:55:00,000 and there's all kinds of schemes that people have for 632 00:55:00,000 --> 00:55:04,000 doing deletion. But it's difficult. 633 00:55:04,000 --> 00:55:07,000 It's messy compared to chaining, where you can just 634 00:55:07,000 --> 00:55:09,000 remove the element out of the chain. 635 00:55:09,000 --> 00:55:12,000 So, let's do an example -- 636 00:55:25,000 --> 00:55:37,000 -- just so that we make sure we're on the same page. 637 00:55:37,000 --> 00:55:45,000 So, we'll insert a key. k is 496. 638 00:55:45,000 --> 00:55:57,000 OK, so here's my table. And, I've got some values in 639 00:55:57,000 --> 00:56:06,000 it, 586, 133, 204, 481, etc. 640 00:56:06,000 --> 00:56:13,000 So, the table looks like that; the other places are empty. 641 00:56:13,000 --> 00:56:18,000 So, on my zero step, I probe H of 496, 642 00:56:18,000 --> 00:56:22,000 zero. OK, and let's say that takes me 643 00:56:22,000 --> 00:56:28,000 to the slot where there's 204. And so, I say, 644 00:56:28,000 --> 00:56:36,000 oh, there's something there. I have to probe again. 645 00:56:36,000 --> 00:56:41,000 So then, I probe H of 496, one. 646 00:56:41,000 --> 00:56:47,000 Maybe that maps me there, and I discover, 647 00:56:47,000 --> 00:56:55,000 oh, there's something there. 
So, now, I probe H of 496, 648 00:56:55,000 --> 00:57:02,000 two. Maybe that takes me to there. 649 00:57:02,000 --> 00:57:04,000 It's empty. So, if I'm doing a search, 650 00:57:04,000 --> 00:57:07,000 I report nil. If I'm doing in the insert, 651 00:57:07,000 --> 00:57:11,000 I put it there. And then, if I'm looking for 652 00:57:11,000 --> 00:57:15,000 that value, if I put it there, then when I'm looking, 653 00:57:15,000 --> 00:57:18,000 I go through exactly the same sequence. 654 00:57:18,000 --> 00:57:21,000 I'll find these things are busy, and then, 655 00:57:21,000 --> 00:57:26,000 eventually, I'll come up and discover the value. 656 00:57:26,000 --> 00:57:29,000 OK, and there are various heuristics that people use, 657 00:57:29,000 --> 00:57:34,000 as well, like keeping track of the longest probe sequence 658 00:57:34,000 --> 00:57:37,000 because there's no point in probing beyond the largest 659 00:57:37,000 --> 00:57:41,000 number of probes that need to be done globally to do an 660 00:57:41,000 --> 00:57:44,000 insertion. OK, so if it took me 5, 661 00:57:44,000 --> 00:57:48,000 5 is the maximum number of probes I ever did for an 662 00:57:48,000 --> 00:57:51,000 insertion. A search never has to look more 663 00:57:51,000 --> 00:57:54,000 than five, OK, and so sometimes hash tables 664 00:57:54,000 --> 00:57:58,000 will keep that auxiliary value so that it can quit rather than 665 00:57:58,000 --> 00:58:04,000 continuing to probe until it doesn't find something. 666 00:58:04,000 --> 00:58:13,000 OK, so, search is the same probe sequence. 667 00:58:13,000 --> 00:58:23,000 And, if it's successful, it finds the record. 668 00:58:23,000 --> 00:58:34,000 And, if it's unsuccessful, you find a nil. 669 00:58:34,000 --> 00:58:37,000 OK, so it's pretty straightforward. 
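The insert-and-search pattern just walked through might be sketched like this in Python. The probe function here is a stand-in of my own (a simple scan starting from k mod m), not the lecture's h(k, i), but the keys are the ones from the board example:

```python
# A minimal open-addressing table. probe(k, i) stands in for a real
# probe sequence h(k, i); any function enumerating a permutation of
# the slots for each key would do.
m = 8
table = [None] * m

def probe(k, i):
    return (k + i) % m          # placeholder probe sequence

def insert(k):
    for i in range(m):          # at most m probes: the table may fill up
        s = probe(k, i)
        if table[s] is None:
            table[s] = k
            return s
    raise RuntimeError("table is full")

def search(k):
    for i in range(m):
        s = probe(k, i)
        if table[s] is None:    # empty slot: k cannot be in the table
            return None
        if table[s] == k:       # found the record
            return s
    return None

for key in (586, 133, 204, 481, 496):   # values from the board example
    insert(key)
assert search(496) is not None
assert search(999) is None              # unsuccessful search finds a nil
```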
670 00:58:37,000 --> 00:58:42,000 So, once again, as with just hash functions to 671 00:58:42,000 --> 00:58:49,000 begin with, there are a lot of ideas about how you should form 672 00:58:49,000 --> 00:58:55,000 a probe sequence, ways of doing this effectively. 673 00:59:06,000 --> 00:59:14,000 OK, so the simplest one is called linear probing, 674 00:59:14,000 --> 00:59:22,000 and what you do there is you have H of k comma i. 675 00:59:22,000 --> 00:59:33,000 You just make that be some H prime of k, zero, plus i, mod m. 676 00:59:33,000 --> 00:59:36,000 Sorry, no prime there. OK, so what happens is, 677 00:59:36,000 --> 00:59:41,000 so, the idea here is that all you are doing on the i'th probe 678 00:59:41,000 --> 00:59:44,000 is, on the zeroth probe, you look at H of k, zero. 679 00:59:44,000 --> 00:59:48,000 On probe one, you just look at the slot after 680 00:59:48,000 --> 00:59:50,000 that. Probe two, you look at the slot 681 00:59:50,000 --> 00:59:53,000 after that. So, you're just simply, 682 00:59:53,000 --> 00:59:56,000 rather than sort of jumping around like this, 683 00:59:56,000 --> 01:00:01,542 you probe there and then just find the next one that will fit 684 01:00:01,542 --> 01:00:04,785 in. OK, so you just scan down mod 685 01:00:04,785 --> 01:00:06,509 m. So, if you hit the bottom, 686 01:00:06,509 --> 01:00:08,848 you go to the top. OK, so the i'th one. 687 01:00:08,848 --> 01:00:12,050 So, that's fairly easy to do because you don't have to 688 01:00:12,050 --> 01:00:14,574 recompute a full hash function each time. 689 01:00:14,574 --> 01:00:18,083 All you have to do is add one each time you go because the 690 01:00:18,083 --> 01:00:21,531 difference between this and the previous one is just one. 691 01:00:21,531 --> 01:00:24,794 OK, so you just go down. Now, the problem with that is 692 01:00:24,794 --> 01:00:27,195 that you get a phenomenon of clustering.
693 01:00:27,195 --> 01:00:30,458 If you get a few things in a given area, then suddenly 694 01:00:30,458 --> 01:00:33,906 everything, everybody has to keep searching to the end of 695 01:00:33,906 --> 01:00:38,277 those things. OK, so that turns out not to be 696 01:00:38,277 --> 01:00:42,246 one of the better schemes, although it's not bad if you 697 01:00:42,246 --> 01:00:45,258 just need to do something quick and dirty. 698 01:00:45,258 --> 01:00:49,594 So, it suffers from primary clustering, where regions of the 699 01:00:49,594 --> 01:00:53,635 hash table get very full. And then, anything that hashes 700 01:00:53,635 --> 01:00:57,750 into that region has to look through all the stuff that's 701 01:00:57,750 --> 01:01:02,030 there. OK, so: long runs of filled 702 01:01:02,030 --> 01:01:05,846 slots. OK, there's also things like 703 01:01:05,846 --> 01:01:11,459 quadratic probing, where you basically make this 704 01:01:11,459 --> 01:01:17,744 be, instead of adding one each time, you add i each time. 705 01:01:17,744 --> 01:01:23,581 OK, but probably the most effective popular scheme is 706 01:01:23,581 --> 01:01:29,867 what's called double hashing. And, you can do statistical 707 01:01:29,867 --> 01:01:35,715 studies. People have done statistical 708 01:01:35,715 --> 01:01:41,819 studies to show that this is a good scheme, OK, 709 01:01:41,819 --> 01:01:48,056 where you let H of k, i, let me do it below here 710 01:01:48,056 --> 01:01:54,957 because I have room for it. So, H of k, i is equal to 711 01:01:54,957 --> 01:02:03,467 H_1 of k plus i times H_2 of k, all mod m. 712 01:02:03,467 --> 01:02:07,157 So, you have two hash functions, 713 01:02:07,157 --> 01:02:13,085 H_1 of k and H_2 of k.
OK, so you compute the two hash 714 01:02:13,085 --> 01:02:19,907 functions, and what you do is you start by just using H_1 of k 715 01:02:19,907 --> 01:02:23,486 for the zero probe, because here, 716 01:02:23,486 --> 01:02:26,282 i, then, will be zero. OK. 717 01:02:26,282 --> 01:02:34,000 Then, for the probe number one, OK, you just add H_2 of k. 718 01:02:34,000 --> 01:02:37,466 For probe number two, you just add that hash function 719 01:02:37,466 --> 01:02:40,266 amount again. You just keep adding H_2 of k 720 01:02:40,266 --> 01:02:42,533 for each successive probe you make. 721 01:02:42,533 --> 01:02:45,933 So, it's fairly easy; you compute two hash functions 722 01:02:45,933 --> 01:02:48,599 up front, OK, or you can delay the second 723 01:02:48,599 --> 01:02:50,400 one, in case you don't need it. But basically, 724 01:02:50,400 --> 01:02:54,000 you compute two up front, and then you just keep adding 725 01:02:54,000 --> 01:02:57,066 the second one in. You start at the location of 726 01:02:57,066 --> 01:03:00,066 the first one, and keep adding the second one, 727 01:03:00,066 --> 01:03:04,000 mod m, to determine your probe sequences. 728 01:03:04,000 --> 01:03:07,757 So, this is an excellent method. 729 01:03:07,757 --> 01:03:14,181 OK, it does a fine job, and you usually pick m to be a 730 01:03:14,181 --> 01:03:19,393 power of two here, OK, so that you're using, 731 01:03:19,393 --> 01:03:25,939 usually people use this with the multiplication method, 732 01:03:25,939 --> 01:03:30,787 for example, so that m is a power of two, 733 01:03:30,787 --> 01:03:36,000 and H_2 of k you force to be odd. 734 01:03:36,000 --> 01:03:40,578 OK, so we don't use an even value there, because otherwise 735 01:03:40,578 --> 01:03:44,210 for any particular key, you'd be skipping over slots. 736 01:03:44,210 --> 01:03:49,105 Once again, you would have the problem that everything could be 737 01:03:49,105 --> 01:03:53,526 even, or everything could be odd as you're going through.
738 01:03:53,526 --> 01:03:57,788 But, if you make H_2 of k odd, and m is a power of two, 739 01:03:57,788 --> 01:04:00,631 you are guaranteed to hit every slot. 740 01:04:00,631 --> 01:04:03,157 OK, so let's analyze this scheme. 741 01:04:03,157 --> 01:04:09,000 This turns out to be a pretty interesting scheme to analyze. 742 01:04:09,000 --> 01:04:14,080 It's got some nice math in it. So, once again, 743 01:04:14,080 --> 01:04:18,032 in the worst case, hashing is lousy. 744 01:04:18,032 --> 01:04:23,000 So, we're going to analyze average case. 745 01:04:35,000 --> 01:04:45,615 OK, and for this, we need a little bit stronger 746 01:04:45,615 --> 01:04:59,230 assumption than for chaining. And, we call it the assumption 747 01:04:59,230 --> 01:05:09,846 of uniform hashing, which says that each key is 748 01:05:09,846 --> 01:05:19,769 equally likely, OK, to have any one of the m 749 01:05:19,769 --> 01:05:32,000 factorial permutations as its probe sequence, 750 01:05:32,000 --> 01:05:34,000 independent of other keys. 751 01:05:45,000 --> 01:05:55,291 And, the theorem we're going to prove is that the expected 752 01:05:55,291 --> 01:06:03,777 number of probes is, at most, one over one minus 753 01:06:03,777 --> 01:06:11,000 alpha if alpha is less than one, OK, 754 01:06:11,000 --> 01:06:17,000 that is, if the number of keys in the table is less than number 755 01:06:17,000 --> 01:06:20,870 of slots. OK, so we're going to show that 756 01:06:20,870 --> 01:06:26,000 the number of probes is one over one minus alpha. 757 01:06:34,000 --> 01:06:38,700 So, alpha is the load factor, and of course, 758 01:06:38,700 --> 01:06:44,057 for open addressing, we want the load factor to be 759 01:06:44,057 --> 01:06:49,852 less than one because if we have more keys than slots, 760 01:06:49,852 --> 01:06:56,520 open addressing simply doesn't work, OK, because you've got to 761 01:06:56,520 --> 01:07:00,784 find a place for every key in the table. 
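A sketch of that guarantee in Python: h1 and h2 below are placeholder hash functions of my own, chosen only so that h2 is always odd while m is a power of two, which is what makes the probe sequence hit every slot.

```python
# Double hashing: h(k, i) = (h1(k) + i * h2(k)) mod m, with m a power
# of two and h2(k) forced to be odd so the sequence hits every slot.
m = 8                                  # power of two

def h1(k):
    return k % m                       # placeholder first hash

def h2(k):
    return (k % (m - 1)) | 1           # placeholder second hash, forced odd

def probe_sequence(k):
    return [(h1(k) + i * h2(k)) % m for i in range(m)]

# Because gcd(h2(k), m) = 1, the m probes visit all m slots exactly once:
for k in (496, 586, 133):
    assert sorted(probe_sequence(k)) == list(range(m))
```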
762 01:07:00,784 --> 01:07:05,485 So, the proof, we'll look at an unsuccessful 763 01:07:05,485 --> 01:07:12,908 search, OK? So, the first thing is that one 764 01:07:12,908 --> 01:07:21,141 probe is always necessary. OK, so if I have n over m, 765 01:07:21,141 --> 01:07:29,533 sorry, if I have n items stored in m slots, what's the 766 01:07:29,533 --> 01:07:38,875 probability that when I do that probe I get a collision with 767 01:07:38,875 --> 01:07:46,000 something that's already in the table? 768 01:07:46,000 --> 01:07:51,526 What's the probability that I get a collision? 769 01:07:51,526 --> 01:07:53,982 Yeah? Yeah, n over m, 770 01:07:53,982 --> 01:07:57,298 right? So, with probability, 771 01:07:57,298 --> 01:08:04,052 n over m, we have a collision because my table has got n 772 01:08:04,052 --> 01:08:08,487 things in there. I'm hashing, 773 01:08:08,487 --> 01:08:15,551 at random, to one of them. OK, so, what are the odds I hit 774 01:08:15,551 --> 01:08:21,376 something, n over m? And then, a second probe is 775 01:08:21,376 --> 01:08:24,102 necessary. OK, so then, 776 01:08:24,102 --> 01:08:30,175 I do a second probe. And, with what probability on 777 01:08:30,175 --> 01:08:36,000 the second probe do I get a collision? 778 01:08:36,000 --> 01:08:40,158 So, we're going to make the assumption of uniform hashing. 779 01:08:40,158 --> 01:08:44,536 Each key is equally likely to have any one of the m factorial 780 01:08:44,536 --> 01:08:47,017 permutations as its probe sequence. 781 01:08:47,017 --> 01:08:50,810 So, what is the probability that on the second probe, 782 01:08:50,810 --> 01:08:53,000 OK, I get a collision? 783 01:09:10,000 --> 01:09:14,778 Yeah? If it's a permutation, 784 01:09:14,778 --> 01:09:21,504 you're not, right? Something like that. 785 01:09:21,504 --> 01:09:30,000 What is it exactly? So, that's the question. 
786 01:09:30,000 --> 01:09:35,478 OK, so you are not going to hit the same slot because it's going 787 01:09:35,478 --> 01:09:37,652 to be a permutation. Yeah? 788 01:09:37,652 --> 01:09:41,913 That's exactly right. n minus one over m minus one, 789 01:09:41,913 --> 01:09:45,652 because I'm now, I've essentially eliminated 790 01:09:45,652 --> 01:09:48,694 that slot that I hit the first time. 791 01:09:48,694 --> 01:09:52,694 And, there was a key there. 792 01:09:52,694 --> 01:09:56,347 So, now I'm essentially looking, at random, 793 01:09:56,347 --> 01:10:00,782 into the remaining m minus one slots, where there are 794 01:10:00,782 --> 01:10:06,000 aggregately n minus one keys in those slots. 795 01:10:06,000 --> 01:10:11,306 OK, everybody got that? OK, so with that probability, 796 01:10:11,306 --> 01:10:16,204 I get a collision. That means that a third 797 01:10:16,204 --> 01:10:18,142 probe is necessary, OK? 798 01:10:18,142 --> 01:10:23,346 And, we keep going on. OK, so what is it going to be 799 01:10:23,346 --> 01:10:27,836 the next time? Yeah, it's going to be n minus 800 01:10:27,836 --> 01:10:33,939 two over m minus two. So, let's note, 801 01:10:33,939 --> 01:10:44,716 OK, that n minus i over m minus i is less than n over m, 802 01:10:44,716 --> 01:10:49,027 which equals alpha, OK? 803 01:10:49,027 --> 01:11:00,000 So, n minus i over m minus i is less than n over m. 804 01:11:00,000 --> 01:11:05,505 And, the way you can sort of reason that is that if n is less 805 01:11:05,505 --> 01:11:11,287 than m, I'm subtracting a larger fraction of n when I subtract i 806 01:11:11,287 --> 01:11:14,682 than I am subtracting a fraction of m. 807 01:11:14,682 --> 01:11:18,720 OK, so therefore, n minus i over m minus i is 808 01:11:18,720 --> 01:11:23,858 going to be less than n over m. OK, or you can do the 809 01:11:23,858 --> 01:11:27,070 algebra.
I think it's always helpful 810 01:11:27,070 --> 01:11:31,842 when you do algebra to sort of think about it 811 01:11:31,842 --> 01:11:36,705 qualitatively as well as quantitatively, to see what's 812 01:11:36,705 --> 01:11:42,119 going on. So, the expected number of 813 01:11:42,119 --> 01:11:46,559 probes is, then, going to be equal to, 814 01:11:46,559 --> 01:11:53,399 it's going to be equal to, because we're going to need some 815 01:11:53,399 --> 01:12:00,600 space, well, we have one which is forced because we've got to 816 01:12:00,600 --> 01:12:09,308 do one probe, plus with probability n over m, 817 01:12:09,308 --> 01:12:21,313 I have to do another probe, plus with probability n minus one 818 01:12:21,313 --> 01:12:33,930 over m minus one I have to do another probe, up until I do one plus one 819 01:12:33,930 --> 01:12:40,276 over m minus n plus one. OK, so each one is cascading 820 01:12:40,276 --> 01:12:42,553 what's happened. In the book, 821 01:12:42,553 --> 01:12:47,432 there is a more rigorous proof of this using indicator random 822 01:12:47,432 --> 01:12:50,767 variables. I'm going to give you the short 823 01:12:50,767 --> 01:12:52,800 version. OK, so basically, 824 01:12:52,800 --> 01:12:56,784 this is my first probe. With probability n over m, 825 01:12:56,784 --> 01:13:01,338 I had to do a second one. And, the result of that is that 826 01:13:01,338 --> 01:13:04,997 with probability n minus one over m minus one, 827 01:13:04,997 --> 01:13:08,982 I have to do another. And, with probability n minus 828 01:13:08,982 --> 01:13:12,397 two over m minus two, I have to do another, 829 01:13:12,397 --> 01:13:18,857 and so forth. So, that's how many probes I'm 830 01:13:18,857 --> 01:13:25,542 going to end up doing. So, this is less than or equal 831 01:13:25,542 --> 01:13:31,457 to one plus alpha.
It's one plus alpha times, 832 01:13:31,457 --> 01:13:39,042 nested, one plus alpha times one plus alpha, OK, just using the fact 833 01:13:39,042 --> 01:13:45,536 that I had here. OK, and that is less than or 834 01:13:45,536 --> 01:13:51,347 equal to one plus, I just multiply through here, 835 01:13:51,347 --> 01:13:57,410 alpha plus alpha squared plus alpha cubed, and so on. 836 01:13:57,410 --> 01:14:01,957 I can just take that out to infinity. 837 01:14:01,957 --> 01:14:10,206 It's going to bound this. OK, does everybody see the math 838 01:14:10,206 --> 01:14:14,954 there? OK, and that is just the sum, 839 01:14:14,954 --> 01:14:20,653 i equals zero to infinity, of alpha to the i, 840 01:14:20,653 --> 01:14:28,929 which is equal to one over one minus alpha using your familiar 841 01:14:28,929 --> 01:14:34,615 geometric series bound. OK, and there's also, 842 01:14:34,615 --> 01:14:38,076 in the textbook, an analysis of the successful 843 01:14:38,076 --> 01:14:41,230 search, which, once again, is a little bit 844 01:14:41,230 --> 01:14:45,384 more technical because you have to worry about what the 845 01:14:45,384 --> 01:14:50,000 distribution is that you happen to have in the table when you 846 01:14:50,000 --> 01:14:54,230 are searching for something that's already in the table. 847 01:14:54,230 --> 01:14:58,538 But, it turns out it's also bounded by one over one minus 848 01:14:58,538 --> 01:15:04,920 alpha. So, let's just look to see what 849 01:15:04,920 --> 01:15:11,269 that means. So, if alpha is a constant less 850 01:15:11,269 --> 01:15:18,253 than one, it implies that it takes order 851 01:15:18,253 --> 01:15:24,761 one probes. OK, so if alpha is a constant, 852 01:15:24,761 --> 01:15:33,621 it takes order one probes. OK, but it's helpful to 853 01:15:33,621 --> 01:15:40,706 understand what's happening with the constant.
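[The cascaded expectation and the geometric-series bound can be compared numerically; this Python sketch is an editor's illustration under the uniform hashing assumption, with names of my own choosing.]

```python
def expected_probes(n, m):
    """Expected probes in an unsuccessful search under uniform hashing,
    evaluated from the innermost term of the cascade outward:
    1 + n/m * (1 + (n-1)/(m-1) * (... * (1 + 1/(m-n+1)) ...))."""
    e = 1.0
    for i in reversed(range(n)):         # i = n-1 down to 0
        e = 1.0 + (n - i) / (m - i) * e
    return e

n, m = 900, 1000                         # load factor alpha = 0.9
bound = 1.0 / (1.0 - n / m)              # geometric bound, here 10 probes
# the exact cascaded expectation never exceeds 1/(1 - alpha)
print(expected_probes(n, m) <= bound)    # True
```

For a tiny sanity check: with one key in two slots, the first probe collides with probability 1/2 and the second always succeeds, so the expectation is 1.5, below the bound of 2.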
854 01:15:40,706 --> 01:15:47,161 So, for example, if the table is 50% full, 855 01:15:47,161 --> 01:15:54,719 so alpha is a half, what's the expected number of 856 01:15:54,719 --> 01:16:03,378 probes by this analysis? Two, because one over one minus 857 01:16:03,378 --> 01:16:11,531 a half is two. If I let the table fill up to 858 01:16:11,531 --> 01:16:17,937 90%, how many probes do I need on average? 859 01:16:17,937 --> 01:16:22,781 Ten. So, you can see that as you 860 01:16:22,781 --> 01:16:30,437 fill up the table, the cost is going up dramatically, 861 01:16:30,437 --> 01:16:33,955 OK? And so, typically, 862 01:16:33,955 --> 01:16:37,865 you don't let the table get too full. 863 01:16:37,865 --> 01:16:43,297 OK, you don't want to be pushing 99.9% utilization. 864 01:16:43,297 --> 01:16:49,706 Oh, I got this great hash table that's got full utilization. 865 01:16:49,706 --> 01:16:52,964 It's like, yeah, and it's slow. 866 01:16:52,964 --> 01:16:55,571 It's really, really slow, 867 01:16:55,571 --> 01:17:02,415 OK, because as alpha approaches one, the time is approaching 868 01:17:02,415 --> 01:17:06,000 essentially m, or n. 869 01:17:06,000 --> 01:17:08,050 Good. So, next time, 870 01:17:08,050 --> 01:17:14,419 we are going to address head-on what is, I think, 871 01:17:14,419 --> 01:17:18,737 one of the most interesting ideas in algorithms. 872 01:17:18,737 --> 01:17:25,213 We are going to talk about how you solve this problem that no 873 01:17:25,213 --> 01:17:31,798 matter what hash function you pick, there's a bad set of keys. 874 01:17:31,798 --> 01:17:38,058 OK, so next time we're going to show that there are ways of 875 01:17:38,058 --> 01:17:42,592 confronting that problem, very clever ways. 876 01:17:42,592 --> 01:17:45,000 And we'll use a lot of math for it, so it will be a really fun lecture.
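[The load-factor figures quoted in the lecture, about two probes at 50% full and about ten at 90%, come straight from the one over one minus alpha bound; this quick Python check is an editor's illustration.]

```python
def probe_bound(alpha):
    """Upper bound on expected probes for an unsuccessful search in an
    open-addressed table with load factor alpha, valid for alpha < 1."""
    assert 0 <= alpha < 1
    return 1.0 / (1.0 - alpha)

print(probe_bound(0.5))    # 2.0 -- half full: about two probes
print(probe_bound(0.9))    # about ten probes when 90% full
print(probe_bound(0.999))  # roughly 1000 -- why 99.9% utilization is slow
```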