Today starts a two-lecture sequence on the topic of hashing, which is a really great technique that shows up in a lot of places. We're going to introduce it through a problem that comes up often in compilers, called the symbol table problem. The idea is that we have a table S holding n records. To be a little more explicit: each record x is usually a pointer to the actual data, so when we talk about the record x, we usually mean some pointer to the data. Within the record there is a key, key[x]; in some languages that's denoted x.key or x->key. And there's usually some additional data, called satellite data, which is carried around with the key. This is also true in sorting: usually you're sorting records, not individual keys. So the idea is that we have a bunch of operations that we would like to do on this table. We want to be able to insert an item x into the table, which essentially means that we update the table by adding the element x.
We want to be able to delete an item x from the table, removing it from the set. And we want to be able to search for a given key k: search returns the record x such that key[x] equals k, or nil if there's no such x. So we can insert items, delete them, and look up whether there's an item with a particular key. Notice that delete doesn't take a key; delete takes a record. So if you want to delete something with a particular key and you don't happen to have a pointer to it, you have to say, let me search for it, and then delete it. Whenever you have set operations that change the set, like insert and delete, we call it a dynamic set: these two operations make the set change over time. Sometimes you want to build a fixed data structure, a static set, where all you're going to do is look things up and so forth. But most often, it turns out that in programming we want the set to be dynamic. We want to be able to add elements to it, delete elements from it, and so forth. And there may be other operations that modify membership in the set.
The simplest implementation for this is actually often overlooked. I'm actually surprised how often people use more complicated data structures when this simple data structure will work. It's called a direct-access table. It doesn't always work; I'll give the conditions where it does. It works when the keys are drawn from a small universe. So suppose the keys are drawn from a universe U = {0, 1, ..., m-1} of m elements, and we're going to assume the keys are distinct. The way a direct-access table works is that you set up an array T[0 .. m-1] to represent the dynamic set S, such that T[k] is equal to x if x is in the set and its key is k, and nil otherwise. You just simply have an array, and if you have a record whose key is some value k, say the key is 15, then slot 15 holds the element if it's there, and nil if it's not in the set. Very simple data structure. For insertion, just go to that location and store the inserted record there. For deletion, just remove it from there. And to look something up, you just index into the array and see what's in that slot. Very simple data structure.
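The direct-access table just described can be sketched in a few lines. This is an illustrative sketch, not code from the lecture; the Record type and its field names are assumptions made for the example.

```python
from collections import namedtuple

# A record is a key plus its satellite data (hypothetical field names).
Record = namedtuple("Record", ["key", "data"])

class DirectAccessTable:
    """Represents a dynamic set S whose distinct keys are drawn from
    the universe U = {0, 1, ..., m-1}."""

    def __init__(self, m):
        self.T = [None] * m        # T[k] holds the record with key k, or None

    def insert(self, x):
        self.T[x.key] = x          # store x in slot key[x]

    def delete(self, x):
        self.T[x.key] = None       # note: delete takes a record, not a key

    def search(self, k):
        return self.T[k]           # the record with key k, or None
```

Each operation is a single array access, which is what makes this so fast; the price is the Theta(m) space for the array even when the set is tiny.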
All these operations, therefore, take constant time in the worst case. But as a practical matter, the places you can use this strategy are pretty limited. What's the limitation here? Yes: that's a limitation, surely, but there's actually a more severe one. m minus one could be a huge number. For example, suppose I want my set drawn from 64-bit values, so the things I'm storing in my table are 64-bit numbers. Maybe it's a small set; maybe we only have a few thousand of these elements. But they're drawn from 64-bit values. Then this strategy requires me to have an array that goes from zero to 2^64 minus one. How big is 2^64 minus one? It's big: about 18 quintillion. I mean, it's zillions, literally, because it's beyond the illions we normally use. Not a billion or a trillion; 18 quintillion. So that's a really big number. Or even worse, suppose the keys were drawn from character strings, people's names or something. This would be an awful way to have to represent them.
Because most of the table would be empty for any reasonable set of values you would want to keep. So the idea is that we want to keep the table small while still preserving some of these properties, and that's where hashing comes in. With hashing, we use a hash function h which maps the keys "randomly" into the slots of a table T. I'm putting "randomly" in quotes because it's not quite random. We call each of the array indices a slot, so you can think of it as a big table with slots where you're storing your values. So we may have a big universe of keys, call it U, and over here we have our table with m slots. And then we have the set S that we're actually going to try to represent, which is presumably a very small piece of the universe. What we'll do is take an element from the universe and apply the hash function to it, and the hash function gives us a particular slot; here's one element that might go up here.
We might have another one over here that goes down to there. And so we get the hash function to distribute the elements over the table. So what's the problem that's going to occur as we do this? So far, I've been a little bit lucky. What's the problem potentially going to be? Yes: two elements of S may get assigned the same value. I may have an element here that gets mapped to the same slot that somebody else has already been mapped to, and when this happens, we call it a collision. We're trying to map these things down into a small set, but we could get unlucky in our mapping; in particular, if we map enough of these elements, they're not going to fit. So when a record to be inserted maps to an already occupied slot, a collision occurs. OK, so it looks like this method's no good. But no, there's a pretty simple thing we can do. What should we do when two things map to the same slot? We want to represent the whole set, so we can't lose any data; we can't treat it like a cache.
A cache does use a hashing scheme, but in a cache you just kick the old item out, because you don't care about representing a set precisely. In the hash tables you use in programming, you often want to make sure that the values you have are exactly the values in the set, so you can tell whether something belongs to the set or not. So what's a good strategy here? Yes: create a list for each slot and put all the elements that hash to the same slot into that list. That's called resolving collisions by chaining. The idea is to link records in the same slot into a list. So, for example, imagine this is my hash table and this, say, is slot i. Several things that are elements of S may have been inserted into this table, and what I'll do is just link them together into a list, with a nil pointer at the end; each node holds the key and its satellite data. So if records with keys 49, 86, and 52 are all linked together in slot i, then the hash function applied to 49 has to equal the hash function applied to 86, which equals the hash function applied to 52, which equals what? There's only one thing I haven't mentioned.
i. Good. Even if you don't understand it, your quizmanship should tell you: he didn't mention i, so it's equal to i. The point is that when I hash 49, the hash of 49 produces some index in the table, say i, and every record that hashes to that same location is linked together into a list. Any questions about the mechanics of this? I hope that most of you have seen basic hashing in 6.001. They used to teach it there. OK, some people are saying maybe. Good. So let's analyze this strategy. We'll first do the worst case. What happens in the worst case with hashing? Raise your hand so that I can call on you. Yes: all the keys in S hash to the same slot. I happen to pick a set S where my hash function happens to map them all to the same value. That would be bad. So every key hashes to the same slot, and if that happens, then what I've essentially built is a fancy linked list for keeping this data structure.
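The chaining scheme can be sketched roughly as follows. This is an illustrative sketch, not the lecture's code: records are represented as (key, data) tuples, and the default hash function, k mod m, is just a stand-in. In the worst case just described, every record ends up in one long chain.

```python
class ChainedHashTable:
    """Each of the m slots holds a list (a chain) of the records
    that hash to that slot."""

    def __init__(self, m, h=None):
        self.m = m
        self.h = h or (lambda k: k % m)   # placeholder hash function
        self.slots = [[] for _ in range(m)]

    def insert(self, x):                  # x is a (key, data) record
        self.slots[self.h(x[0])].append(x)

    def search(self, k):
        for x in self.slots[self.h(k)]:   # walk the chain in slot h(k)
            if x[0] == k:
                return x
        return None                       # unsuccessful search

    def delete(self, x):                  # delete takes a record, not a key
        self.slots[self.h(x[0])].remove(x)
```

If keys 49, 86, and 52 all hash to the same slot i, the three records simply end up appended to slot i's chain, and search walks that chain front to back.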
All this stuff with the tables, the hashing, and so on is irrelevant; all that matters is that I have a long linked list. And then how long does an access take? How long does it take me to insert something or, more importantly, to search for something, to find out whether something's in there, in the worst case? Yes, it takes order n time, because we just have a linked list. So access takes Theta(n) time if, as we assume, the size of S is n. From a worst-case point of view, this doesn't look so attractive. And we will see data structures that in the worst case do very well on this problem; but they don't do as well as the average case of hashing. So let's analyze the average case. In order to analyze the average case, whenever you have averages, whenever you have probability, you have to state your assumptions. You have to say what the assumption is about the behavior of the system, and that's very hard to do here because you don't necessarily know what the hash function is. Well, let's imagine an ideal hash function. What should an ideal hash function do?
Yes: map the keys essentially at random to a slot. It should really distribute them randomly. So we call this the assumption of simple uniform hashing. What it means is that each key k in S is equally likely to be hashed to any slot in T, and we actually have to make an independence assumption as well: independent of where the other keys are hashed. So we're going to make this assumption, and it includes an independence assumption. Under it, if I have two keys, what are the odds that they're hashed to the same place, if I have, say, m slots? One over m. What are the odds that one key is hashed to slot 15? Also one over m, because the keys are being distributed uniformly. But in particular, the odds that two keys are hashed to the same slot is one over m. So let's define the load factor of a hash table with n keys and m slots to be alpha, which is equal to n over m; if you think about it, that's also just the average number of keys per slot. So alpha is the average number of keys per slot, and we call it the load factor of the table.
So on average, how many keys do I have per slot? Alpha. We'll look first at the expected unsuccessful search time. By an unsuccessful search, I mean I'm looking for something that's actually not in the table: I look for a key that's not there, so the search is going to return nil. What's the cost going to be? Well, I have to do a certain amount of work just to compute the hash function and so forth, so it's going to be at least order one, plus the cost of searching the list. On average, how much of the list do I have to search? Since the key isn't in the table, whichever slot I land in, I've got to search to the end of its list, right? So what's the average cost over all the slots in the table? Alpha. Right? Alpha is the average length of a list. So the expected unsuccessful search time is one plus alpha: the one is essentially the cost of doing the hash and accessing the slot, and the alpha is the cost of searching the list. So the expected unsuccessful search time is proportional essentially to alpha, and if alpha is bigger than one, it's order alpha.
If alpha is less than one, it's constant. So when is the expected search time order one? Simple questions, by the way; I only ask simple questions. Some guys ask hard questions. We'll get there in two steps. In terms of alpha, it's when alpha is O(1). Alpha doesn't have to be constant; it could be less than constant. Or equivalently, which is what you said, when n is O(m), which is to say, when the number of elements in the table is upper bounded by a constant times the number of slots m. Then the search cost is constant. So a lot of people will tell you, oh, a hash table runs in constant search time. That's actually wrong: it depends upon the load factor of the hash table. And people have made programming errors based on that misunderstanding of hash tables, because they had a hash table that was too small for the number of elements they were putting into it. Since the cost is one plus n over m, it actually grows with n.
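The one-plus-alpha behavior is easy to check empirically. The sketch below is not from the lecture: it simulates simple uniform hashing with a seeded random number generator (an idealization, since no concrete hash function is fixed here) and measures how many records an unsuccessful search must scan.

```python
import random

def avg_unsuccessful_scan(n, m, probes=10000, seed=0):
    """Hash n keys uniformly into m slots, then return the average
    chain length an unsuccessful search would have to scan."""
    rng = random.Random(seed)
    chain_len = [0] * m
    for _ in range(n):
        chain_len[rng.randrange(m)] += 1       # simple uniform hashing
    # An unsuccessful search lands in a uniformly random slot and must
    # scan that entire chain; average this cost over many probes.
    return sum(chain_len[rng.randrange(m)] for _ in range(probes)) / probes
```

With n = 10000 and m = 1000 the measured scan cost comes out close to alpha = 10, and shrinking m while holding n fixed drives it up proportionally, which is exactly the "table too small" mistake described above.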
So unless you make sure that m keeps up with n, this doesn't stay constant. Now, it turns out that for a successful search, the expected time is also one plus alpha. For that you need to do a little bit more mathematics, because you now have to condition on searching for the items that are in the table; but it turns out it's also one plus alpha, and you can read about that in the book. There's also a more rigorous proof of this: I've sort of glossed over the expectation arguments here and given a more intuitive proof, so both of those things you should look for in the book. So this is one reason why hashing is such a popular method: it basically lets you represent a dynamic set with order-one cost per operation, constant cost for inserting, deleting, and so forth, as long as the table you're keeping is not much smaller than the number of items you're putting into it. But it depends strongly upon this assumption of simple uniform hashing. And no matter what hash function you pick, I can always find a set of elements that that hash function is going to hash badly.
I could just generate a whole bunch of keys, look to see where the hash function takes them, and in the end pick a whole bunch that hash to the same place. We're actually going to see a way of countering that, but in practice, most programs that use hashing aren't really reverse-engineering the hash function, and so there are some very simple hash functions that seem to work fairly well in practice. So in choosing a hash function, we would like it to distribute keys uniformly into slots, and we would also like regularity in the key distribution not to affect uniformity. For example, a regularity that you often see is that all the keys being inserted are even numbers: somebody just happens to have that property in his data. In fact, on many machines, since they use byte pointers, if you're storing things that are, for example, indexes into arrays or something like that, the keys are typically numbers divisible by four, or by eight. So you don't want regularity in the key distribution to affect the fact that you're distributing keys uniformly over the slots.
Probably the most popular method used for a quick hash function is what's called the division method. The idea here is that you simply let h(k) = k mod m, where m is the number of slots in your table. This works reasonably well in practice, but you want to be careful about your choice of modulus; it turns out it doesn't work well for every possible table size you might pick. Fortunately, when you're building hash tables, you don't usually care about the specific size of the table: if you pick something around the size you want, that's probably fine, because it's not going to affect performance, so there's no need to pick a specific value. In particular, you don't want to pick m with a small divisor d. Let me illustrate why that's a bad idea for this particular hash function. For example, if d is two, in other words if m is an even number, and it turns out we're in the situation I just mentioned, where all the keys are even, what happens to my usage of the hash table?
So I have an even number of slots, and all the keys that the user of the hash table chooses to pick happen to be even numbers; what's going to happen in terms of my use of the hash table? Well, in the worst case, they could all land in the same slot no matter what hash function I pick. But here, let's say that, in fact, my hash function otherwise does a pretty good job of distributing, and I just have this property. What's going to hold no matter what set of keys I pick, as long as they're all even? I have an even number mod an even number. What does that say about the hash value? It's even, right? So what's going to happen to my use of the table? Yes: you're never going to hash anything to an odd-numbered slot. You've wasted half your slots. It doesn't matter what the key distribution is: as long as the keys are all even, the odd slots are never used. That's an extreme example. Here's another example: imagine that m is equal to 2^r, so that all its factors are small divisors.
In that case, if I think about taking k mod m, the hash doesn't even depend on all the bits of k. For example, suppose I have some binary number k, and r equals six, so m is two to the sixth. I take this binary number mod 2^6; what's the hash value? If I take something mod a power of two, what does it do? What's this number mod two? Zero, right: its last bit. What's it mod four? One-zero, its last two bits. What is it mod 2^6? Yes, it's just the last six bits; that's h(k). When you take something mod a power of two, all you're doing is taking its low-order bits: mod 2^r, you are taking its r low-order bits. So the hash function doesn't even depend on what's up in the high-order bits. That's a pretty bad situation, because a very common regularity you'll see in data is that all the low-order bits are the same and all the high-order bits differ, or vice versa. So this particular choice of m is not a very good one.
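Both pitfalls are easy to demonstrate. The sketch below is illustrative, not from the lecture: it checks that with an even m no even key ever reaches an odd slot, and that taking k mod a power of two just masks off the low-order bits. The particular key written in binary is an arbitrary example, not the number from the board.

```python
# Pitfall 1: even m plus all-even keys wastes every odd-numbered slot.
m = 64                                   # even m (worse still: a power of two)
even_keys = range(0, 1000, 2)            # a key set with a common regularity
used = {k % m for k in even_keys}
assert all(slot % 2 == 0 for slot in used)   # odd slots are never used

# Pitfall 2: k mod 2**r keeps only the r low-order bits of k,
# i.e. k % 2**r == k & (2**r - 1); the high-order bits never matter.
k = 0b1011001110110110                   # an arbitrary key, in binary
r = 6
assert k % (1 << r) == k & ((1 << r) - 1) == 0b110110
```

Flipping any bit of k above the low six leaves k mod 2^6 unchanged, so keys that differ only in their high-order bits all collide.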
413 00:37:20,000 --> 00:37:25,000 So, a good heuristic for this is to pick m to be a prime, 414 00:37:25,000 --> 00:37:31,000 not too close to a power of two or ten because those are the two 415 00:37:31,000 --> 00:37:36,000 common bases that you see regularity in the world. 416 00:37:36,000 --> 00:37:39,000 A prime is sometimes inconvenient, 417 00:37:39,000 --> 00:37:41,000 however. But generally, 418 00:37:41,000 --> 00:37:44,000 it's fairly easy to find primes. 419 00:37:44,000 --> 00:37:49,000 And there's a lot of nice theorems about primes. 420 00:37:49,000 --> 00:37:54,000 So, generally what you do, if you're just coding up 421 00:37:54,000 --> 00:38:00,000 something and you know what it is, you can pick a prime out of 422 00:38:00,000 --> 00:38:06,000 a textbook or look it up on the web or write a little program, 423 00:38:06,000 --> 00:38:11,000 or whatever, and pick a prime. 424 00:38:11,000 --> 00:38:15,000 Not too close to a power of two or ten, and it will probably 425 00:38:15,000 --> 00:38:18,000 work pretty well. 426 00:38:18,000 --> 00:38:20,000 So, this is a very popular 427 00:38:20,000 --> 00:38:24,000 method, the division method. OK, but the next method we are 428 00:38:24,000 --> 00:38:27,000 going to see is actually usually superior. 429 00:38:27,000 --> 00:38:32,000 The reason people do this is because they can write it in-line 430 00:38:32,000 --> 00:38:36,000 in their code. OK, but it's not usually the 431 00:38:36,000 --> 00:38:39,000 best method. And one of the reasons is because 432 00:38:39,000 --> 00:38:44,000 division tends to take a lot of 433 00:38:44,000 --> 00:38:48,000 cycles to compute on most computers compared with 434 00:38:48,000 --> 00:38:51,000 multiplication or addition. OK, in fact, 435 00:38:51,000 --> 00:38:55,000 it's usually done by taking several multiplications.
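As a quick sketch of the division method: the modulus 701 below is my own example of a prime not too close to a power of two or ten, not a value from the lecture, and the second half demonstrates the even-m pitfall from above.

```python
# Division method: h(k) = k mod m, with m a prime not too close to
# a power of two or ten. 701 is an illustrative choice, not prescribed.
def hash_division(k, m=701):
    return k % m

# The even-m pitfall: with m even and all keys even, the odd-numbered
# slots are never used, so half the table is wasted.
even_m = 100
slots_used = {k % even_m for k in range(0, 10_000, 2)}  # all-even keys
assert all(s % 2 == 0 for s in slots_used)
```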
436 00:38:55,000 --> 00:38:59,000 So, the next method is actually generally better, 437 00:38:59,000 --> 00:39:03,000 but none of the hash function methods that we are talking 438 00:39:03,000 --> 00:39:06,000 about today are, in some sense, 439 00:39:06,000 --> 00:39:12,000 provably good hash functions. OK, so for the multiplication 440 00:39:12,000 --> 00:39:18,000 method, the nice thing about it is that it essentially just 441 00:39:18,000 --> 00:39:22,000 requires a multiplication to do. And, for this, 442 00:39:22,000 --> 00:39:28,000 also, we are going to assume that the number of slots is a 443 00:39:28,000 --> 00:39:32,000 power of two, which is also often very convenient. 444 00:39:32,000 --> 00:39:37,000 OK, and for this, we're going to assume that the 445 00:39:37,000 --> 00:39:44,000 computer has w bit words. So, it would be convenient on a 446 00:39:44,000 --> 00:39:50,000 computer with 32 bits, or 64 bits, for example. 447 00:39:50,000 --> 00:39:54,000 OK, this would be very convenient. 448 00:39:54,000 --> 00:39:59,000 So, the hash function is the following. 449 00:39:59,000 --> 00:40:04,000 h of k is equal to A times k, mod two to the w, 450 00:40:04,000 --> 00:40:12,000 right shifted by w minus r. OK, so the key part of this is 451 00:40:12,000 --> 00:40:20,000 A, which is chosen to be an odd integer in the range between two 452 00:40:20,000 --> 00:40:24,000 to the w minus one and two to the w. 453 00:40:24,000 --> 00:40:31,000 OK, so it's an odd integer that is the full width of the computer 454 00:40:31,000 --> 00:40:36,000 word. OK, and what you do is multiply 455 00:40:36,000 --> 00:40:42,000 it by whatever your key is, by this funny integer. 456 00:40:42,000 --> 00:40:47,000 And, then take it mod two to the w. 457 00:40:47,000 --> 00:40:54,000 And then, you take the result and right shift it by this fixed 458 00:40:54,000 --> 00:41:00,000 amount, w minus r. So, this is a bit wise right 459 00:41:00,000 --> 00:41:06,000 shift.
OK, so let's look at what this 460 00:41:06,000 --> 00:41:12,000 does. But first, let me just give you 461 00:41:12,000 --> 00:41:21,000 a couple of tips on how you pick, or what you don't pick for 462 00:41:21,000 --> 00:41:27,000 A. So, you don't pick A too close 463 00:41:27,000 --> 00:41:34,000 to a power of two. And, it's generally a pretty 464 00:41:34,000 --> 00:41:42,000 fast method because multiplication mod two to the w 465 00:41:42,000 --> 00:41:49,000 is faster than division. And the other thing is that a 466 00:41:49,000 --> 00:41:52,000 right shift is fast, especially because this is a 467 00:41:52,000 --> 00:41:55,000 known shift. OK, you know it before you are 468 00:41:55,000 --> 00:41:59,000 computing the hash function. Both w and r are known in 469 00:41:59,000 --> 00:42:02,000 advance. So, the compiler can often do 470 00:42:02,000 --> 00:42:06,000 tricks there to make it go even faster. 471 00:42:06,000 --> 00:42:11,000 So, let's do an example to understand how this hash 472 00:42:11,000 --> 00:42:14,000 function works. So, we will have, 473 00:42:14,000 --> 00:42:18,000 in this case, the number of slots will be 474 00:42:18,000 --> 00:42:22,000 eight, which is two to the three. 475 00:42:22,000 --> 00:42:26,000 And, we'll have a bizarre word size of seven bits. 476 00:42:26,000 --> 00:42:33,000 Anybody know any seven bit computers out there? 477 00:42:33,000 --> 00:42:39,000 OK, well, here's one. So, A is our fixed value that's 478 00:42:39,000 --> 00:42:45,000 used for hashing all our keys. And, in this case, 479 00:42:45,000 --> 00:42:50,000 let's say it's 1011001. So, that's A. 480 00:42:50,000 --> 00:42:57,000 And, I take in some value for k that I'm going to multiply. 481 00:42:57,000 --> 00:43:04,000 So, k is going to be 1101011. So, that's my k. 482 00:43:04,000 --> 00:43:07,000 And, I multiply them. When I multiply the two, 483 00:43:07,000 --> 00:43:10,000 each of these is the full word width.
484 00:43:10,000 --> 00:43:14,000 You can view it as the full word width of the machine, 485 00:43:14,000 --> 00:43:16,000 in this case, seven bits. 486 00:43:16,000 --> 00:43:20,000 So, in general, this would be like a 32 bit 487 00:43:20,000 --> 00:43:24,000 number, and my key, I'd be multiplying two 32 bit 488 00:43:24,000 --> 00:43:28,000 numbers, for example. OK, and so, when I multiply 489 00:43:28,000 --> 00:43:33,000 that out, I get a 2w bit answer. So, when you multiply two w bit 490 00:43:33,000 --> 00:43:38,000 numbers, you get a 2w bit answer. 491 00:43:38,000 --> 00:43:44,000 In this case, it happens to be that number, 492 00:43:44,000 --> 00:43:49,000 OK? So, that's the product part, 493 00:43:49,000 --> 00:43:54,000 OK? And then we take it mod two to 494 00:43:54,000 --> 00:43:59,000 the w. Well, what mod two to the w 495 00:43:59,000 --> 00:44:09,000 says is that I'm just taking, ignoring the high order bits of 496 00:44:09,000 --> 00:44:16,000 this product. So, all of these are ignored, 497 00:44:16,000 --> 00:44:22,000 because, remember that if I take something, 498 00:44:22,000 --> 00:44:30,000 mod, a power of two, that's just the low order bits. 499 00:44:30,000 --> 00:44:33,000 So, I just get these low order bits as being the mod. 500 00:44:33,000 --> 00:44:38,000 And then, the right shift operation, and that's good also, 501 00:44:38,000 --> 00:44:42,000 by the way, because a lot of machines, when I multiply two 32 502 00:44:42,000 --> 00:44:46,000 bit numbers, they'll have an instruction that gives you just 503 00:44:46,000 --> 00:44:49,000 the 32 lower bits. And, it's usually an 504 00:44:49,000 --> 00:44:54,000 instruction that's faster than the instruction that gives you 505 00:44:54,000 --> 00:44:58,000 the full 64 bit answer. OK, so, that's very convenient. 
506 00:44:58,000 --> 00:45:01,000 And, the second thing is, then, that I want just the, 507 00:45:01,000 --> 00:45:04,000 in this case, three bits that are the high 508 00:45:04,000 --> 00:45:11,000 order bits of this word. So, this ends up being my H of 509 00:45:11,000 --> 00:45:13,000 k. And these end up getting 510 00:45:13,000 --> 00:45:18,000 removed by right shifting this word over. 511 00:45:18,000 --> 00:45:23,000 So, you just right shift that in, zeros come in, 512 00:45:23,000 --> 00:45:28,000 in a high order bit, and you end up getting that 513 00:45:28,000 --> 00:45:32,000 value of H of k. OK, so to understand what's 514 00:45:32,000 --> 00:45:36,000 going on here, why this is a pretty good 515 00:45:36,000 --> 00:45:43,000 method, or what's happening with it, you can imagine that one way 516 00:45:43,000 --> 00:45:52,000 to think about it is to think of A as being a binary fraction. 517 00:45:52,000 --> 00:45:55,000 So, imagine that the decimal point is here, 518 00:45:55,000 --> 00:46:00,000 sorry, the binary point, OK, the radix point is here. 519 00:46:00,000 --> 00:46:03,000 Then when I multiply things, I'm just taking, 520 00:46:03,000 --> 00:46:06,000 the binary point ends up being there. 521 00:46:06,000 --> 00:46:09,000 OK, so if you just imagine that conceptually, 522 00:46:09,000 --> 00:46:14,000 we don't have to actually put this into the hardware because 523 00:46:14,000 --> 00:46:16,000 we just do what the hardware does. 524 00:46:16,000 --> 00:46:20,000 But, I can imagine that it's there, and that it's here. 525 00:46:20,000 --> 00:46:25,000 And so, what I'm really taking is the fractional part of this 526 00:46:25,000 --> 00:46:29,000 product if I treat A as a fraction of a number. 527 00:46:29,000 --> 00:46:35,000 So, we can certainly look at that as sort of a modular wheel. 
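Pulling the whole example together, here is the multiplication method in Python using the lecture's toy values: m = 8 = 2^3 slots, a seven-bit word, A = 1011001 in binary, and k = 1101011 in binary. The final slot number is my own arithmetic from those values, not something stated on the board.

```python
# Multiplication method: h(k) = ((A * k) mod 2^w) >> (w - r),
# where m = 2^r is the number of slots and w is the machine word size.
w, r = 7, 3            # the "bizarre" 7-bit word; m = 2^3 = 8 slots
A = 0b1011001          # odd integer, the full word width

def hash_mul(k, A=A, w=w, r=r):
    low_word = (A * k) % (1 << w)   # keep the w low-order bits of the product
    return low_word >> (w - r)      # then take the r high-order bits of that word

k = 0b1101011
h = hash_mul(k)
assert 0 <= h < (1 << r)   # an r-bit slot number
print(h)                   # these particular values work out to slot 3
```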
528 00:46:35,000 --> 00:46:39,000 So, here I have a wheel where this is going to be, 529 00:46:39,000 --> 00:46:43,000 that I'm going to divide into eight parts, OK, 530 00:46:43,000 --> 00:46:48,000 where this point is zero. And then, I go around, 531 00:46:48,000 --> 00:46:52,000 and this point is then one. And, I go around, 532 00:46:52,000 --> 00:46:55,000 and this point is two, and so forth, 533 00:46:55,000 --> 00:47:01,000 so that all the integers, if I wrap it around this unit 534 00:47:01,000 --> 00:47:06,000 wheel, all the integers lined up at the zero point here, 535 00:47:06,000 --> 00:47:10,000 OK? And then, we can divide this 536 00:47:10,000 --> 00:47:14,000 into the fractional pieces. So, that's essentially the zero 537 00:47:14,000 --> 00:47:17,000 point. This is the one eighth, 538 00:47:17,000 --> 00:47:20,000 because we are dividing into eight, two, three, 539 00:47:20,000 --> 00:47:23,000 four, five, six, seven. 540 00:47:23,000 --> 00:47:28,000 So, if I have one times A, in this case, 541 00:47:28,000 --> 00:47:33,000 I'm basically saying, well, one times A, 542 00:47:33,000 --> 00:47:39,000 if I multiply, is basically going around to 543 00:47:39,000 --> 00:47:45,000 about there, five and a half I think, right, 544 00:47:45,000 --> 00:47:51,000 because one times A is about five and a half, 545 00:47:51,000 --> 00:47:59,000 OK, or 5.5 eighths, essentially. 546 00:47:59,000 --> 00:48:04,000 So, it takes me about to there. That's A. 547 00:48:04,000 --> 00:48:09,000 And, if I do two times A, that continues around, 548 00:48:09,000 --> 00:48:12,000 and takes me up to about, where? 549 00:48:12,000 --> 00:48:18,000 About, a little past three, about to there. 550 00:48:18,000 --> 00:48:22,000 So, that's two times A. OK, and three times A takes me, 551 00:48:22,000 --> 00:48:28,000 then, around to somewhere like about there.
552 00:48:28,000 --> 00:48:35,000 So, each time I add another A, it's taking me another A's 553 00:48:35,000 --> 00:48:41,000 distance around. And, the idea is that if A is, 554 00:48:41,000 --> 00:48:44,000 for example, odd, and it's not too close to 555 00:48:44,000 --> 00:48:48,000 a power of two, then what's happening is it's sort 556 00:48:48,000 --> 00:48:52,000 of throwing it into a different slot each time around. 557 00:48:52,000 --> 00:48:57,000 So, if I now go around, if I have k being very big, 558 00:48:57,000 --> 00:49:01,000 then k times A is going around k times. 559 00:49:01,000 --> 00:49:04,000 Where does it end up? It's like spinning a wheel of 560 00:49:04,000 --> 00:49:06,000 fortune or something. OK, it ends somewhere. 561 00:49:06,000 --> 00:49:09,000 OK, and so that's basically the notion, 562 00:49:09,000 --> 00:49:12,000 that it's going to end up in 563 00:49:12,000 --> 00:49:15,000 some place. So, you're basically looking 564 00:49:15,000 --> 00:49:18,000 at, where does kA end up? Well, it sort of whirls around, 565 00:49:18,000 --> 00:49:22,000 and ends up at some point. OK, and so that's why that 566 00:49:22,000 --> 00:49:26,000 tends to be a fairly good one. But, these are only heuristic 567 00:49:26,000 --> 00:49:29,000 methods for hashing, because for any hash function, 568 00:49:29,000 --> 00:49:32,000 you can always find a set of keys that's going to make it 569 00:49:32,000 --> 00:49:38,000 operate badly. So, the question is, 570 00:49:38,000 --> 00:49:44,000 well, what do you use in practice? 571 00:49:44,000 --> 00:49:52,000 OK, the second topic that I want to get to. So, 572 00:49:52,000 --> 00:50:03,000 we talked about resolving collisions by chaining. 573 00:50:03,000 --> 00:50:11,000 OK, there's another way of resolving collisions, 574 00:50:11,000 --> 00:50:19,000 which is often useful, which is resolving collisions 575 00:50:19,000 --> 00:50:25,000 by what's called open addressing.
576 00:50:25,000 --> 00:50:31,000 OK, and the idea, in this method, 577 00:50:31,000 --> 00:50:38,000 is we have no storage for links. 578 00:50:38,000 --> 00:50:43,000 So, when I resolve by chaining, I'd need an extra link field 579 00:50:43,000 --> 00:50:47,000 in each record in order to be able to do that. 580 00:50:47,000 --> 00:50:51,000 Now, that's not necessarily a big overhead, 581 00:50:51,000 --> 00:50:57,000 but for some applications, I don't want to have to touch 582 00:50:57,000 --> 00:51:00,000 those records at all. OK, and for those, 583 00:51:00,000 --> 00:51:07,000 open addressing is a useful way to resolve collisions. 584 00:51:07,000 --> 00:51:10,000 So, the idea, with open addressing, 585 00:51:10,000 --> 00:51:15,000 is if I hash to a given slot, and the slot is full, 586 00:51:15,000 --> 00:51:21,000 OK, what I do is I just hash again with a different hash 587 00:51:21,000 --> 00:51:25,000 function, with my second hash function. 588 00:51:25,000 --> 00:51:29,000 I check that slot. OK, if that slot is full, 589 00:51:29,000 --> 00:51:34,000 OK, then I hash again. And, I keep this probe 590 00:51:34,000 --> 00:51:39,000 sequence, which hopefully is a permutation so that I'm not 591 00:51:39,000 --> 00:51:43,000 going back and checking things that I've already checked until 592 00:51:43,000 --> 00:51:47,000 I find a place to put it. And, if I've got a good probe 593 00:51:47,000 --> 00:51:52,000 sequence, I will hopefully, then, find a place to put it 594 00:51:52,000 --> 00:51:55,000 fairly quickly. OK, and then to search, 595 00:51:55,000 --> 00:51:59,000 I just follow the same probe sequence. 596 00:51:59,000 --> 00:52:05,000 So, the idea, here, is we probe the table 597 00:52:05,000 --> 00:52:12,000 systematically until an empty slot is found, 598 00:52:12,000 --> 00:52:17,000 OK?
And so, we can extend that by 599 00:52:17,000 --> 00:52:25,000 looking as if the sequence of hash functions were, 600 00:52:25,000 --> 00:52:32,000 in fact, a hash function that took two arguments: 601 00:52:32,000 --> 00:52:40,000 a key and a probe step. In other words, 602 00:52:40,000 --> 00:52:44,000 is it the zeroth one, the first one, 603 00:52:44,000 --> 00:52:48,000 the second one, etc. 604 00:52:48,000 --> 00:52:55,000 So, it takes two arguments. So, H is then going to map our 605 00:52:55,000 --> 00:53:04,000 universe of keys cross our probe number into a slot. 606 00:53:04,000 --> 00:53:10,000 So, this is the universe of keys. 607 00:53:10,000 --> 00:53:20,000 This is the probe number. And, this is going to be the 608 00:53:20,000 --> 00:53:25,000 slot. Now, as I mentioned, 609 00:53:25,000 --> 00:53:34,000 the probe sequence should be a permutation. 610 00:53:34,000 --> 00:53:38,000 In other words, it should just be the numbers 611 00:53:38,000 --> 00:53:44,000 from zero to m minus one in some fairly random order. 612 00:53:44,000 --> 00:53:48,000 OK, it should just be rearranged. 613 00:53:48,000 --> 00:53:54,000 And the other thing about open addressing, which you don't 614 00:53:54,000 --> 00:54:01,000 have to worry about with chaining, is that the table may actually 615 00:54:01,000 --> 00:54:05,000 fill up. So, you have to have that the 616 00:54:05,000 --> 00:54:10,000 number of elements in the table is less than or equal to the 617 00:54:10,000 --> 00:54:16,000 table size, the number of slots, because the table may fill up. 618 00:54:16,000 --> 00:54:19,000 And, if it's full, you're going to probe 619 00:54:19,000 --> 00:54:23,000 everywhere. You are never going to get a 620 00:54:23,000 --> 00:54:27,000 place to put it. And, the final thing is that in 621 00:54:27,000 --> 00:54:32,000 this type of scheme, deletion is difficult. 622 00:54:32,000 --> 00:54:34,000 It's not impossible.
There are schemes for doing 623 00:54:34,000 --> 00:54:36,000 deletion. But, it's basically hard 624 00:54:36,000 --> 00:54:40,000 because the danger is that you remove a key out of the table, 625 00:54:40,000 --> 00:54:44,000 and now, somebody who's doing a probe sequence who would have 626 00:54:44,000 --> 00:54:47,000 hit that key and gone to find his element now finds that it's 627 00:54:47,000 --> 00:54:49,000 an empty slot. And he says, 628 00:54:49,000 --> 00:54:52,000 oh, the key I am looking for probably isn't there. 629 00:54:52,000 --> 00:54:54,000 OK, so you have that issue to deal with. 630 00:54:54,000 --> 00:54:57,000 So, you can delete things but keep them marked, 631 00:54:57,000 --> 00:55:00,000 and there's all kinds of schemes that people have for 632 00:55:00,000 --> 00:55:04,000 doing deletion. But it's difficult. 633 00:55:04,000 --> 00:55:07,000 It's messy compared to chaining, where you can just 634 00:55:07,000 --> 00:55:09,000 remove the element out of the chain. 635 00:55:09,000 --> 00:55:12,000 So, let's do an example -- 636 00:55:25,000 --> 00:55:37,000 -- just so that we make sure we're on the same page. 637 00:55:37,000 --> 00:55:45,000 So, we'll insert a key. k is 496. 638 00:55:45,000 --> 00:55:57,000 OK, so here's my table. And, I've got some values in 639 00:55:57,000 --> 00:56:06,000 it, 586, 133, 204, 481, etc. 640 00:56:06,000 --> 00:56:13,000 So, the table looks like that; the other places are empty. 641 00:56:13,000 --> 00:56:18,000 So, on my zero step, I probe H of 496, 642 00:56:18,000 --> 00:56:22,000 zero. OK, and let's say that takes me 643 00:56:22,000 --> 00:56:28,000 to the slot where there's 204. And so, I say, 644 00:56:28,000 --> 00:56:36,000 oh, there's something there. I have to probe again. 645 00:56:36,000 --> 00:56:41,000 So then, I probe H of 496, one. 646 00:56:41,000 --> 00:56:47,000 Maybe that maps me there, and I discover, 647 00:56:47,000 --> 00:56:55,000 oh, there's something there. 
So, now, I probe H of 496, 648 00:56:55,000 --> 00:57:02,000 two. Maybe that takes me to there. 649 00:57:02,000 --> 00:57:04,000 It's empty. So, if I'm doing a search, 650 00:57:04,000 --> 00:57:07,000 I report nil. If I'm doing in the insert, 651 00:57:07,000 --> 00:57:11,000 I put it there. And then, if I'm looking for 652 00:57:11,000 --> 00:57:15,000 that value, if I put it there, then when I'm looking, 653 00:57:15,000 --> 00:57:18,000 I go through exactly the same sequence. 654 00:57:18,000 --> 00:57:21,000 I'll find these things are busy, and then, 655 00:57:21,000 --> 00:57:26,000 eventually, I'll come up and discover the value. 656 00:57:26,000 --> 00:57:29,000 OK, and there are various heuristics that people use, 657 00:57:29,000 --> 00:57:34,000 as well, like keeping track of the longest probe sequence 658 00:57:34,000 --> 00:57:37,000 because there's no point in probing beyond the largest 659 00:57:37,000 --> 00:57:41,000 number of probes that need to be done globally to do an 660 00:57:41,000 --> 00:57:44,000 insertion. OK, so if it took me 5, 661 00:57:44,000 --> 00:57:48,000 5 is the maximum number of probes I ever did for an 662 00:57:48,000 --> 00:57:51,000 insertion. A search never has to look more 663 00:57:51,000 --> 00:57:54,000 than five, OK, and so sometimes hash tables 664 00:57:54,000 --> 00:57:58,000 will keep that auxiliary value so that it can quit rather than 665 00:57:58,000 --> 00:58:04,000 continuing to probe until it doesn't find something. 666 00:58:04,000 --> 00:58:13,000 OK, so, search is the same probe sequence. 667 00:58:13,000 --> 00:58:23,000 And, if it's successful, it finds the record. 668 00:58:23,000 --> 00:58:34,000 And, if it's unsuccessful, you find a nil. 669 00:58:34,000 --> 00:58:37,000 OK, so it's pretty straightforward. 
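The insert-and-search pattern just walked through might be sketched like this in Python. The probe function here is a stand-in of my own (a simple scan starting from k mod m), not the lecture's h(k, i), but the keys are the ones from the board example:

```python
# A minimal open-addressing table. probe(k, i) stands in for a real
# probe sequence h(k, i); any function enumerating a permutation of
# the slots for each key would do.
m = 8
table = [None] * m

def probe(k, i):
    return (k + i) % m          # placeholder probe sequence

def insert(k):
    for i in range(m):          # at most m probes: the table may fill up
        s = probe(k, i)
        if table[s] is None:
            table[s] = k
            return s
    raise RuntimeError("table is full")

def search(k):
    for i in range(m):
        s = probe(k, i)
        if table[s] is None:    # empty slot: k cannot be in the table
            return None
        if table[s] == k:       # found the record
            return s
    return None

for key in (586, 133, 204, 481, 496):   # values from the board example
    insert(key)
assert search(496) is not None
assert search(999) is None              # unsuccessful search finds a nil
```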
670 00:58:37,000 --> 00:58:42,000 So, once again, as with just hash functions to 671 00:58:42,000 --> 00:58:49,000 begin with, there are a lot of ideas about how you should form 672 00:58:49,000 --> 00:58:55,000 a probe sequence, ways of doing this effectively. 673 00:59:06,000 --> 00:59:14,000 OK, so the simplest one is called linear probing, 674 00:59:14,000 --> 00:59:22,000 and what you do there is you have H of k comma i. 675 00:59:22,000 --> 00:59:33,000 You just make that be some H prime of k, zero, plus i, mod m. 676 00:59:33,000 --> 00:59:36,000 Sorry, no prime there. OK, so what happens is, 677 00:59:36,000 --> 00:59:41,000 so, the idea here is that all you are doing on the i'th probe 678 00:59:41,000 --> 00:59:44,000 is, on the zeroth probe, you look at H of k, zero. 679 00:59:44,000 --> 00:59:48,000 On probe one, you just look at the slot after 680 00:59:48,000 --> 00:59:50,000 that. Probe two, you look at the slot 681 00:59:50,000 --> 00:59:53,000 after that. So, you're just simply, 682 00:59:53,000 --> 00:59:56,000 rather than sort of jumping around like this, 683 00:59:56,000 --> 01:00:01,542 you probe there and then just find the next one that will fit 684 01:00:01,542 --> 01:00:04,785 in. OK, so you just scan down mod 685 01:00:04,785 --> 01:00:06,509 m. So, if you hit the bottom, 686 01:00:06,509 --> 01:00:08,848 you go to the top. OK, so the i'th one. 687 01:00:08,848 --> 01:00:12,050 So, that's fairly easy to do because you don't have to 688 01:00:12,050 --> 01:00:14,574 recompute a full hash function each time. 689 01:00:14,574 --> 01:00:18,083 All you have to do is add one each time you go because the 690 01:00:18,083 --> 01:00:21,531 difference between this and the previous one is just one. 691 01:00:21,531 --> 01:00:24,794 OK, so you just go down. Now, the problem with that is 692 01:00:24,794 --> 01:00:27,195 that you get a phenomenon of clustering.
693 01:00:27,195 --> 01:00:30,458 If you get a few things in a given area, then suddenly 694 01:00:30,458 --> 01:00:33,906 everything, everybody has to keep searching to the end of 695 01:00:33,906 --> 01:00:38,277 those things. OK, so that turns out not to be 696 01:00:38,277 --> 01:00:42,246 one of the better schemes, although it's not bad if you 697 01:00:42,246 --> 01:00:45,258 just need to do something quick and dirty. 698 01:00:45,258 --> 01:00:49,594 So, it suffers from primary clustering, where regions of the 699 01:00:49,594 --> 01:00:53,635 hash table get very full. And then, anything that hashes 700 01:00:53,635 --> 01:00:57,750 into that region has to look through all the stuff that's 701 01:00:57,750 --> 01:01:02,030 there. OK, so: long runs of filled 702 01:01:02,030 --> 01:01:05,846 slots. OK, there's also things like 703 01:01:05,846 --> 01:01:11,459 quadratic probing, where you basically make this 704 01:01:11,459 --> 01:01:17,744 be, instead of adding one each time, you add i each time. 705 01:01:17,744 --> 01:01:23,581 OK, but probably the most effective popular scheme is 706 01:01:23,581 --> 01:01:29,867 what's called double hashing. And, you can do statistical 707 01:01:29,867 --> 01:01:35,715 studies. People have done statistical 708 01:01:35,715 --> 01:01:41,819 studies to show that this is a good scheme, OK, 709 01:01:41,819 --> 01:01:48,056 where you let H of k, i, let me do it below here 710 01:01:48,056 --> 01:01:54,957 because I have room for it. So, H of k, i is equal to 711 01:01:54,957 --> 01:02:03,467 H_1 of k plus i times H_2 of k, all mod m. 712 01:02:03,467 --> 01:02:07,157 So, you have two hash functions, 713 01:02:07,157 --> 01:02:13,085 H_1 of k and H_2 of k.
OK, so you compute the two hash 714 01:02:13,085 --> 01:02:19,907 functions, and what you do is you start by just using H_1 of k 715 01:02:19,907 --> 01:02:23,486 for the zero probe, because here, 716 01:02:23,486 --> 01:02:26,282 i, then, will be zero. OK. 717 01:02:26,282 --> 01:02:34,000 Then, for the probe number one, OK, you just add H_2 of k. 718 01:02:34,000 --> 01:02:37,466 For probe number two, you just add that hash function 719 01:02:37,466 --> 01:02:40,266 amount again. You just keep adding H_2 of k 720 01:02:40,266 --> 01:02:42,533 for each successive probe you make. 721 01:02:42,533 --> 01:02:45,933 So, it's fairly easy; you compute two hash functions 722 01:02:45,933 --> 01:02:48,599 up front, OK, or you can delay the second 723 01:02:48,599 --> 01:02:50,400 one, in case you don't need it. But basically, 724 01:02:50,400 --> 01:02:54,000 you compute two up front, and then you just keep adding 725 01:02:54,000 --> 01:02:57,066 the second one in. You start at the location of 726 01:02:57,066 --> 01:03:00,066 the first one, and keep adding the second one, 727 01:03:00,066 --> 01:03:04,000 mod m, to determine your probe sequences. 728 01:03:04,000 --> 01:03:07,757 So, this is an excellent method. 729 01:03:07,757 --> 01:03:14,181 OK, it does a fine job, and you usually pick m to be a 730 01:03:14,181 --> 01:03:19,393 power of two here, OK, so that you're using, 731 01:03:19,393 --> 01:03:25,939 usually people use this with the multiplication method, 732 01:03:25,939 --> 01:03:30,787 for example, so that m is a power of two, 733 01:03:30,787 --> 01:03:36,000 and H_2 of k you force to be odd. 734 01:03:36,000 --> 01:03:40,578 OK, so we don't use an even value there, because otherwise 735 01:03:40,578 --> 01:03:44,210 for any particular key, you'd be skipping over slots. 736 01:03:44,210 --> 01:03:49,105 Once again, you would have the problem that everything could be 737 01:03:49,105 --> 01:03:53,526 even, or everything could be odd as you're going through.
738 01:03:53,526 --> 01:03:57,788 But, if you make H_2 of k odd, and m is a power of two, 739 01:03:57,788 --> 01:04:00,631 you are guaranteed to hit every slot. 740 01:04:00,631 --> 01:04:03,157 OK, so let's analyze this scheme. 741 01:04:03,157 --> 01:04:09,000 This turns out to be a pretty interesting scheme to analyze. 742 01:04:09,000 --> 01:04:14,080 It's got some nice math in it. So, once again, 743 01:04:14,080 --> 01:04:18,032 in the worst case, hashing is lousy. 744 01:04:18,032 --> 01:04:23,000 So, we're going to analyze average case. 745 01:04:35,000 --> 01:04:45,615 OK, and for this, we need a little bit stronger 746 01:04:45,615 --> 01:04:59,230 assumption than for chaining. And, we call it the assumption 747 01:04:59,230 --> 01:05:09,846 of uniform hashing, which says that each key is 748 01:05:09,846 --> 01:05:19,769 equally likely, OK, to have any one of the m 749 01:05:19,769 --> 01:05:32,000 factorial permutations as its probe sequence, 750 01:05:32,000 --> 01:05:34,000 independent of other keys. 751 01:05:45,000 --> 01:05:55,291 And, the theorem we're going to prove is that the expected 752 01:05:55,291 --> 01:06:03,777 number of probes is, at most, one over one minus 753 01:06:03,777 --> 01:06:11,000 alpha if alpha is less than one, OK, 754 01:06:11,000 --> 01:06:17,000 that is, if the number of keys in the table is less than number 755 01:06:17,000 --> 01:06:20,870 of slots. OK, so we're going to show that 756 01:06:20,870 --> 01:06:26,000 the number of probes is one over one minus alpha. 757 01:06:34,000 --> 01:06:38,700 So, alpha is the load factor, and of course, 758 01:06:38,700 --> 01:06:44,057 for open addressing, we want the load factor to be 759 01:06:44,057 --> 01:06:49,852 less than one because if we have more keys than slots, 760 01:06:49,852 --> 01:06:56,520 open addressing simply doesn't work, OK, because you've got to 761 01:06:56,520 --> 01:07:00,784 find a place for every key in the table. 
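A sketch of that guarantee in Python: h1 and h2 below are placeholder hash functions of my own, chosen only so that h2 is always odd while m is a power of two, which is what makes the probe sequence hit every slot.

```python
# Double hashing: h(k, i) = (h1(k) + i * h2(k)) mod m, with m a power
# of two and h2(k) forced to be odd so the sequence hits every slot.
m = 8                                  # power of two

def h1(k):
    return k % m                       # placeholder first hash

def h2(k):
    return (k % (m - 1)) | 1           # placeholder second hash, forced odd

def probe_sequence(k):
    return [(h1(k) + i * h2(k)) % m for i in range(m)]

# Because gcd(h2(k), m) = 1, the m probes visit all m slots exactly once:
for k in (496, 586, 133):
    assert sorted(probe_sequence(k)) == list(range(m))
```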
762 01:07:00,784 --> 01:07:05,485 So, the proof, we'll look at an unsuccessful 763 01:07:05,485 --> 01:07:12,908 search, OK? So, the first thing is that one 764 01:07:12,908 --> 01:07:21,141 probe is always necessary. OK, so if I have n over m, 765 01:07:21,141 --> 01:07:29,533 sorry, if I have n items stored in m slots, what's the 766 01:07:29,533 --> 01:07:38,875 probability that when I do that probe I get a collision with 767 01:07:38,875 --> 01:07:46,000 something that's already in the table? 768 01:07:46,000 --> 01:07:51,526 What's the probability that I get a collision? 769 01:07:51,526 --> 01:07:53,982 Yeah? Yeah, n over m, 770 01:07:53,982 --> 01:07:57,298 right? So, with probability, 771 01:07:57,298 --> 01:08:04,052 n over m, we have a collision because my table has got n 772 01:08:04,052 --> 01:08:08,487 things in there. I'm hashing, 773 01:08:08,487 --> 01:08:15,551 at random, to one of them. OK, so, what are the odds I hit 774 01:08:15,551 --> 01:08:21,376 something, n over m? And then, a second probe is 775 01:08:21,376 --> 01:08:24,102 necessary. OK, so then, 776 01:08:24,102 --> 01:08:30,175 I do a second probe. And, with what probability on 777 01:08:30,175 --> 01:08:36,000 the second probe do I get a collision? 778 01:08:36,000 --> 01:08:40,158 So, we're going to make the assumption of uniform hashing. 779 01:08:40,158 --> 01:08:44,536 Each key is equally likely to have any one of the m factorial 780 01:08:44,536 --> 01:08:47,017 permutations as its probe sequence. 781 01:08:47,017 --> 01:08:50,810 So, what is the probability that on the second probe, 782 01:08:50,810 --> 01:08:53,000 OK, I get a collision? 783 01:09:10,000 --> 01:09:14,778 Yeah? If it's a permutation, 784 01:09:14,778 --> 01:09:21,504 you're not, right? Something like that. 785 01:09:21,504 --> 01:09:30,000 What is it exactly? So, that's the question. 
786 01:09:30,000 --> 01:09:35,478 OK, so you are not going to hit the same slot because it's going 787 01:09:35,478 --> 01:09:37,652 to be a permutation. Yeah? 788 01:09:37,652 --> 01:09:41,913 That's exactly right. n minus one over m minus one, 789 01:09:41,913 --> 01:09:45,652 because I'm now, I've essentially eliminated 790 01:09:45,652 --> 01:09:48,694 that slot that I hit the first time. 791 01:09:48,694 --> 01:09:52,694 And, there was a key there. 792 01:09:52,694 --> 01:09:56,347 So, now I'm essentially looking, at random, 793 01:09:56,347 --> 01:10:00,782 into the remaining m minus one slots, where there are 794 01:10:00,782 --> 01:10:06,000 aggregately n minus one keys in those slots. 795 01:10:06,000 --> 01:10:11,306 OK, everybody got that? OK, so with that probability, 796 01:10:11,306 --> 01:10:16,204 I get a collision. That means that a third 797 01:10:16,204 --> 01:10:18,142 probe is necessary, OK? 798 01:10:18,142 --> 01:10:23,346 And, we keep going on. OK, so what is it going to be 799 01:10:23,346 --> 01:10:27,836 the next time? Yeah, it's going to be n minus 800 01:10:27,836 --> 01:10:33,939 two over m minus two. So, let's note, 801 01:10:33,939 --> 01:10:44,716 OK, that n minus i over m minus i is less than n over m, 802 01:10:44,716 --> 01:10:49,027 which equals alpha, OK? 803 01:10:49,027 --> 01:11:00,000 So, n minus i over m minus i is less than n over m. 804 01:11:00,000 --> 01:11:05,505 And, the way you can sort of reason that is that if n is less 805 01:11:05,505 --> 01:11:11,287 than m, I'm subtracting a larger fraction of n when I subtract i 806 01:11:11,287 --> 01:11:14,682 than I am subtracting a fraction of m. 807 01:11:14,682 --> 01:11:18,720 OK, so therefore, n minus i over m minus i is 808 01:11:18,720 --> 01:11:23,858 going to be less than n over m. OK, or you can do the 809 01:11:23,858 --> 01:11:27,070 algebra.
I think it's always helpful 810 01:11:27,070 --> 01:11:31,842 when you do algebra to sort of think about it 811 01:11:31,842 --> 01:11:36,705 qualitatively as well as quantitatively, to see what's 812 01:11:36,705 --> 01:11:42,119 going on. So, the expected number of 813 01:11:42,119 --> 01:11:46,559 probes is, then, going to be equal to, 814 01:11:46,559 --> 01:11:53,399 it's going to be equal to, because we're going to need some 815 01:11:53,399 --> 01:12:00,600 space, well, we have one which is forced because we've got to 816 01:12:00,600 --> 01:12:09,308 do one probe, plus with probability n over m, 817 01:12:09,308 --> 01:12:21,313 I have to do another probe, plus with probability n minus one 818 01:12:21,313 --> 01:12:33,930 over m minus one I have to do another probe, up until I do one plus one 819 01:12:33,930 --> 01:12:40,276 over m minus n plus one. OK, so each one is cascading 820 01:12:40,276 --> 01:12:42,553 what's happened. In the book, 821 01:12:42,553 --> 01:12:47,432 there is a more rigorous proof of this using indicator random 822 01:12:47,432 --> 01:12:50,767 variables. I'm going to give you the short 823 01:12:50,767 --> 01:12:52,800 version. OK, so basically, 824 01:12:52,800 --> 01:12:56,784 this is my first probe. With probability n over m, 825 01:12:56,784 --> 01:13:01,338 I had to do a second one. And, the result of that is that 826 01:13:01,338 --> 01:13:04,997 with probability n minus one over m minus one, 827 01:13:04,997 --> 01:13:08,982 I have to do another. And, with probability n minus 828 01:13:08,982 --> 01:13:12,397 two over m minus two, I have to do another, 829 01:13:12,397 --> 01:13:18,857 and so forth. So, that's how many probes I'm 830 01:13:18,857 --> 01:13:25,542 going to end up doing. So, this is less than or equal 831 01:13:25,542 --> 01:13:31,457 to one plus alpha.
It's one plus alpha times, 832 01:13:31,457 --> 01:13:39,042 nested, one plus alpha times one plus alpha, OK, just using the fact 833 01:13:39,042 --> 01:13:45,536 that I had here. OK, and that is less than or 834 01:13:45,536 --> 01:13:51,347 equal to one plus, I just multiply through here, 835 01:13:51,347 --> 01:13:57,410 alpha plus alpha squared plus alpha cubed, and so on. 836 01:13:57,410 --> 01:14:01,957 I can just take that out to infinity. 837 01:14:01,957 --> 01:14:10,206 It's going to bound this. OK, does everybody see the math 838 01:14:10,206 --> 01:14:14,954 there? OK, and that is just the sum, 839 01:14:14,954 --> 01:14:20,653 i equals zero to infinity, of alpha to the i, 840 01:14:20,653 --> 01:14:28,929 which is equal to one over one minus alpha using your familiar 841 01:14:28,929 --> 01:14:34,615 geometric series bound. OK, and there's also, 842 01:14:34,615 --> 01:14:38,076 in the textbook, an analysis of the successful 843 01:14:38,076 --> 01:14:41,230 search, which, once again, is a little bit 844 01:14:41,230 --> 01:14:45,384 more technical because you have to worry about what the 845 01:14:45,384 --> 01:14:50,000 distribution is that you happen to have in the table when you 846 01:14:50,000 --> 01:14:54,230 are searching for something that's already in the table. 847 01:14:54,230 --> 01:14:58,538 But, it turns out it's also bounded by one over one minus 848 01:14:58,538 --> 01:15:04,920 alpha. So, let's just look to see what 849 01:15:04,920 --> 01:15:11,269 that means. So, if alpha is a constant less 850 01:15:11,269 --> 01:15:18,253 than one, it implies that it takes order 851 01:15:18,253 --> 01:15:24,761 one probes. OK, so if alpha is a constant, 852 01:15:24,761 --> 01:15:33,621 it takes order one probes. OK, but it's helpful to 853 01:15:33,621 --> 01:15:40,706 understand what's happening with the constant.
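[The cascaded expectation and the geometric-series bound can be compared numerically; this Python sketch is an editor's illustration under the uniform hashing assumption, with names of my own choosing.]

```python
def expected_probes(n, m):
    """Expected probes in an unsuccessful search under uniform hashing,
    evaluated from the innermost term of the cascade outward:
    1 + n/m * (1 + (n-1)/(m-1) * (... * (1 + 1/(m-n+1)) ...))."""
    e = 1.0
    for i in reversed(range(n)):         # i = n-1 down to 0
        e = 1.0 + (n - i) / (m - i) * e
    return e

n, m = 900, 1000                         # load factor alpha = 0.9
bound = 1.0 / (1.0 - n / m)              # geometric bound, here 10 probes
# the exact cascaded expectation never exceeds 1/(1 - alpha)
print(expected_probes(n, m) <= bound)    # True
```

For a tiny sanity check: with one key in two slots, the first probe collides with probability 1/2 and the second always succeeds, so the expectation is 1.5, below the bound of 2.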
854 01:15:40,706 --> 01:15:47,161 So, for example, if the table is 50% full, 855 01:15:47,161 --> 01:15:54,719 so alpha is a half, what's the expected number of 856 01:15:54,719 --> 01:16:03,378 probes by this analysis? Two, because one over one minus 857 01:16:03,378 --> 01:16:11,531 a half is two. If I let the table fill up to 858 01:16:11,531 --> 01:16:17,937 90%, how many probes do I need on average? 859 01:16:17,937 --> 01:16:22,781 Ten. So, you can see that as you 860 01:16:22,781 --> 01:16:30,437 fill up the table, the cost is going up dramatically, 861 01:16:30,437 --> 01:16:33,955 OK? And so, typically, 862 01:16:33,955 --> 01:16:37,865 you don't let the table get too full. 863 01:16:37,865 --> 01:16:43,297 OK, you don't want to be pushing 99.9% utilization. 864 01:16:43,297 --> 01:16:49,706 Oh, I got this great hash table that's got full utilization. 865 01:16:49,706 --> 01:16:52,964 It's like, yeah, and it's slow. 866 01:16:52,964 --> 01:16:55,571 It's really, really slow, 867 01:16:55,571 --> 01:17:02,415 OK, because as alpha approaches one, the time is approaching 868 01:17:02,415 --> 01:17:06,000 essentially m, or n. 869 01:17:06,000 --> 01:17:08,050 Good. So, next time, 870 01:17:08,050 --> 01:17:14,419 we are going to address head-on what is, I think, 871 01:17:14,419 --> 01:17:18,737 one of the most interesting ideas in algorithms. 872 01:17:18,737 --> 01:17:25,213 We are going to talk about how you solve this problem that no 873 01:17:25,213 --> 01:17:31,798 matter what hash function you pick, there's a bad set of keys. 874 01:17:31,798 --> 01:17:38,058 OK, so next time we're going to show that there are ways of 875 01:17:38,058 --> 01:17:42,592 confronting that problem, very clever ways. 876 01:17:42,592 --> 01:17:45,000 And we'll use a lot of math for it, so it will be a really fun lecture.
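[The load-factor figures quoted in the lecture, about two probes at 50% full and about ten at 90%, come straight from the one over one minus alpha bound; this quick Python check is an editor's illustration.]

```python
def probe_bound(alpha):
    """Upper bound on expected probes for an unsuccessful search in an
    open-addressed table with load factor alpha, valid for alpha < 1."""
    assert 0 <= alpha < 1
    return 1.0 / (1.0 - alpha)

print(probe_bound(0.5))    # 2.0 -- half full: about two probes
print(probe_bound(0.9))    # about ten probes when 90% full
print(probe_bound(0.999))  # roughly 1000 -- why 99.9% utilization is slow
```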