PETER SZOLOVITS: All right. Let's get started. Good afternoon.

So last time, I started talking about the use of natural language processing to process clinical data. And things went a little bit slowly, and so we didn't get through a lot of the material. I'm going to try to rush a bit more today. And as a result, I have a lot of stuff to cover.

So if you remember, last time, I started by saying that a lot of the NLP work involves coming up with phrases that one might be interested in to help identify the kinds of data that you want, and then just looking for those in text. So that's a very simple method, but it's one that works reasonably well. And then Kat Liao was here to talk about some of the applications of that kind of work in what she's been doing in cohort selection.

So what I want to talk about today is more sophisticated versions of that, and then move on to more contemporary approaches to natural language processing.

So this is a paper that was given to you as one of the optional readings last time. And it's work from David Sontag's lab, where they said, well, how do we make this more sophisticated? So they start the same way. They say, OK, Dr. Liao, let's say, give me terms that are very good indicators that I have the right kind of patient, if I find them in the patient's notes. So these are things with high predictive value. So you don't want to use a term like "sick," because that's going to find way too many people. But you want to find something that is very specific but that has a high predictive value, so that you are going to find the right person.

And then what they did is they built a model that tries to predict the presence of that word in the text from everything else in the medical record. So now, this is an example of a silver-standard way of training a model that says, well, I don't have the energy or the time to get doctors to look through thousands and thousands of records.
But if I select these anchors well enough, then I'm going to get a high yield of correct responses from those. And then I train a machine learning model that learns to identify those same terms, or those same records that have those terms in them. And by the way, from that, we're going to learn a whole bunch of other terms that are proxies for the ones that we started with. So this is a way of enlarging that set of terms automatically.

And so there are a bunch of technical details that you can find out about by reading the paper. They used a relatively simple representation, which is essentially a bag-of-words representation. They then masked the three words around the word that they're actually trying to predict, just to get rid of short-term syntactic correlations. And then they built an L2-regularized logistic regression model that said, what are the features that predict the occurrence of this word? And then they expanded the search vocabulary to include those features as well. And again, there are tons of details about how to discretize continuous values and things like that that you can find out about.

So you build a phenotype estimator from the anchors and the chosen predictors. They calculated a calibration score for each of these other predictors that told you how well it predicted. And then you can build a joint estimator that uses all of these. And the bottom line is that they did very well.

So in order to evaluate this, they looked at eight different phenotypes for which they had human judgment data. And so this tells you that they're getting AUCs of between 0.83 and 0.95 for these different phenotypes. So that's quite good. They, in fact, were estimating not only these eight phenotypes but 40-something; I don't remember the exact number, a much larger number. But they didn't have validated data against which to test the others.
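To make the mechanics concrete, here is a minimal sketch of the anchor idea using scikit-learn. It is not the authors' implementation: the example notes, the anchor term, and the masking details are made up for illustration.

```python
# A minimal sketch of anchor-based "silver standard" learning, assuming
# scikit-learn is available. The notes and the anchor term are made-up
# examples, not data from the paper.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

notes = [
    "patient with diabetes mellitus on insulin therapy",
    "type 2 diabetes managed with insulin and metformin",
    "no history of hyperglycemia presents with ankle sprain",
    "follow-up for hypertension blood pressure well controlled",
]
anchor = "diabetes"

def mask_anchor(text, anchor, window=3):
    """Remove the anchor and the `window` words on either side of it,
    so the model cannot just learn trivial local context."""
    tokens = text.split()
    keep, i = [], 0
    while i < len(tokens):
        if tokens[i] == anchor:
            keep = keep[:max(0, len(keep) - window)]   # drop preceding window
            i += window + 1                            # skip anchor + following window
        else:
            keep.append(tokens[i])
            i += 1
    return " ".join(keep)

y = np.array([anchor in t.split() for t in notes], dtype=int)  # silver-standard label
X_text = [mask_anchor(t, anchor) for t in notes]

vec = CountVectorizer()                                        # bag-of-words representation
X = vec.fit_transform(X_text)
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)   # L2-regularized
clf.fit(X, y)

# Words with the largest positive weights act as learned proxies for the anchor.
vocab = np.array(vec.get_feature_names_out())
print(vocab[np.argsort(clf.coef_[0])[::-1][:5]])
```

In the real system, the learned proxy terms are then calibrated and combined into a joint phenotype estimator; the sketch only shows the vocabulary-expansion step.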
But the expectation is that if it does well on these, it probably does well on the others as well. So this was a very nice idea.

And just to illustrate, if you start with something like diabetes as a phenotype, you say, well, I'm going to look for anchors that are an ICD-9 code of 250, diabetes mellitus, or I'm going to look at the medication history for diabetic therapy. So those are the silver-standard anchors that I'm looking at. And those, in fact, have a high predictive value for somebody being in the cohort. And then they identify all these other features that predict those, and therefore, in turn, predict appropriate selectors for the phenotype that they're interested in. And if you look at the paper again, what you see is that this outperforms, over time, the standard supervised baseline that they're comparing against: you're getting much higher accuracy early in a patient's visit to be able to identify them as belonging to this cohort.

I'm going to come back later to look at another, similar attempt to generalize from a core set of terms using a different set of techniques. So you should see that in about 45 minutes, I hope.

Well, context is important. So if you look at a sentence like "Mr. Huntington was treated for Huntington's disease at Huntington Hospital, located on Huntington Avenue," each of those mentions of the word Huntington is different. And for example, if you're interested in eliminating personally identifiable health information from a record like this, then certainly you want to get rid of the Mr. Huntington part. You don't want to get rid of Huntington's disease, because that's a medically relevant fact. And you probably do want to get rid of Huntington Hospital and its location on Huntington Avenue, although those are not necessarily something that you're prohibited from retaining. So for example, if you're trying to do quality studies among different hospitals, then it would make sense to retain the name of the hospital, which is not considered identifying of the individual.
So we, in fact, did a study back in the mid-2000s where we were trying to build an improved de-identifier. And here's the way we went about it. This is a kind of kitchen-sink approach that says, OK, take the text, tokenize it, look at every single token, and derive things from it: the words that make up the token, the part of speech, how it's capitalized, whether there's punctuation around it, which document section it's in. Many databases have a sort of conventional document structure. If you've looked at the MIMIC discharge summaries, for example, there's a kind of prototypical way in which the note flows from beginning to end. And you can use that structural information.

We then identified a bunch of patterns and thesaurus terms. So we looked up words and phrases in the UMLS to see if they matched some clinically meaningful term. We had patterns that identified things like phone numbers and social security numbers and addresses and so on. And then we did parsing of the text. In those days, we used something called the Link Grammar Parser, although it doesn't make a whole lot of difference which parser. You get either a constituency or a dependency parse, which gives you relationships among the words. And so it allows you to include, as features, the way in which a word that you're looking at relates to other words around it.

And so what we did is we said, OK, the lexical context includes all of the above kinds of information for all of the words that are either literally adjacent or within n words of the original word that you're focusing on, or that are linked within k links through the parse to that word. So this gives you a very large set of features.

And of course, parsing is not a solved problem. And so this is an example from that story that I showed you last time. And as you see, it comes up with 24 ambiguous parses of this sentence. So there are technical problems about how to deal with that.
Today, you could use a different parser. The Stanford Parser, for example, probably does a better job than the one we were using 14 years ago, and it gives you at least more definitive answers. And so you could use that instead.

And so if you look at what we did, we said, well, here is the text "Mr." And here are all the ways that you can look it up in the UMLS. And it turns out to be very ambiguous. So M-R stands not only for mister, but it also stands for magnetic resonance, and it stands for a whole bunch of other things. And so you get huge amounts of ambiguity. "Blind" also turns out to give you various ambiguities; it maps here to four different concept unique identifiers (CUIs). "Is" is OK. "79-year-old" is OK. And then "male," again, maps to five different concept unique identifiers. So there are all these problems of over-generation from this database. And here's some more, but I'm going to skip over that.

And then the learning model, in our case, was a support vector machine for this project, in which we just said, well, throw in all the features. You know, it's the "kill them all, and God will sort them out" kind of approach. So we just threw in all these features and said, oh, support vector machines are really good at picking out exactly which are the best features. And so we just relied on that. And sure enough, you wind up with literally millions of features, but it worked pretty well. And so Stat De-ID was our program. And you see that on real discharge summaries, we're getting precision and recall on PHI up around 98.5% and 95.25%, which was much better than the previous state of the art, which had been based on rules and dictionaries as a way of de-identifying things. So this was a successful example of that approach.

And of course, this is usable not only for de-identification, but also for entity recognition.
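Here is a schematic sketch of that token-level, kitchen-sink feature approach, assuming scikit-learn. The features, regular expressions, and toy labels are invented; the actual system also used UMLS lookups, section information, and parse-link features.

```python
# A schematic sketch of the kitchen-sink, token-level feature approach to
# de-identification / entity tagging, assuming scikit-learn. The features,
# patterns, and tiny training set are invented for illustration.
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

PHONE = re.compile(r"^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$")

def token_features(tokens, i):
    tok = tokens[i]
    feats = {
        "lower=" + tok.lower(): 1,
        "is_capitalized": int(tok[:1].isupper()),
        "is_digit": int(tok.isdigit()),
        "looks_like_phone": int(bool(PHONE.match(tok))),
    }
    # lexical context: neighboring tokens within a small window
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"ctx{offset}=" + tokens[j].lower()] = 1
    return feats

# Toy training data: 1 = PHI token, 0 = not PHI
sentences = [
    (["Mr.", "Huntington", "was", "treated", "for", "Huntington's", "disease"],
     [1, 1, 0, 0, 0, 0, 0]),
    (["Call", "617-555-1212", "for", "follow-up"],
     [0, 1, 0, 0]),
]

X_dicts, y = [], []
for toks, labels in sentences:
    for i, lab in enumerate(labels):
        X_dicts.append(token_features(toks, i))
        y.append(lab)

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)
clf = LinearSVC()          # the SVM is left to pick out the useful features
clf.fit(X, y)

print(clf.predict(vec.transform([token_features(["Dr.", "Szolovits"], 1)])))
```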
That's because instead of selecting entities that are personally identifiable health information, you could train it to select entities that are diseases, or that are medications, or that are various other things. And so this was, in the 2000s, a pretty typical way for people to approach these kinds of problems. And it's still used today. There are tools around that let you do this, and they work reasonably effectively. They're not state of the art at the moment, but they're simpler than many of today's state-of-the-art methods.

So here's another approach. This was something we published a few years ago, where we started working with some psychiatrists and said, could we predict 30-day readmission for a psychiatric patient with any degree of reliability? That's a hard prediction. Willie is currently running an experiment where we're asking psychiatrists to predict that, and it turns out they're barely better than chance at that prediction. So it's not an easy task.

And what we did is we said, well, let's use topic modeling. And so we had this cohort of close to 5,000 patients. About 10% of them were readmitted with a psych diagnosis, and almost 3,000 of them were readmitted with other diagnoses. So one thing this tells you right away is that if you're dealing with psychiatric patients, they come and go to the hospital frequently. And this is not good for the hospital's bottom line because of reimbursement policies of insurance companies and so on. So of the 4,700, only 1,240 were not readmitted within 30 days. So there's very frequent bounce-back.

So we said, well, let's try building a baseline model using a support vector machine from baseline clinical features like age, gender, and public health insurance as a proxy for socioeconomic status. So if you're on Medicaid, you're probably poor. And if you have private insurance, then you're probably an MIT employee and/or better off.
So that's a frequently used proxy. We also used a comorbidity index, which tells you roughly how sick you are from things other than your psychiatric problems.

And then we said, well, what if we add to that model common words from notes? So we said, let's do a TF-IDF calculation. This is the term frequency weighted by the log of the inverse document frequency, so it's a measure of how specific a term is for identifying a particular kind of condition. And we take the 1,000 most informative words, and there are a lot of these. So if you use the 1,000 most informative words from these nearly 5,000 patients, you wind up with something like 66,000 unique words that are informative for some patient. But if you limit yourself to the top 10, then it only uses 18,000 words. And if you limit yourself to the top one, then it uses about 3,000 words.

And then we said, well, instead of doing individual words, let's do a latent Dirichlet allocation. So, topic modeling on all of the words, as a bag of words: no sequence information, just the collection of words. And so we calculated 75 topics using LDA on all these notes. Just to remind you, the LDA process is a model that says every document consists of a certain mixture of topics, and each of those topics probabilistically generates certain words. And so you can build a model like this and then solve it using complicated techniques.

And you wind up with topics, in this study, as follows. I don't know, can you read these? They may be too small. So these are unsupervised topics. And if you look at the first one, it says patient, alcohol, withdrawal, depression, drinking, Ativan, ETOH, drinks, medications, clinic, inpatient, diagnosis, days, hospital, substance, use, treatment, program, name (that's a de-identified name placeholder), abuse, problem, number. And we had our experts look at these topics. And they said, oh, well, that topic is related to alcohol abuse, which seems reasonable. And then you see, on the bottom, psychosis, thought, features, paranoid, psychosis, paranoia, symptoms, psychiatric, et cetera. And they said, OK, that's a psychosis topic. So in retrospect, you can assign meaning to these topics. But in fact, they're generated without any a priori notion of what they ought to be. They're just a statistical summarization of the common co-occurrences of words in these documents.
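As a concrete illustration of the two note representations, here is a minimal scikit-learn sketch; the example notes are invented, and the study itself used roughly 4,700 admissions and 75 topics.

```python
# A minimal sketch of the two note-representation schemes described above,
# assuming scikit-learn: TF-IDF word features, and LDA topics over a
# bag-of-words. The example notes are invented.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

notes = [
    "patient with alcohol withdrawal started on ativan taper",
    "paranoid ideation and auditory hallucinations consistent with psychosis",
    "depression with suicidal ideation admitted for safety",
    "alcohol abuse history denies current drinking",
]

# 1. TF-IDF: score how informative each word is for each note
tfidf = TfidfVectorizer(max_features=1000)
X_tfidf = tfidf.fit_transform(notes)

# 2. LDA topics over raw counts (bag of words, no sequence information)
counts = CountVectorizer()
X_counts = counts.fit_transform(notes)
lda = LatentDirichletAllocation(n_components=3, random_state=0)  # 75 in the study
theta = lda.fit_transform(X_counts)        # per-note topic mixtures

# Show the top words in each unsupervised topic
vocab = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]
    print(f"topic {k}:", ", ".join(vocab[i] for i in top))

# theta (the topic mixture for each note) then becomes extra features
# alongside the baseline clinical variables in the readmission model.
```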
But what you find is that if you use the baseline model, which used just the demographic and clinical variables, and you ask, what's the difference in survival (in this case, in time to readmission) between one set and another in this cohort, the answer is that they're pretty similar. Whereas if you use a model that predicts based on the baseline plus the 75 topics that we identified, you get a much bigger separation. And of course, this is statistically significant. And it tells you that this technique is useful for improving the separation of a cohort that's more likely to be readmitted from a cohort that's less likely to be readmitted. It's not a terrific prediction; the AUC for this model was only on the order of 0.7. So you know, it's not like 0.99. But nevertheless, it provides useful information.

The same group of psychiatrists that we worked with also did a study with a much larger cohort but much less rich data. So they got all of the discharges from two medical centers over a period of 12 years. So they had 845,000 discharges from 458,000 unique individuals. And they were looking for suicide or other causes of death in these patients, to see if they could predict whether somebody is likely to try to harm themselves, or whether they're likely to die accidentally, which sometimes can't be distinguished from suicide. So the censoring problems that David talked about are very much present in this, because you lose track of people. It's a highly imbalanced data set.
Because out of those 845,000 discharges, only 235 patients committed suicide, which is, of course, probably a good thing from a societal point of view, but it makes the data analysis hard. On the other hand, all-cause mortality was about 18% during nine years of follow-up. So that's not so imbalanced.

And then what they did is they curated a list of 3,000 terms that correspond to what, in the psychiatric literature, is called positive valence. So these are concepts like joy and happiness and good stuff, as opposed to negative valence, like depression and sorrow and all that stuff. And they said, well, we can use these types of terms in order to help distinguish among these patients.

And what they found is that if you plot the Kaplan-Meier curve for different quartiles of risk for these patients, you see that there's a pretty big difference between the different quartiles. And you can certainly identify the people who are more likely to commit suicide from the people who are less likely to do so. This curve is for suicide or accidental death. So this is a much larger data set, and therefore the error bars are smaller. But you see the same kind of separation here. So these are all useful techniques.
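A minimal sketch of that kind of quartile comparison, assuming the lifelines package; the risk scores, follow-up times, and events below are simulated stand-ins, not the study's data.

```python
# Kaplan-Meier curves stratified by risk quartile, assuming lifelines.
# All data here is simulated purely to illustrate the plotting pattern.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
n = 2000
risk = rng.uniform(size=n)                          # a model's predicted risk
time = rng.exponential(scale=1000 * (1.5 - risk))   # higher risk -> shorter time
event = rng.uniform(size=n) < 0.2 * (0.5 + risk)    # whether the event was observed
df = pd.DataFrame({"risk": risk, "time": time, "event": event})
df["quartile"] = pd.qcut(df["risk"], 4, labels=["Q1", "Q2", "Q3", "Q4"])

ax = plt.subplot(111)
for q, grp in df.groupby("quartile"):
    kmf = KaplanMeierFitter()
    kmf.fit(grp["time"], event_observed=grp["event"], label=str(q))
    kmf.plot_survival_function(ax=ax)
plt.xlabel("days of follow-up")
plt.ylabel("event-free fraction")
plt.show()
```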
Now I'll turn to another approach. This was work by one of my students, Yuan Luo, who was working with some lymphoma pathologists at Mass General. And the approach they took was to say, well, if you read a pathology report about somebody with lymphoma, can we tell what type of lymphoma they had from the pathology report if we blank out the part of the report that says, "I, the pathologist, think this person has non-Hodgkin's lymphoma," or something like that? So from the rest of the context, can we make that prediction?

Now, Yuan took a kind of interesting, slightly odd approach to it, which is to treat this as an unsupervised learning problem rather than as a supervised learning problem. So he literally masked the real answer and said, if we just treat everything except what gives away the answer as just data, can we essentially cluster that data in some interesting way so that we rediscover the different types of lymphoma? Now, the reason this turns out to be important is that lymphoma pathologists keep arguing about how to classify lymphomas. Every few years, they revise the classification rules. And so part of his objective was to say, let's try to provide an unbiased, data-driven method that may help identify appropriate characteristics by which to classify these different lymphomas.

So his approach was a tensor factorization approach. You often see data sets like this that are, say, patient by characteristic. In this case, laboratory measurements: systolic and diastolic blood pressure, sodium, potassium, et cetera. That's a very vanilla matrix encoding of data. And then if you add a third dimension to it, like this is at the time of admission, 30 minutes later, 60 minutes later, 90 minutes later, now you have a three-dimensional tensor.

And so just as you can do matrix factorization, as in the picture above, where we say, I'm going to assume my matrix of data is generated by a product of two matrices that are smaller in dimension, and you can train this by saying, I want entries in those two matrices that minimize the reconstruction error: if I multiply these matrices together, then I get back my original matrix plus error, and I want to minimize that error, usually root mean square, or mean square error, or something like that. Well, you can play the same game for a tensor by having a so-called core tensor, which identifies the subsets of characteristics that subdivide each dimension of your data. And then what you do is the same game. You have factor matrices corresponding to each of the dimensions, and if you multiply this core tensor by each of these matrices, you reconstruct the original tensor. And you can train it, again, to minimize the reconstruction loss.
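To make the core-tensor idea concrete, here is a minimal Tucker-decomposition sketch using the tensorly library, on random data. Note that Yuan's actual method, SANTF, is a non-negative tensor factorization with additional structure; this shows only the generic reconstruction-error setup.

```python
# A minimal sketch of Tucker-style tensor factorization, assuming the
# tensorly library. The tensor here is random; in the study the modes were
# patients x words x linguistic subgraphs.
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

X = tl.tensor(np.random.rand(50, 200, 30))      # patients x words x subgraphs

# Core tensor plus one factor matrix per mode
core, factors = tucker(X, rank=[5, 8, 4])

# Multiplying the core back through the factor matrices reconstructs the data
X_hat = tl.tucker_to_tensor((core, factors))
error = tl.norm(X - X_hat) / tl.norm(X)
print(f"relative reconstruction error: {error:.3f}")

# Rows of factors[0] are low-dimensional patient representations that can be
# clustered to look for lymphoma subtypes.
```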
So there are, again, a few more tricks, because this is dealing with language. This is a typical report from one of these lymphoma pathologists. It says immunohistochemical stains show that the follicles, blah, blah, blah, blah, blah; lots and lots of details. And so he needed a representation that could be put into this tensor factorization form.

And what he did is to say, well, let's see. If we look at a statement like this, immunostains show that large atypical cells are strongly positive for CD30 and negative for these other surface expressions. So the sentence tells us relationships among procedures, types of cells, and immunologic factors. And for the feature choice, we can use words, or we can use UMLS concepts, or we can find various kinds of mappings. But he decided that in order to retain the syntactic relationships here, what he would do is use a graphical representation that came out of, again, parsing all of these sentences. And so what you get is that this creates one graph that talks about the strongly positive for CD30, large atypical cells, et cetera. And then you can factor this into subgraphs. And then you also have to identify frequently occurring subgraphs. So for example, "large atypical cells" appears here, and also appears there, and of course will appear in many other places. Yeah?

AUDIENCE: Is this parsing domain-adapted to clinical language? For example, did they incorporate some sort of medical information here or some sort of linguistic--

PETER SZOLOVITS: So in this particular study, he was using the Stanford Parser with some tricks. The Stanford Parser doesn't know a lot of the medical words, and so he basically marked these things as noun phrases. And then the Stanford Parser also doesn't do well with long lists, like the set of immune features.
And so he would recognize those as a pattern, substitute a single made-up word for them, and that made the parser work much better. So there were a whole bunch of little tricks like that in order to adapt it. But it was not a model trained specifically on this. I think it's trained on the Wall Street Journal corpus or something like that. So it's general English.

AUDIENCE: Those are things that he did manually as opposed to, say, [INAUDIBLE]?

PETER SZOLOVITS: No. He did it algorithmically, but he didn't learn which algorithms to use. He made them up by hand. But then, of course, it's a big corpus, and he ran these programs over it that did those transformations. So he calls it two-phase parsing. There's a reference to his paper on the first slide in this section if you're interested in the details. It's described there.

So what he wound up with is a tensor that has patients on one axis and the words appearing in the text on another axis. So he's still using a bag-of-words representation. But the third axis is these language concept subgraphs that we were talking about. And then he does tensor factorization on this. And what's interesting is that it works much better than I expected. So if you look at his technique, which he called SANTF, the precision and recall are about 0.72 and 0.854 macro-averaged and 0.754 micro-averaged, which is much better than the non-negative matrix factorization results, which use only patient by word or patient by subgraph, or, in fact, one where you simply take patient and concatenate the subgraphs and the words in one dimension. So that means that this is actually taking advantage of the three-way relationship.

If you read papers from about 15 or 20 years ago, people got very excited about the idea of bi-clustering, which is, in modern terms, the equivalent of matrix factorization.
It says, given two dimensions of data, I want to cluster things, but I want to cluster them in such a way that the clustering of one dimension helps the clustering of the other dimension. So this is a formal way of doing that relatively efficiently. And tensor factorization is essentially tri-clustering.

So now I'm going to turn to the last of today's big topics, which is language modeling. And this is really where the action is nowadays in natural language processing in general. I would say that natural language processing on clinical data is somewhat behind the state of the art in natural language processing overall. There are fewer corpora that are available, and there are fewer people working on it. And so we're catching up. But I'm going to lead into this somewhat gently.

So what does it mean to model a language? I mean, you could imagine saying it's coming up with a set of parsing rules that define the syntactic structure of the language. Or you could imagine saying, as we suggested last time, coming up with a corresponding set of semantic rules that say that terms in the language correspond to certain concepts, and that they are combinatorially, functionally combined as the syntax directs, in order to give us a semantic representation. We don't know how to do either of those very well. And so the current, contemporary idea about language modeling is to say, given a sequence of tokens, predict the next token. If you could do that perfectly, presumably you would have a good language model. Now obviously, you can't do it perfectly, because we don't always say the same word after some sequence of previous words when we speak. But probabilistically, you can get close to that. And there's usually some kind of Markov assumption that says that the probability of emitting a token, given the stuff that came before it, is ordinarily dependent only on the n previous words rather than on all of history, on everything you've ever said before in your life.
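Written out, the next-token objective and the Markov (n-gram) approximation just described look like this, with perplexity (which comes up next) defined from the same probability. These are the standard textbook definitions rather than formulas from a particular slide.

```latex
P(w_1,\dots,w_T) = \prod_{t=1}^{T} P(w_t \mid w_1,\dots,w_{t-1})
                 \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1},\dots,w_{t-1})

\mathrm{PP}(w_1,\dots,w_T) = P(w_1,\dots,w_T)^{-1/T}
```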
And there's a measure called perplexity, which is essentially the exponentiated entropy of the probability distribution over the predicted words. Roughly speaking, it's the number of likely ways that you could continue the text if all of the possibilities were equally likely.

So perplexity is often used, for example, in speech processing. We did a study where we were trying to build a speech system that understood a conversation between a doctor and a patient. And we ran into real problems, because we were using software that had been developed to interpret dictation by doctors. And that was very well trained. But it turned out, and we didn't know this when we started, that the language that doctors use in dictating medical notes is pretty straightforward, pretty simple. And so its perplexity is about nine, whereas conversations are much more free-flowing and cover many more topics, and so their perplexity is about 73. And the model that works well for perplexity nine doesn't work as well for perplexity 73. So what this tells you about the difficulty of accurately transcribing conversational speech is that it's much harder. And that's still not a solved problem.

Now, you probably all know about Zipf's law. If you empirically just take all the words in all the literature of, let's say, English, what you discover is that the n-th most frequent word is about one over n as probable as the first word. So there is a long-tailed distribution. One thing you should realize, of course, is that if you integrate one over n out to infinity, it diverges. And that may not be an inaccurate representation of language, because language is productive and changes; people make up new words all the time and so on. So it may actually be infinite. But roughly speaking, there is a kind of decline like this. And interestingly, in the Brown corpus, the top 10 words make up almost a quarter of the size of the corpus.
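You can check this yourself in a few lines, assuming NLTK and its copy of the Brown corpus are installed:

```python
# A quick empirical check of Zipf's law on the Brown corpus, assuming NLTK
# is installed (the corpus is fetched with nltk.download).
from collections import Counter
import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)
words = [w.lower() for w in brown.words()]
counts = Counter(words)
total = sum(counts.values())

top = counts.most_common(10)
print("top-10 share of corpus:", sum(c for _, c in top) / total)

# Zipf: the frequency of the rank-n word is roughly proportional to 1/n,
# so rank * frequency should stay in the same ballpark.
for rank, (word, c) in enumerate(counts.most_common(1000), start=1):
    if rank in (1, 10, 100, 1000):
        print(rank, word, c, rank * c)
```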
So you write a lot of the's, of's, and's, a's, to's, in's, et cetera, and much less "hematemesis," obviously.

So what about n-gram models? Well, remember, if we make this Markov assumption, then all we have to do is pay attention to the last n tokens before the one that we're interested in predicting. And so people have generated these large corpora of n-grams. For example, somebody, a couple of decades ago, took all of Shakespeare's writings. I think they were trying to decide whether he had written all his works, or whether the earl of somebody or other was actually the guy who wrote Shakespeare. You know about this controversy? Yeah. So that's why they were doing it. But anyway, they created this corpus. And Shakespeare had a vocabulary of about 30,000 words and about 300,000 distinct bigrams, out of 844 million possible bigrams. So 99.96% of the possible bigrams were never seen. So there's a certain regularity to his production of language.

Now, Google, of course, did Shakespeare one better. They said, hmm, we can take a tera-word corpus; this was in 2006. I wouldn't be surprised if it's a petabyte corpus today. And they published this; they just made it available. So there were 13.6 million unique words that occurred at least 200 times in this tera-word corpus. And there were 1.2 billion five-word sequences that occurred at least 40 times. So these are the statistics. And if you're interested, there's a URL.

And here's a very tiny part of their database. So "ceramics collectibles collectibles" (I don't know) occurred 55 times in a terabyte of text. "Ceramics collectibles fine," "ceramics collectibles by," pottery, cooking, comma, period, end of sentence, and, at, is, et cetera, with different numbers of times. "Ceramics comes from" occurred 660 times, which is a reasonably large number compared to some of its competitors here.
If you look at four-grams, you see things like "serve as the incoming" blah, blah, blah, 92 times; "serve as the index," 223 times; "serve as the initial," 5,300 times. So you've got all these statistics.

And now, given those statistics, we can build a generator. So we can say, all right, suppose I start with the token that is the beginning of a sentence, or the separator between sentences. I sample a random bigram starting with the beginning-of-sentence token and a word, according to its probability. Then I sample the next bigram starting from that word, according to its probability, and keep doing that until I hit the end-of-sentence marker. So for example, here I'm generating a sentence: it starts with "I," then followed by "want," followed by "to," followed by "get," followed by "Chinese," followed by "food," followed by the end of sentence. So I've just generated, "I want to get Chinese food," which sounds like a perfectly good sentence.

So here's what's interesting. If you look back again at the Shakespeare corpus and say, what if we generated Shakespeare from unigrams, you get stuff like, at the top, "To him swallowed confess here both. Which. Of save on trail for are ay device and rote life have." It doesn't sound terribly good. It's not very grammatical. It doesn't have that sort of Shakespearean English flavor, although you do have words like "nave" and "ay" and so on that are vaguely reminiscent. Now, if you go to bigrams, it starts to sound a little better. "What means, sir. I confess she? Then all sorts, he is trim, captain." That doesn't make any sense, but it starts to sound a little better. And with trigrams, we get, "Sweet prince, Falstaff shall die. Harry of Monmouth," et cetera. So this is beginning to sound a little Shakespearean. And if you go to quadrigrams, you get, "King Henry. What? I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv'd in," et cetera.
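Here is a minimal sketch of that sampling procedure over bigram counts; the toy corpus is made up, but the same loop run over Shakespeare or the Google counts produces text like the examples shown here.

```python
# A minimal bigram text generator: estimate bigram counts from a toy corpus,
# then repeatedly sample the next word given the current one until the
# end-of-sentence marker appears. The corpus is made up for illustration.
import random
from collections import defaultdict, Counter

corpus = [
    "<s> i want to get chinese food </s>",
    "<s> i want to eat lunch </s>",
    "<s> i like chinese food </s>",
]

bigrams = defaultdict(Counter)
for sent in corpus:
    tokens = sent.split()
    for a, b in zip(tokens, tokens[1:]):
        bigrams[a][b] += 1

def generate(max_len=20):
    word, out = "<s>", []
    for _ in range(max_len):
        nxt = bigrams[word]
        word = random.choices(list(nxt), weights=nxt.values())[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(generate())
```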
When I first saw this, like 20 years ago or something, I was stunned. This is actually generating stuff that sounds vaguely Shakespearean and vaguely English-like.

Here's an example of generating the Wall Street Journal. From unigrams, "Months the my and issue of year foreign new exchanges September were recession." It's word salad. But if you go to trigrams, "They also point to ninety nine point six billion from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil." So you could imagine that this is some Wall Street Journal writer on acid writing this text, because it has a little bit of the right kind of flavor.

So more recently, people said, well, we ought to be able to make use of this in some systematic way to help us with our language analysis tasks. To me, the first effort in this direction was Word2Vec, which was Mikolov's approach to doing this. And he developed two models. He said, let's build a continuous bag-of-words model, which says that what we're going to use is co-occurrence data on a series of tokens in the text that we're trying to model. And we're going to use a neural network model to predict the word from the words around it. And in that process, we're going to use the parameters of that neural network model as a vector, and that vector will be the representation of that word. And so what we're going to find is that words that tend to appear in the same context will have similar representations in this high-dimensional vector space. And by the way, high-dimensional: people typically use something like 300- or 500-dimensional vectors. So it's a big space, and the words are scattered throughout it. But you get this kind of cohesion, where words that are used in the same context appear close to each other.
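Here is a minimal sketch of training such embeddings with the gensim library; the toy sentences stand in for a large corpus, and the sg flag switches between the continuous bag-of-words model and the skip-gram model described next.

```python
# A minimal sketch of training Word2Vec embeddings with gensim. The toy
# sentences stand in for a large corpus of notes or articles.
from gensim.models import Word2Vec

sentences = [
    "the patient was started on insulin for diabetes".split(),
    "metformin was added for glycemic control".split(),
    "the patient reports chest pain on exertion".split(),
]

model = Word2Vec(
    sentences,
    vector_size=300,   # 300- or 500-dimensional vectors are typical
    window=5,          # context words on each side
    sg=0,              # 0 = continuous bag-of-words, 1 = skip-gram
    min_count=1,
    epochs=50,
)

vec = model.wv["insulin"]                    # the learned 300-d representation
print(model.wv.most_similar("insulin", topn=3))
```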
772 00:44:55,430 --> 00:44:58,250 And the extrapolation of that is that if words 773 00:44:58,250 --> 00:45:00,500 are used in the same context, maybe 774 00:45:00,500 --> 00:45:03,890 they share something about meaning. 775 00:45:03,890 --> 00:45:06,405 So the other model is a skip-gram model, 776 00:45:06,405 --> 00:45:07,780 where you're doing the prediction 777 00:45:07,780 --> 00:45:08,920 in the other direction. 778 00:45:08,920 --> 00:45:13,300 From a word, you're predicting the words that are around it. 779 00:45:13,300 --> 00:45:16,330 And again, you are using a neural network model 780 00:45:16,330 --> 00:45:17,950 to do that. 781 00:45:17,950 --> 00:45:20,800 And you use the parameters of that model 782 00:45:20,800 --> 00:45:27,050 in order to represent the word that you're focused on. 783 00:45:27,050 --> 00:45:31,240 So what came as a surprise to me is this claim that's 784 00:45:31,240 --> 00:45:35,830 in his original paper, which is that not only do you 785 00:45:35,830 --> 00:45:43,030 get this effect of locality as corresponding meaning 786 00:45:43,030 --> 00:45:46,630 but that you get relationships that are geometrically 787 00:45:46,630 --> 00:45:50,770 represented in the space of these embeddings. 788 00:45:50,770 --> 00:45:53,980 And so what you see is that if you 789 00:45:53,980 --> 00:45:58,510 take the encoding of the word man and the word woman 790 00:45:58,510 --> 00:46:01,450 and look at the vector difference between them, 791 00:46:01,450 --> 00:46:05,530 and then apply that same vector difference to king, 792 00:46:05,530 --> 00:46:07,570 you get close to queen. 793 00:46:07,570 --> 00:46:11,410 And if you apply it uncle, you get close to aunt. 794 00:46:11,410 --> 00:46:13,630 And so they showed a number of examples. 795 00:46:13,630 --> 00:46:15,520 And then people have studied this. 796 00:46:15,520 --> 00:46:17,500 It doesn't hold it perfectly well. 797 00:46:17,500 --> 00:46:21,010 I mean, it's not like we've solved the semantics problem. 798 00:46:21,010 --> 00:46:24,040 But it is a genuine relationship. 799 00:46:24,040 --> 00:46:25,930 The place where it doesn't work well 800 00:46:25,930 --> 00:46:30,460 is when some of these things are much more frequent than others. 801 00:46:30,460 --> 00:46:33,970 And so one of the examples that's often cited 802 00:46:33,970 --> 00:46:41,420 is if you go, London is to England as Paris is to France, 803 00:46:41,420 --> 00:46:43,040 and that one works. 804 00:46:43,040 --> 00:46:47,950 But then you say as Kuala Lumpur is to Malaysia, 805 00:46:47,950 --> 00:46:50,500 and that one doesn't work so well. 806 00:46:50,500 --> 00:46:57,310 And then you go, as Juba or something 807 00:46:57,310 --> 00:47:01,090 is to whatever country it's the capital of. 808 00:47:01,090 --> 00:47:05,140 And since we don't write about Africa in our newspapers, 809 00:47:05,140 --> 00:47:07,040 there's very little data on that. 810 00:47:07,040 --> 00:47:10,420 And so that doesn't work so well. 811 00:47:10,420 --> 00:47:13,150 So there was this other paper later 812 00:47:13,150 --> 00:47:16,960 from van der Maaten and Geoff Hinton, 813 00:47:16,960 --> 00:47:19,930 where they came up with a visualization method 814 00:47:19,930 --> 00:47:22,180 to take these high-dimensional vectors 815 00:47:22,180 --> 00:47:25,090 and visualize them in two dimensions. 
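To make that analogy arithmetic concrete, here is a minimal sketch using the gensim library. This is not Mikolov's code; the corpus loader is a hypothetical helper, the hyperparameters are placeholders, and gensim's 4.x API is assumed.

```python
from gensim.models import Word2Vec

# `sentences` is assumed to be an iterable of tokenized sentences,
# e.g. [["the", "king", "spoke"], ["the", "queen", "smiled"], ...].
sentences = load_tokenized_corpus()   # hypothetical helper, not a real library call

model = Word2Vec(
    sentences,
    vector_size=300,   # 300- or 500-dimensional vectors, as mentioned above
    window=5,          # context window on each side of the target word
    sg=1,              # 1 = skip-gram (predict context from word); 0 = CBOW
    min_count=5,
)

# The king - man + woman ~= queen arithmetic:
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```

As noted, this works best for frequent pairs like London/England and gets noisier for rarer ones like Kuala Lumpur/Malaysia. The van der Maaten and Hinton visualization method mentioned above (t-SNE) is what produces the two-dimensional maps described next.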
816 00:47:25,090 --> 00:47:28,750 And what you see is that if you take a bunch of concepts 817 00:47:28,750 --> 00:47:30,520 that are count concepts-- 818 00:47:30,520 --> 00:47:36,490 so 1/2, 30, 15, 5, 4, 2, 3, several, some, many, 819 00:47:36,490 --> 00:47:38,530 et cetera-- 820 00:47:38,530 --> 00:47:41,450 there is a geometric relationship between them. 821 00:47:41,450 --> 00:47:45,380 So they, in fact, do map to the same part of the space. 822 00:47:45,380 --> 00:47:48,970 Similarly, minister, leader, president, chairman, director, 823 00:47:48,970 --> 00:47:51,580 spokesman, chief, head, et cetera 824 00:47:51,580 --> 00:47:54,420 form a kind of cluster in the space. 825 00:47:54,420 --> 00:47:58,540 So there's definitely something to this. 826 00:47:58,540 --> 00:48:04,120 I promised you that I would get back to a different attempt 827 00:48:04,120 --> 00:48:06,880 to try to take a core of concepts 828 00:48:06,880 --> 00:48:09,640 that you want to use for term-spotting 829 00:48:09,640 --> 00:48:13,780 and develop an automated way of enlarging that set of concepts 830 00:48:13,780 --> 00:48:17,080 in order to give you a richer vocabulary by which 831 00:48:17,080 --> 00:48:20,480 to try to identify cases that you're interested in. 832 00:48:20,480 --> 00:48:23,480 So this was by some of my colleagues, 833 00:48:23,480 --> 00:48:27,310 including Kat, who you saw on Tuesday. 834 00:48:27,310 --> 00:48:32,800 And they said, well, what we'd like 835 00:48:32,800 --> 00:48:35,770 is a fully automated and robust, unsupervised feature 836 00:48:35,770 --> 00:48:38,860 selection method that leverages only publicly 837 00:48:38,860 --> 00:48:42,910 available medical knowledge sources instead of EHR data. 838 00:48:42,910 --> 00:48:46,690 So the method that David's group had developed, 839 00:48:46,690 --> 00:48:49,870 which we talked about earlier, uses data 840 00:48:49,870 --> 00:48:51,790 from electronic health records, which 841 00:48:51,790 --> 00:48:54,520 means that you move to different hospitals 842 00:48:54,520 --> 00:48:56,690 and there may be different conventions. 843 00:48:56,690 --> 00:48:58,390 And you might imagine that you have 844 00:48:58,390 --> 00:49:03,880 to retrain that sort of method, whereas here the idea is 845 00:49:03,880 --> 00:49:06,910 to derive these surrogate features from knowledge 846 00:49:06,910 --> 00:49:08,110 sources. 847 00:49:08,110 --> 00:49:13,330 So unlike that earlier model, here they built a Word2Vec 848 00:49:13,330 --> 00:49:17,620 skip-gram model from about 5 million Springer articles-- 849 00:49:17,620 --> 00:49:21,610 so these are published medical articles-- 850 00:49:21,610 --> 00:49:25,420 to yield 500 dimensional vectors for each word. 851 00:49:25,420 --> 00:49:29,800 And then what they did is they took the concept names 852 00:49:29,800 --> 00:49:33,130 that they were interested in and their definitions 853 00:49:33,130 --> 00:49:38,580 from the UMLS, and then they summed 854 00:49:38,580 --> 00:49:42,390 the word vectors for each of these words, weighted 855 00:49:42,390 --> 00:49:44,650 by inverse document frequency. 856 00:49:44,650 --> 00:49:48,485 So it's sort of a TF-IDF-like approach 857 00:49:48,485 --> 00:49:51,240 to weight different words.
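A sketch of that step: embed a phenotype by an IDF-weighted sum of the word vectors in its UMLS name and definition, then, as described next, rank candidate concepts by cosine similarity. The function and variable names and the dimensions here are illustrative, not the paper's code.

```python
import numpy as np

def idf_weighted_embedding(tokens, word_vectors, idf, dim=500):
    """Sum the vectors of the words in a concept name/definition,
    weighting each word by its inverse document frequency."""
    vec = np.zeros(dim)
    for tok in tokens:
        if tok in word_vectors:
            vec += idf.get(tok, 1.0) * word_vectors[tok]
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def top_k_related(phenotype_vec, candidate_vecs, k=10):
    """Rank candidate concepts by cosine similarity to the phenotype embedding.
    Vectors are assumed L2-normalized, so the dot product is the cosine."""
    scored = [(name, float(phenotype_vec @ v)) for name, v in candidate_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# word_vectors: dict mapping word -> 500-d numpy array from the skip-gram model
# idf: dict mapping word -> inverse document frequency
# candidate_vecs: dict mapping candidate concept name -> normalized embedding
```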
858 00:49:51,240 --> 00:49:53,700 And then they went out and they said, OK, 859 00:49:53,700 --> 00:49:56,610 for every disease that's mentioned in Wikipedia, 860 00:49:56,610 --> 00:49:59,760 Medscape, eMedicine, the Merck Manuals Professional 861 00:49:59,760 --> 00:50:03,390 Edition, the Mayo Clinic Diseases and Conditions, 862 00:50:03,390 --> 00:50:06,120 MedlinePlus Medical Encyclopedia, 863 00:50:06,120 --> 00:50:09,330 they used named entity recognition techniques 864 00:50:09,330 --> 00:50:15,550 to find all the concepts that are related to this phenotype. 865 00:50:15,550 --> 00:50:19,080 So then they said, well, there's a lot of randomness 866 00:50:19,080 --> 00:50:22,840 in these sources, and maybe in our extraction techniques. 867 00:50:22,840 --> 00:50:25,320 But if we insist that some concept appear 868 00:50:25,320 --> 00:50:28,810 in at least three of these five sources, 869 00:50:28,810 --> 00:50:32,400 then we can be pretty confident that it's a relevant concept. 870 00:50:32,400 --> 00:50:34,480 And so they said, OK, we'll do that. 871 00:50:34,480 --> 00:50:37,130 Then they chose the top k concepts 872 00:50:37,130 --> 00:50:41,190 whose embedding vectors are closest by cosine distance 873 00:50:41,190 --> 00:50:43,020 to the embedding of this phenotype 874 00:50:43,020 --> 00:50:44,850 that they've calculated. 875 00:50:44,850 --> 00:50:47,280 And they say, OK, the phenotype is 876 00:50:47,280 --> 00:50:51,970 going to be a linear combination of all these related concepts. 877 00:50:51,970 --> 00:50:55,840 So again, this is a bit similar to what we saw before. 878 00:50:55,840 --> 00:50:58,110 But here, instead of extracting the data 879 00:50:58,110 --> 00:51:01,110 from electronic medical records, they're 880 00:51:01,110 --> 00:51:04,680 extracting it from published literature and these web 881 00:51:04,680 --> 00:51:07,260 sources. 882 00:51:07,260 --> 00:51:16,230 And again, what you see is that the expert-curated features 883 00:51:16,230 --> 00:51:22,050 for these five phenotypes, which are coronary artery 884 00:51:22,050 --> 00:51:24,180 disease, rheumatoid arthritis, Crohn's 885 00:51:24,180 --> 00:51:29,070 disease, ulcerative colitis, and pediatric pulmonary arterial 886 00:51:29,070 --> 00:51:37,260 hypertension, they started with 20 to 50 curated features. 887 00:51:37,260 --> 00:51:39,150 So these were the ones that the doctors 888 00:51:39,150 --> 00:51:44,610 said, OK, these are the anchors in David's terminology. 889 00:51:44,610 --> 00:51:51,090 And then they expanded these to a larger set 890 00:51:51,090 --> 00:51:56,850 using the technique that I just described, and then selected 891 00:51:56,850 --> 00:52:04,515 down to the top n that were effective in finding 892 00:52:04,515 --> 00:52:06,360 relevant phenotypes. 893 00:52:06,360 --> 00:52:13,140 And this is a terrible graph that summarizes the results. 894 00:52:13,140 --> 00:52:19,590 But what you're seeing is that the orange lines are based 895 00:52:19,590 --> 00:52:22,830 on the expert-curated features. 896 00:52:22,830 --> 00:52:28,920 This is based on an earlier version of trying to do this. 897 00:52:28,920 --> 00:52:33,000 And SEDFE is the technique that I've just described. 898 00:52:33,000 --> 00:52:37,410 And what you see is that the automatic techniques 899 00:52:37,410 --> 00:52:42,000 for many of these phenotypes are just about as good 900 00:52:42,000 --> 00:52:44,760 as the manually curated ones. 
901 00:52:44,760 --> 00:52:47,640 And of course, they require much less manual curation. 902 00:52:47,640 --> 00:52:52,980 Because they're using this automatic learning approach. 903 00:52:52,980 --> 00:52:56,100 Another interesting example to return 904 00:52:56,100 --> 00:52:58,770 to the theme of de-identification 905 00:52:58,770 --> 00:53:02,380 is a couple of my students, a few years ago, 906 00:53:02,380 --> 00:53:06,150 built a new de-identifier that has this rather 907 00:53:06,150 --> 00:53:08,280 complicated architecture. 908 00:53:08,280 --> 00:53:13,680 So it starts with a bi-directional recursive neural 909 00:53:13,680 --> 00:53:18,330 network model that is implemented 910 00:53:18,330 --> 00:53:23,280 over the character sequences of words in the medical text. 911 00:53:23,280 --> 00:53:25,920 So why character sequences? 912 00:53:25,920 --> 00:53:27,841 Why might those be important? 913 00:53:33,140 --> 00:53:38,090 Well, consider a misspelled word, for example. 914 00:53:38,090 --> 00:53:41,120 Most of the character sequence is correct. 915 00:53:41,120 --> 00:53:44,600 There will be a bug in it at the misspelling. 916 00:53:44,600 --> 00:53:47,540 Or consider that a lot of medical terms 917 00:53:47,540 --> 00:53:50,060 are these compound terms, where they're 918 00:53:50,060 --> 00:53:53,120 made up of lots of pieces that correspond 919 00:53:53,120 --> 00:53:56,360 to Greek or Latin roots. 920 00:53:56,360 --> 00:54:00,440 So learning those can actually be very helpful. 921 00:54:00,440 --> 00:54:02,990 So you start with that model. 922 00:54:02,990 --> 00:54:06,110 You then could concatenate the results 923 00:54:06,110 --> 00:54:10,250 from both the left-running and the right-running recursive 924 00:54:10,250 --> 00:54:12,140 neural network. 925 00:54:12,140 --> 00:54:18,095 And concatenate that with the Word2Vec embedding 926 00:54:18,095 --> 00:54:20,850 of the whole word. 927 00:54:20,850 --> 00:54:26,490 And you feed that into another bi-directional RNN layer. 928 00:54:26,490 --> 00:54:33,050 And then for each word, you take the output of those RNNs, 929 00:54:33,050 --> 00:54:36,650 run them through a feed-forward neural network in order 930 00:54:36,650 --> 00:54:38,940 to estimate the prob-- 931 00:54:38,940 --> 00:54:40,310 it's like a soft max. 932 00:54:40,310 --> 00:54:44,900 And you estimate the probability of this word belonging 933 00:54:44,900 --> 00:54:49,280 to a particular category of personally identifiable health 934 00:54:49,280 --> 00:54:50,300 information. 935 00:54:50,300 --> 00:54:51,440 So is it a name? 936 00:54:51,440 --> 00:54:52,520 Is it an address? 937 00:54:52,520 --> 00:54:53,570 Is it a phone number? 938 00:54:53,570 --> 00:54:56,150 Is it or whatever? 939 00:54:56,150 --> 00:54:59,480 And then the top layer is a kind of conditional random 940 00:54:59,480 --> 00:55:04,970 field-like layer that imposes a sequential probability 941 00:55:04,970 --> 00:55:10,490 distribution that says, OK, if you've seen a name, then 942 00:55:10,490 --> 00:55:14,220 what's the next most likely thing that you're going to see? 943 00:55:14,220 --> 00:55:19,220 And so you combine that with the probability distributions 944 00:55:19,220 --> 00:55:24,920 for each word in order to identify the category of PHI 945 00:55:24,920 --> 00:55:28,860 or non-PHI for that word. 946 00:55:28,860 --> 00:55:31,400 And this did insanely well. 
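To make the data flow of that architecture concrete, here is a rough PyTorch sketch. The dimensions and names are placeholders, an LSTM stands in for the recurrent units, and the CRF-like top layer is only indicated in a comment; the students' actual implementation differs.

```python
import torch
import torch.nn as nn

class DeidTagger(nn.Module):
    """Character BiLSTM -> concatenate with word embedding -> word BiLSTM
    -> per-word feed-forward scores over PHI categories. A CRF-style layer
    (omitted here) would sit on top to model label-sequence dependencies."""

    def __init__(self, n_chars, n_words, n_labels,
                 char_dim=25, word_dim=300, hidden=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_rnn = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(n_words, word_dim)  # e.g. initialized from Word2Vec
        self.word_rnn = nn.LSTM(word_dim + 2 * hidden, hidden,
                                bidirectional=True, batch_first=True)
        self.scorer = nn.Linear(2 * hidden, n_labels)    # name, address, phone, non-PHI, ...

    def forward(self, char_ids, word_ids):
        # char_ids: (n_tokens, max_chars); word_ids: (n_tokens,) for one sentence
        _, (h, _) = self.char_rnn(self.char_emb(char_ids))
        char_feat = torch.cat([h[0], h[1]], dim=-1)       # final states, both directions
        word_feat = torch.cat([self.word_emb(word_ids), char_feat], dim=-1)
        ctx, _ = self.word_rnn(word_feat.unsqueeze(0))    # add a batch dimension
        return self.scorer(ctx.squeeze(0))                # per-token label scores
```

The numbers below are what "insanely well" means in practice.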
947 00:55:31,400 --> 00:55:41,000 So optimized by F1 score, we're up at a precision of 99.2%, 948 00:55:41,000 --> 00:55:44,270 recall of 99.3%. 949 00:55:44,270 --> 00:55:51,290 Optimized by recall, we're up at about 98%, 99% 950 00:55:51,290 --> 00:55:53,240 for each of them. 951 00:55:53,240 --> 00:55:55,370 So this is doing quite well. 952 00:55:55,370 --> 00:56:00,030 Now, there is a non-machine learning comment to make, 953 00:56:00,030 --> 00:56:02,570 which is that if you read the HIPAA law, the HIPAA 954 00:56:02,570 --> 00:56:05,660 regulations, they don't say that you 955 00:56:05,660 --> 00:56:10,400 must get rid of 99% of the personally 956 00:56:10,400 --> 00:56:13,760 identifying information in order to be able to share 957 00:56:13,760 --> 00:56:15,500 this data for research. 958 00:56:15,500 --> 00:56:18,761 It says you have to get rid of all of it. 959 00:56:18,761 --> 00:56:23,770 So no technique we know is 100% perfect. 960 00:56:23,770 --> 00:56:27,840 And so there's a kind of practical understanding 961 00:56:27,840 --> 00:56:30,240 among people who work on this stuff 962 00:56:30,240 --> 00:56:32,850 that nothing's going to be perfect. 963 00:56:32,850 --> 00:56:36,990 And therefore, that you can get away with a little bit. 964 00:56:36,990 --> 00:56:42,300 But legally, you're on thin ice. 965 00:56:42,300 --> 00:56:46,590 So I remember many years ago, my wife was in law school. 966 00:56:46,590 --> 00:56:51,600 And I asked her at one point, so what can people sue you for? 967 00:56:51,600 --> 00:56:55,640 And she said, absolutely anything. 968 00:56:55,640 --> 00:56:57,430 They may not win. 969 00:56:57,430 --> 00:57:00,180 But they can be a real pain if you have 970 00:57:00,180 --> 00:57:02,460 to go defend yourself in court. 971 00:57:02,460 --> 00:57:04,750 And so this hasn't played out yet. 972 00:57:04,750 --> 00:57:08,910 We don't know if a de-identifier that 973 00:57:08,910 --> 00:57:13,050 is 99% sensitive and 99% specific 974 00:57:13,050 --> 00:57:17,730 will pass muster with people who agree to release data sets. 975 00:57:17,730 --> 00:57:21,000 Because they're worried, too, about winding up 976 00:57:21,000 --> 00:57:23,700 in the newspaper or winding up getting sued. 977 00:57:26,910 --> 00:57:28,810 Last topic for today-- 978 00:57:28,810 --> 00:57:34,980 so if you read this interesting blog, which, by the way, 979 00:57:34,980 --> 00:57:39,870 has a very good tutorial on BERT, 980 00:57:39,870 --> 00:57:43,290 he says, "The year 2018 has been an inflection point for machine 981 00:57:43,290 --> 00:57:47,850 learning models handling text, or more accurately, NLP. 982 00:57:47,850 --> 00:57:49,680 Our conceptual understanding of how 983 00:57:49,680 --> 00:57:52,770 best to represent words and sentences in a way 984 00:57:52,770 --> 00:57:55,710 that best captures underlying meanings and relationships 985 00:57:55,710 --> 00:57:57,760 is rapidly evolving." 986 00:57:57,760 --> 00:58:00,330 And so there are a whole bunch of new ideas 987 00:58:00,330 --> 00:58:05,530 that have come about in about the last year or two years, 988 00:58:05,530 --> 00:58:10,410 including ELMo, which learns context-specific embeddings, 989 00:58:10,410 --> 00:58:13,920 the Transformer architecture, this BERT approach. 
990 00:58:13,920 --> 00:58:19,470 And then I'll end with just showing you this gigantic GPT 991 00:58:19,470 --> 00:58:24,060 model that was developed by the OpenAI people, which 992 00:58:24,060 --> 00:58:27,360 does remarkably better than the stuff I showed you 993 00:58:27,360 --> 00:58:31,690 before in generating language. 994 00:58:31,690 --> 00:58:33,160 All right. 995 00:58:33,160 --> 00:58:36,010 If you look inside Google Translate, 996 00:58:36,010 --> 00:58:40,180 at least as of not long ago, what you find 997 00:58:40,180 --> 00:58:43,260 is a model like this. 998 00:58:43,260 --> 00:58:49,470 So it's essentially an LSTM model that takes input words 999 00:58:49,470 --> 00:58:53,970 and munges them together into some representation, 1000 00:58:53,970 --> 00:58:58,980 a high-dimensional vector representation, that summarizes 1001 00:58:58,980 --> 00:59:03,270 everything that the model knows about that sentence 1002 00:59:03,270 --> 00:59:06,330 that you've just fed it. 1003 00:59:06,330 --> 00:59:08,550 Obviously, it has to be a pretty high-dimensional 1004 00:59:08,550 --> 00:59:12,120 representation, because your sentence could be about almost 1005 00:59:12,120 --> 00:59:13,690 anything. 1006 00:59:13,690 --> 00:59:17,520 And so it's important to be able to capture all 1007 00:59:17,520 --> 00:59:19,980 that in this representation. 1008 00:59:19,980 --> 00:59:22,170 But basically, at this point, you 1009 00:59:22,170 --> 00:59:24,340 start generating the output. 1010 00:59:24,340 --> 00:59:27,130 So if you're translating English to French, 1011 00:59:27,130 --> 00:59:29,310 these are English words coming in, 1012 00:59:29,310 --> 00:59:32,670 and these are French words going out, in sort of the way 1013 00:59:32,670 --> 00:59:35,190 I showed you, where we're generating Shakespeare 1014 00:59:35,190 --> 00:59:39,030 or we're generating Wall Street Journal text. 1015 00:59:41,910 --> 00:59:45,780 But the critical feature here is that in the initial version 1016 00:59:45,780 --> 00:59:48,210 of this, everything that you learned 1017 00:59:48,210 --> 00:59:51,870 about this English sentence had to be encoded in this one 1018 00:59:51,870 --> 00:59:58,150 vector that got passed from the encoder into the decoder, 1019 00:59:58,150 --> 01:00:03,720 or from the source language into the target language generator. 1020 01:00:03,720 --> 01:00:06,930 So then someone came along and said, hmm-- 1021 01:00:06,930 --> 01:00:11,470 someone, namely these guys, came along and said, 1022 01:00:11,470 --> 01:00:13,440 wouldn't it be nice if we could provide 1023 01:00:13,440 --> 01:00:17,430 some auxiliary information to the generator that said, 1024 01:00:17,430 --> 01:00:19,980 hey, which part of the input sentence 1025 01:00:19,980 --> 01:00:23,120 should you pay attention to? 1026 01:00:23,120 --> 01:00:25,790 And of course, there's no fixed answer to that. 1027 01:00:25,790 --> 01:00:29,180 I mean, if I'm translating an arbitrary English sentence 1028 01:00:29,180 --> 01:00:32,840 into an arbitrary French sentence, I can't say, 1029 01:00:32,840 --> 01:00:36,770 in general, look at the third word in the English sentence 1030 01:00:36,770 --> 01:00:39,680 when you're generating the third word in the French sentence. 1031 01:00:39,680 --> 01:00:43,040 Because that may or may not be true, depending 1032 01:00:43,040 --> 01:00:44,780 on the particular sentence. 
1033 01:00:44,780 --> 01:00:46,520 But on the other hand, the intuition 1034 01:00:46,520 --> 01:00:50,060 is that there is such a positional dependence 1035 01:00:50,060 --> 01:00:56,030 and a dependence on what the particular English word was 1036 01:00:56,030 --> 01:01:00,330 that is an important component of generating the French word. 1037 01:01:00,330 --> 01:01:04,190 And so they created this idea that in addition 1038 01:01:04,190 --> 01:01:10,340 to passing along the this vector that 1039 01:01:10,340 --> 01:01:13,490 encodes the meaning of the entire input 1040 01:01:13,490 --> 01:01:18,680 and the previous word that you had generated in the output, 1041 01:01:18,680 --> 01:01:23,730 in addition, we pass along this other information that says, 1042 01:01:23,730 --> 01:01:27,320 which of the input words should we pay attention to? 1043 01:01:27,320 --> 01:01:30,110 And how much attention should we pay to them? 1044 01:01:30,110 --> 01:01:34,520 And of course, in the style of these embeddings, 1045 01:01:34,520 --> 01:01:37,520 these are all represented by high-dimensional vectors, 1046 01:01:37,520 --> 01:01:41,540 high-dimensional real number vectors that 1047 01:01:41,540 --> 01:01:44,030 get combined with the other vectors 1048 01:01:44,030 --> 01:01:46,880 in order to produce the output. 1049 01:01:46,880 --> 01:01:53,660 Now, a classical linguist would look at this and retch. 1050 01:01:53,660 --> 01:01:57,980 Because this looks nothing like classical linguistics. 1051 01:01:57,980 --> 01:02:04,160 It's just numerology that gets trained by stochastic gradient 1052 01:02:04,160 --> 01:02:08,240 descent methods in order to optimize the output. 1053 01:02:08,240 --> 01:02:12,990 But from an engineering point of view, it works quite well. 1054 01:02:12,990 --> 01:02:16,700 So then for a while, that was the state of the art. 1055 01:02:16,700 --> 01:02:22,640 And then last year, these guys, Vaswani et al. 1056 01:02:22,640 --> 01:02:27,920 came along and said, you know, we now 1057 01:02:27,920 --> 01:02:30,020 have this complicated architecture, 1058 01:02:30,020 --> 01:02:34,490 where we are doing the old-style translation where 1059 01:02:34,490 --> 01:02:37,250 we summarize everything into one vector, 1060 01:02:37,250 --> 01:02:41,690 and then use that to generate a sequence of outputs. 1061 01:02:41,690 --> 01:02:43,850 And we have this attention mechanism 1062 01:02:43,850 --> 01:02:47,450 that tells us how much of various inputs 1063 01:02:47,450 --> 01:02:52,040 to use in generating each element of the output. 1064 01:02:52,040 --> 01:02:55,050 Is the first of those actually necessary? 1065 01:02:55,050 --> 01:02:58,040 And so they published this lovely paper saying attention 1066 01:02:58,040 --> 01:03:00,740 is all you need, that says, hey, you 1067 01:03:00,740 --> 01:03:04,280 know that thing that you guys have added to this translation 1068 01:03:04,280 --> 01:03:05,720 model. 1069 01:03:05,720 --> 01:03:07,790 Not only is it a useful addition, 1070 01:03:07,790 --> 01:03:12,770 but in fact, it can take the place of the original model. 1071 01:03:12,770 --> 01:03:16,340 And so the Transformer is an architecture that 1072 01:03:16,340 --> 01:03:19,280 is the hottest thing since sliced bread 1073 01:03:19,280 --> 01:03:23,940 at the moment, that says, OK, here's what we do. 1074 01:03:23,940 --> 01:03:25,580 We take the inputs. 1075 01:03:25,580 --> 01:03:29,400 We calculate some embedding for them. 
1076 01:03:29,400 --> 01:03:31,460 We then want to retain the position, 1077 01:03:31,460 --> 01:03:35,380 because of course, the sequence in which the words appear, 1078 01:03:35,380 --> 01:03:36,890 it matters. 1079 01:03:36,890 --> 01:03:39,590 And the positional encoding is this weird thing 1080 01:03:39,590 --> 01:03:44,230 where it encodes using sine waves so that-- 1081 01:03:44,230 --> 01:03:46,700 it's an orthogonal basis. 1082 01:03:46,700 --> 01:03:49,460 And so it has nice characteristics. 1083 01:03:49,460 --> 01:03:52,370 And then we run it into an attention model 1084 01:03:52,370 --> 01:03:54,890 that is essentially computing self-attention. 1085 01:03:54,890 --> 01:03:58,145 So it's saying what-- 1086 01:03:58,145 --> 01:04:02,870 it's like Word2Vec, except in a more sophisticated way. 1087 01:04:02,870 --> 01:04:06,260 So it's looking at all the words in the sentence 1088 01:04:06,260 --> 01:04:11,270 and saying, which words is this word most related to? 1089 01:04:13,890 --> 01:04:17,580 And then, in order to complicate it some more, 1090 01:04:17,580 --> 01:04:20,280 they say, well, we don't want just a single notion 1091 01:04:20,280 --> 01:04:21,420 of attention. 1092 01:04:21,420 --> 01:04:25,210 We want multiple notions of attention. 1093 01:04:25,210 --> 01:04:27,240 So what does that sound like? 1094 01:04:27,240 --> 01:04:30,510 Well, to me, it sounds a bit like what 1095 01:04:30,510 --> 01:04:34,230 you see in convolutional neural networks, 1096 01:04:34,230 --> 01:04:39,270 where often when you're processing an image with a CNN, 1097 01:04:39,270 --> 01:04:42,240 you're not only applying one filter to the image 1098 01:04:42,240 --> 01:04:45,540 but you're applying a whole bunch of different filters. 1099 01:04:45,540 --> 01:04:47,820 And because you initialize them randomly, 1100 01:04:47,820 --> 01:04:50,520 you hope that they will converge to things 1101 01:04:50,520 --> 01:04:55,370 that actually detect different interesting properties 1102 01:04:55,370 --> 01:04:56,920 of the image. 1103 01:04:56,920 --> 01:04:58,710 So the same idea here-- 1104 01:04:58,710 --> 01:05:00,210 that what they're doing is they're 1105 01:05:00,210 --> 01:05:06,330 starting with a bunch of these attention matrices and saying, 1106 01:05:06,330 --> 01:05:07,980 we initialize them randomly. 1107 01:05:07,980 --> 01:05:10,260 They will evolve into something that 1108 01:05:10,260 --> 01:05:14,860 is most useful for helping us deal with the overall problem. 1109 01:05:14,860 --> 01:05:17,400 So then they run this through a series 1110 01:05:17,400 --> 01:05:22,290 of, I think, in Vaswani's paper, something like six layers that 1111 01:05:22,290 --> 01:05:24,300 are just replicated. 1112 01:05:24,300 --> 01:05:30,510 And there are additional things like feeding forward the input 1113 01:05:30,510 --> 01:05:36,240 signal in order to add it to the output signal of the stage, 1114 01:05:36,240 --> 01:05:39,750 and then normalizing, and then rerunning it, 1115 01:05:39,750 --> 01:05:42,900 and then running it through a feed-forward network that 1116 01:05:42,900 --> 01:05:47,550 also has a bypass that combines the input with the output 1117 01:05:47,550 --> 01:05:49,500 of the feed-forward network. 1118 01:05:49,500 --> 01:05:52,890 And then you do this six times, or n times. 1119 01:05:52,890 --> 01:05:57,260 And that then feeds into the generator. 
1120 01:05:57,260 --> 01:06:02,390 And the generator then uses a very similar architecture 1121 01:06:02,390 --> 01:06:04,820 to calculate output probabilities, 1122 01:06:04,820 --> 01:06:09,330 And then it samples from those in order to generate the text. 1123 01:06:09,330 --> 01:06:12,230 So this is sort of the contemporary way 1124 01:06:12,230 --> 01:06:16,190 that one can do translation, using this approach. 1125 01:06:16,190 --> 01:06:19,780 Obviously, I don't have time to go into all the details of how 1126 01:06:19,780 --> 01:06:21,440 all this is done. 1127 01:06:21,440 --> 01:06:23,960 And I'd probably do it wrong anyway. 1128 01:06:23,960 --> 01:06:27,710 But you can look at the paper, which gives a good explanation. 1129 01:06:27,710 --> 01:06:30,590 And that blog that I pointed to also has 1130 01:06:30,590 --> 01:06:34,670 a pointer to another blog post by the same guy 1131 01:06:34,670 --> 01:06:39,800 that does a pretty good job of explaining the Transformer 1132 01:06:39,800 --> 01:06:41,330 architecture. 1133 01:06:41,330 --> 01:06:43,680 It's complicated. 1134 01:06:43,680 --> 01:06:48,200 So what you get out of the multi-head attention mechanism 1135 01:06:48,200 --> 01:06:49,310 is that-- 1136 01:06:49,310 --> 01:06:53,700 here is one attention machine. 1137 01:06:53,700 --> 01:06:58,190 And for example, the colors here indicate the degree 1138 01:06:58,190 --> 01:07:01,850 to which the encoding of the word "it" 1139 01:07:01,850 --> 01:07:05,300 depends on the other words in the sentence. 1140 01:07:05,300 --> 01:07:09,860 And you see that it's focused on the animal, which makes sense. 1141 01:07:09,860 --> 01:07:14,215 Because "it," in fact, is referring 1142 01:07:14,215 --> 01:07:17,210 to the animal in this sentence. 1143 01:07:17,210 --> 01:07:21,020 Here they introduce another encoding. 1144 01:07:21,020 --> 01:07:26,210 And this one focuses on "was too tired," which is also good. 1145 01:07:26,210 --> 01:07:32,490 Because "it," again, refers to the thing that was too tired. 1146 01:07:32,490 --> 01:07:34,560 And of course, by multi-headed, they 1147 01:07:34,560 --> 01:07:37,440 mean that it's doing this many times. 1148 01:07:37,440 --> 01:07:40,200 And so you're identifying all kinds 1149 01:07:40,200 --> 01:07:45,930 of different relationships in the input sentence. 1150 01:07:45,930 --> 01:07:52,380 Well, along the same lines is this encoding called ELMo. 1151 01:07:52,380 --> 01:07:56,970 People seem to like Sesame Street characters. 1152 01:07:56,970 --> 01:08:00,090 So ELMo is based on a bi-directional LSTM. 1153 01:08:00,090 --> 01:08:02,670 So it's an older technology. 1154 01:08:02,670 --> 01:08:06,200 But what it does is, unlike Word2Vec, 1155 01:08:06,200 --> 01:08:12,000 which built an embedding for each type-- 1156 01:08:12,000 --> 01:08:17,060 so every time the word "junk" appears, 1157 01:08:17,060 --> 01:08:19,229 it gets the same embedding. 1158 01:08:19,229 --> 01:08:23,510 Here what they're saying is, hey, take context seriously. 1159 01:08:23,510 --> 01:08:26,540 And we're going to calculate a different embedding 1160 01:08:26,540 --> 01:08:32,710 for each occurrence in context of a token. 1161 01:08:32,710 --> 01:08:34,899 And this turns out to be very good. 1162 01:08:34,899 --> 01:08:38,200 Because it goes part of the way to solving 1163 01:08:38,200 --> 01:08:41,439 the word-sense disambiguation problem. 1164 01:08:41,439 --> 01:08:43,580 So this is just an example. 
1165 01:08:43,580 --> 01:08:46,899 If you look at the word "play" in GloVe, which 1166 01:08:46,899 --> 01:08:49,330 is a slightly more sophisticated variant 1167 01:08:49,330 --> 01:08:53,410 of the Word2Vec approach, you get playing, game, games, 1168 01:08:53,410 --> 01:08:57,520 played, players, plays, player, play, football, multiplayer. 1169 01:08:57,520 --> 01:09:00,390 This all seems to be about games. 1170 01:09:00,390 --> 01:09:02,740 Because probably, from the literature 1171 01:09:02,740 --> 01:09:06,130 that they got this from, that's the most common usage 1172 01:09:06,130 --> 01:09:08,350 of the word "play." 1173 01:09:08,350 --> 01:09:13,090 Whereas, using this bi-directional language model, 1174 01:09:13,090 --> 01:09:16,330 they can separate out something like, 1175 01:09:16,330 --> 01:09:18,340 "Kieffer, the only junior in the group, 1176 01:09:18,340 --> 01:09:22,550 was commended for his ability to hit in the clutch, as well as 1177 01:09:22,550 --> 01:09:24,609 his all-around excellent play." 1178 01:09:24,609 --> 01:09:27,970 So this is presumably the baseball player. 1179 01:09:27,970 --> 01:09:29,620 And here is, "They were actors who 1180 01:09:29,620 --> 01:09:33,100 had been handed fat roles in a successful play." 1181 01:09:33,100 --> 01:09:35,979 So this is a different meaning of the word play. 1182 01:09:35,979 --> 01:09:40,540 And so this embedding also has made really important 1183 01:09:40,540 --> 01:09:44,109 contributions to improving the quality of natural language 1184 01:09:44,109 --> 01:09:47,140 processing by being able to deal with the fact 1185 01:09:47,140 --> 01:09:50,620 that single words have multiple meanings not only in English 1186 01:09:50,620 --> 01:09:53,710 but in other languages. 1187 01:09:53,710 --> 01:10:00,120 So after ELMo comes BERT, which is this Bidirectional Encoder 1188 01:10:00,120 --> 01:10:02,820 Representations from Transformers. 1189 01:10:02,820 --> 01:10:07,380 So rather than using the LSTM kind of model that ELMo used, 1190 01:10:07,380 --> 01:10:10,620 these guys say, well, let's hop on the bandwagon, 1191 01:10:10,620 --> 01:10:14,790 use the Transformer-based architecture. 1192 01:10:14,790 --> 01:10:18,570 And then they introduced some interesting tricks. 1193 01:10:18,570 --> 01:10:21,510 So one of the problems with Transformers 1194 01:10:21,510 --> 01:10:25,320 is if you stack them on top of each other there 1195 01:10:25,320 --> 01:10:27,930 are many paths from any of the inputs 1196 01:10:27,930 --> 01:10:31,210 to any of the intermediate nodes and the outputs. 1197 01:10:31,210 --> 01:10:33,930 And so if you're doing self-attention, 1198 01:10:33,930 --> 01:10:38,220 you're trying to figure out where the output should 1199 01:10:38,220 --> 01:10:42,210 pay attention to the input, the answer, of course, 1200 01:10:42,210 --> 01:10:45,810 is like, if you're trying to reconstruct the input, 1201 01:10:45,810 --> 01:10:50,700 if the input is present in your model, what you will learn 1202 01:10:50,700 --> 01:10:53,250 is that the corresponding word is 1203 01:10:53,250 --> 01:10:55,950 the right word for your output. 1204 01:10:55,950 --> 01:10:58,720 So they have to prevent that from happening. 1205 01:10:58,720 --> 01:11:02,610 And so the way they do it is by masking off, 1206 01:11:02,610 --> 01:11:07,590 at each level, some fraction of the words or of the inputs 1207 01:11:07,590 --> 01:11:09,460 at that level. 
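Concretely, that masking step might look something like the sketch below. The specific fractions, masking roughly 15% of tokens, usually with a [MASK] symbol and occasionally with a random word or the original left in place, are the recipe from the BERT paper and are discussed in a moment; this is an illustration, not Google's code.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "patient", "was", "given", "aspirin", "daily"]  # toy vocabulary

def mask_tokens(tokens, mask_fraction=0.15):
    """Return (corrupted_tokens, targets): the model is trained to predict
    the original token at each selected position from the surrounding context."""
    corrupted, targets = list(tokens), {}
    n_to_mask = max(1, int(round(mask_fraction * len(tokens))))
    for idx in random.sample(range(len(tokens)), n_to_mask):
        targets[idx] = tokens[idx]
        r = random.random()
        if r < 0.8:                       # usually replace with the mask symbol
            corrupted[idx] = MASK
        elif r < 0.9:                     # sometimes inject a random other word
            corrupted[idx] = random.choice(VOCAB)
        # otherwise leave the original word in place
    return corrupted, targets

print(mask_tokens("the patient was given aspirin daily".split()))
```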
1208 01:11:09,460 --> 01:11:11,880 So what this is doing is it's a little bit 1209 01:11:11,880 --> 01:11:15,810 like the skip-gram model in Word2Vec, where it's 1210 01:11:15,810 --> 01:11:19,770 trying to predict the likelihood of some word, 1211 01:11:19,770 --> 01:11:23,100 except it doesn't know what a significant fraction 1212 01:11:23,100 --> 01:11:24,940 of the words are. 1213 01:11:24,940 --> 01:11:29,910 And so it can't overfit in the way that I was just suggesting. 1214 01:11:29,910 --> 01:11:32,820 So this turned out to be a good idea. 1215 01:11:32,820 --> 01:11:34,380 It's more complicated. 1216 01:11:34,380 --> 01:11:37,440 Again, for the details, you have to read the paper. 1217 01:11:37,440 --> 01:11:41,520 I gave both the Transformer paper and the BERT paper 1218 01:11:41,520 --> 01:11:44,010 as optional readings for today. 1219 01:11:44,010 --> 01:11:46,380 I meant to give them as required readings, 1220 01:11:46,380 --> 01:11:47,970 but I didn't do it in time. 1221 01:11:47,970 --> 01:11:50,220 So they're optional. 1222 01:11:50,220 --> 01:11:52,770 But there are a whole bunch of other tricks. 1223 01:11:52,770 --> 01:11:57,240 So instead of using words, they actually used word pieces. 1224 01:11:57,240 --> 01:12:03,690 So think about syllables and don't becomes do and apostrophe 1225 01:12:03,690 --> 01:12:06,570 t, and so on. 1226 01:12:06,570 --> 01:12:11,130 And then they discovered that about 15% of the tokens 1227 01:12:11,130 --> 01:12:15,540 to be masked seems to work better than other percentages. 1228 01:12:15,540 --> 01:12:21,720 So those are the hidden tokens that prevent overfitting. 1229 01:12:21,720 --> 01:12:26,010 And then they do some other weird stuff. 1230 01:12:26,010 --> 01:12:28,860 Like, instead of masking a token, 1231 01:12:28,860 --> 01:12:32,790 they will inject random other words from the vocabulary 1232 01:12:32,790 --> 01:12:36,810 into its place, again, to prevent overfitting. 1233 01:12:36,810 --> 01:12:39,720 And then they look at different tasks like, 1234 01:12:39,720 --> 01:12:43,020 can I predict the next sentence in a corpus? 1235 01:12:43,020 --> 01:12:44,790 So I read a sentence. 1236 01:12:44,790 --> 01:12:48,330 And the translation is not into another language. 1237 01:12:48,330 --> 01:12:52,500 But it's predicting what the next sentence is going to be. 1238 01:12:52,500 --> 01:12:56,880 So they trained it on 800 million words from something 1239 01:12:56,880 --> 01:13:02,430 called the Books corpus and about 2 and 1/2 1240 01:13:02,430 --> 01:13:06,000 million-word Wikipedia corpus. 1241 01:13:06,000 --> 01:13:07,640 And what they found was that there 1242 01:13:07,640 --> 01:13:12,360 is an enormous improvement on a lot of classical tasks. 1243 01:13:12,360 --> 01:13:15,990 So this is a listing of some of the standard tasks 1244 01:13:15,990 --> 01:13:20,980 for natural language processing, mostly not in the medical world 1245 01:13:20,980 --> 01:13:24,450 but in the general NLP domain. 1246 01:13:24,450 --> 01:13:32,280 And you see that you get things like an improvement from 80%. 1247 01:13:32,280 --> 01:13:35,880 Or even the GPT model that I'll talk about 1248 01:13:35,880 --> 01:13:39,060 in a minute is at 82%. 1249 01:13:39,060 --> 01:13:42,030 They're up to about 86%. 1250 01:13:42,030 --> 01:13:47,470 So a 4% improvement in this domain is really huge. 1251 01:13:47,470 --> 01:13:50,110 I mean, very often people publish papers 1252 01:13:50,110 --> 01:13:53,110 showing a 1% improvement. 
1253 01:13:53,110 --> 01:13:54,900 And if their corpus is big enough, 1254 01:13:54,900 --> 01:13:57,190 then it's statistically significant, 1255 01:13:57,190 --> 01:13:59,020 and therefore publishable. 1256 01:13:59,020 --> 01:14:02,590 But it's not significant in the ordinary meaning of the term 1257 01:14:02,590 --> 01:14:05,890 significant, if you're doing 1% better. 1258 01:14:05,890 --> 01:14:08,590 But doing 4% better is pretty good. 1259 01:14:08,590 --> 01:14:15,370 Here we're going from like 66% to 72% 1260 01:14:15,370 --> 01:14:17,670 from the earlier state of the art-- 1261 01:14:17,670 --> 01:14:26,410 82 to 91; 93 to 94; 35 to 60 in the CoLA task corpus 1262 01:14:26,410 --> 01:14:28,540 of linguistic acceptability. 1263 01:14:28,540 --> 01:14:32,110 So this is asking, I think, Mechanical Turk 1264 01:14:32,110 --> 01:14:36,550 people, for generated sentences, is this sentence 1265 01:14:36,550 --> 01:14:39,000 a valid sentence of English? 1266 01:14:39,000 --> 01:14:42,700 And so it's an interesting benchmark. 1267 01:14:42,700 --> 01:14:47,650 So it's producing really significant improvements 1268 01:14:47,650 --> 01:14:49,240 all over the place. 1269 01:14:49,240 --> 01:14:50,860 They trained two models of it. 1270 01:14:50,860 --> 01:14:52,750 The base model is the smaller one. 1271 01:14:52,750 --> 01:14:57,470 The large model is just trained on larger data sets. 1272 01:14:57,470 --> 01:15:01,050 Enormous amount of computation in doing this training-- 1273 01:15:01,050 --> 01:15:04,610 so I've forgotten, it took them like a month 1274 01:15:04,610 --> 01:15:08,270 on some gigantic cluster of GPU machines. 1275 01:15:08,270 --> 01:15:11,780 And so it's daunting, because you can't just 1276 01:15:11,780 --> 01:15:14,000 crank this up on your laptop and expect 1277 01:15:14,000 --> 01:15:16,018 it to finish in your lifetime. 1278 01:15:20,210 --> 01:15:23,610 The last thing I want to tell you about is this GPT-2. 1279 01:15:23,610 --> 01:15:26,780 So this is from the OpenAI Institute, 1280 01:15:26,780 --> 01:15:30,320 which is one of these philanthropically funded-- 1281 01:15:30,320 --> 01:15:33,320 I think, this one, by Elon Musk-- 1282 01:15:33,320 --> 01:15:37,910 research institute to advance AI. 1283 01:15:37,910 --> 01:15:42,900 And what they said is, well, this is all cool, but-- 1284 01:15:42,900 --> 01:15:45,260 so they were not using BERT. 1285 01:15:45,260 --> 01:15:49,520 They were using the Transformer architecture 1286 01:15:49,520 --> 01:15:53,720 but without the same training style as BERT. 1287 01:15:53,720 --> 01:15:56,780 And they said, the secret is going 1288 01:15:56,780 --> 01:16:02,930 to be that we're going to apply this not only to one problem 1289 01:16:02,930 --> 01:16:05,160 but to a whole bunch of problems. 1290 01:16:05,160 --> 01:16:08,690 So it's a multi-task learning approach that says, 1291 01:16:08,690 --> 01:16:10,880 we're going to build a better model 1292 01:16:10,880 --> 01:16:16,000 by trying to solve a bunch of different tasks simultaneously. 1293 01:16:16,000 --> 01:16:19,950 And so they built enormous models. 1294 01:16:19,950 --> 01:16:24,180 By the way, the task itself is given as a sequence of tokens. 1295 01:16:24,180 --> 01:16:26,880 So for example, they might have a task 1296 01:16:26,880 --> 01:16:31,890 that says translate to French, English text, French text. 1297 01:16:31,890 --> 01:16:36,780 Or answer the question, document, question, answer. 
1298 01:16:36,780 --> 01:16:43,400 And so the system not only learns 1299 01:16:43,400 --> 01:16:45,660 how to do whatever it's supposed to do. 1300 01:16:45,660 --> 01:16:47,990 But it even learns something about the tasks 1301 01:16:47,990 --> 01:16:52,670 that it's being asked to work on by encoding these and using 1302 01:16:52,670 --> 01:16:54,890 them as part of its model. 1303 01:16:54,890 --> 01:16:58,070 So they built four different models. 1304 01:16:58,070 --> 01:17:01,790 Take a look at the bottom one. 1305 01:17:01,790 --> 01:17:09,120 1.5 billion parameters-- this is a large model. 1306 01:17:09,120 --> 01:17:10,860 This is a very large model. 1307 01:17:13,430 --> 01:17:16,610 And so it's a byte-level model. 1308 01:17:16,610 --> 01:17:20,240 So they just said forget words, because we're trying 1309 01:17:20,240 --> 01:17:21,890 to do this multilingually. 1310 01:17:21,890 --> 01:17:25,020 And so for Chinese, you want characters. 1311 01:17:25,020 --> 01:17:29,330 And for English, you might as well take characters also. 1312 01:17:29,330 --> 01:17:32,990 And the system will, in its 1.5 billion parameters, 1313 01:17:32,990 --> 01:17:37,520 learn all about the sequences of characters that make up words. 1314 01:17:37,520 --> 01:17:39,590 And it'll be cool. 1315 01:17:39,590 --> 01:17:44,540 And so then they look at a whole bunch of different challenges. 1316 01:17:44,540 --> 01:17:48,380 And what you see is that the state of the art before they 1317 01:17:48,380 --> 01:17:54,010 did this on, for example, the Lambada data set 1318 01:17:54,010 --> 01:18:00,130 was that the perplexity of its predictions was a hundred. 1319 01:18:00,130 --> 01:18:04,300 And with this large model, the perplexity of its predictions 1320 01:18:04,300 --> 01:18:06,500 is about nine. 1321 01:18:06,500 --> 01:18:10,340 So that means that it's reduced the uncertainty of what 1322 01:18:10,340 --> 01:18:13,700 to predict next ridiculously much-- 1323 01:18:13,700 --> 01:18:16,280 I mean, by more than an order of magnitude. 1324 01:18:16,280 --> 01:18:18,920 And you get similar gains, accuracy going 1325 01:18:18,920 --> 01:18:25,700 from 59% to 63% accuracy on a-- 1326 01:18:25,700 --> 01:18:29,480 this is the children's something-or-other challenge-- 1327 01:18:29,480 --> 01:18:31,640 from 85% to 93%-- 1328 01:18:31,640 --> 01:18:37,100 so dramatic improvements almost across the board, 1329 01:18:37,100 --> 01:18:40,160 except for this particular data set, 1330 01:18:40,160 --> 01:18:42,720 where they did not do well. 1331 01:18:42,720 --> 01:18:47,880 And what really blew me away is here's 1332 01:18:47,880 --> 01:18:51,660 an application of this 1.5 billion-word model 1333 01:18:51,660 --> 01:18:56,730 that they built. So they said, OK, I give you a prompt, 1334 01:18:56,730 --> 01:18:59,490 like the opening paragraph of a Wall Street Journal 1335 01:18:59,490 --> 01:19:02,010 article or a Wikipedia article. 1336 01:19:02,010 --> 01:19:07,230 And you complete the article by using that generator idea 1337 01:19:07,230 --> 01:19:10,980 that I showed you before, that just uses the language model 1338 01:19:10,980 --> 01:19:14,520 and picks the most likely word to come next 1339 01:19:14,520 --> 01:19:17,160 and emits that as the next word. 1340 01:19:17,160 --> 01:19:20,490 So here is a prompt that says, "A train carriage containing 1341 01:19:20,490 --> 01:19:24,270 controlled nuclear materials was stolen in Cincinnati today. 1342 01:19:24,270 --> 01:19:26,070 Its whereabouts are unknown." 
1343 01:19:26,070 --> 01:19:27,450 By the way, this is made up. 1344 01:19:27,450 --> 01:19:31,230 I mean, this is not a real news article. 1345 01:19:31,230 --> 01:19:34,380 And the system comes back with a completion 1346 01:19:34,380 --> 01:19:36,900 that says, "The incident occurred on the downtown train 1347 01:19:36,900 --> 01:19:40,260 line, which runs from Covington and Ashland stations. 1348 01:19:40,260 --> 01:19:42,300 In an email to Ohio news outlets, 1349 01:19:42,300 --> 01:19:44,910 the US Department of Energy said it's 1350 01:19:44,910 --> 01:19:47,250 working with the Federal Railroad Administration 1351 01:19:47,250 --> 01:19:50,400 to find the thief," et cetera. 1352 01:19:50,400 --> 01:19:53,790 This looks astoundingly good. 1353 01:19:53,790 --> 01:19:56,650 Now, the paper from which this comes-- 1354 01:19:56,650 --> 01:19:59,220 this is actually from a blog, but they've also 1355 01:19:59,220 --> 01:20:01,560 published a paper about it-- 1356 01:20:01,560 --> 01:20:04,590 claims that these examples are not even cherry-picked. 1357 01:20:04,590 --> 01:20:09,410 If you go to that page and pick sample 1, 2, 3, 4, 5, 1358 01:20:09,410 --> 01:20:12,810 6, et cetera, you get different examples 1359 01:20:12,810 --> 01:20:15,270 that they claim are not cherry-picked. 1360 01:20:15,270 --> 01:20:17,880 And every one of them is really good. 1361 01:20:17,880 --> 01:20:21,690 I mean, you could imagine this being an actual article 1362 01:20:21,690 --> 01:20:24,090 about this actual event. 1363 01:20:24,090 --> 01:20:27,520 So somehow or other, in this enormous model, 1364 01:20:27,520 --> 01:20:30,600 and with this Transformer technology, 1365 01:20:30,600 --> 01:20:34,510 and with the multi-task training that they've done, 1366 01:20:34,510 --> 01:20:37,300 they have managed to capture so much 1367 01:20:37,300 --> 01:20:40,810 of the regularity of the English language 1368 01:20:40,810 --> 01:20:43,840 that they can generate these fake news articles based 1369 01:20:43,840 --> 01:20:48,910 on a prompt and make them look unbelievably realistic. 1370 01:20:48,910 --> 01:20:51,940 Now, interestingly, they have chosen not 1371 01:20:51,940 --> 01:20:54,400 to release that trained model. 1372 01:20:54,400 --> 01:20:57,980 Because they're worried that people will, in fact, do this, 1373 01:20:57,980 --> 01:21:02,260 and that they will generate fake news articles all the time. 1374 01:21:02,260 --> 01:21:04,360 They've released a much smaller model 1375 01:21:04,360 --> 01:21:09,010 that is not nearly as good in terms of its realism. 1376 01:21:09,010 --> 01:21:12,580 So that's the state of the art in language modeling 1377 01:21:12,580 --> 01:21:13,970 at the moment. 1378 01:21:13,970 --> 01:21:18,520 And as I say, the general domain is ahead of the medical domain. 
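The smaller released checkpoint can be tried directly; here is a sketch using the Hugging Face transformers library, which is not OpenAI's own code, and whose completions are noticeably weaker than the examples above. The lecture describes picking the most likely next word; the released demos actually sample with top-k truncation, which is what the do_sample and top_k arguments do here.

```python
from transformers import pipeline

# Loads the small released GPT-2 checkpoint (~124 million parameters),
# not the full 1.5-billion-parameter model discussed above.
generator = pipeline("text-generation", model="gpt2")

prompt = ("A train carriage containing controlled nuclear materials "
          "was stolen in Cincinnati today. Its whereabouts are unknown.")

out = generator(prompt, max_length=100, do_sample=True, top_k=40)
print(out[0]["generated_text"])
```

Even so, as noted, the general-domain models are ahead of what exists for medical text.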
1379 01:21:18,520 --> 01:21:20,530 But you can bet that there are tons 1380 01:21:20,530 --> 01:21:24,040 of people who are sitting around looking at exactly 1381 01:21:24,040 --> 01:21:27,250 these results and saying, well, we 1382 01:21:27,250 --> 01:21:29,590 ought to be able to take advantage of this 1383 01:21:29,590 --> 01:21:33,310 to build much better language models for the medical domain 1384 01:21:33,310 --> 01:21:36,670 and to exploit them in order to do phenotyping, in order 1385 01:21:36,670 --> 01:21:41,200 to do entity recognition, in order to do inference, 1386 01:21:41,200 --> 01:21:43,420 in order to do question answering, 1387 01:21:43,420 --> 01:21:47,156 in order to do any of these kinds of topics. 1388 01:21:47,156 --> 01:21:51,030 And I was talking to Patrick Winston, who 1389 01:21:51,030 --> 01:21:54,660 is one of the good old-fashioned AI people, 1390 01:21:54,660 --> 01:21:56,970 as he characterizes himself. 1391 01:21:56,970 --> 01:22:00,090 And the thing that's a little troublesome about this 1392 01:22:00,090 --> 01:22:04,770 is that this technology has virtually nothing 1393 01:22:04,770 --> 01:22:07,470 to do with anything that we understand 1394 01:22:07,470 --> 01:22:11,670 about language or about inference or about question 1395 01:22:11,670 --> 01:22:15,010 answering or about anything. 1396 01:22:15,010 --> 01:22:19,140 And so one is left with this queasy feeling that, 1397 01:22:19,140 --> 01:22:22,530 here is a wonderful engineering solution to a whole set 1398 01:22:22,530 --> 01:22:24,870 of problems, but it's unclear how 1399 01:22:24,870 --> 01:22:29,110 it relates to the original goal of artificial intelligence, 1400 01:22:29,110 --> 01:22:31,830 which is to understand something about human intelligence 1401 01:22:31,830 --> 01:22:35,160 by simulating it in a computer. 1402 01:22:35,160 --> 01:22:38,410 Maybe our BCS friends will discover 1403 01:22:38,410 --> 01:22:42,780 that there are, in fact, transformer mechanisms deeply 1404 01:22:42,780 --> 01:22:44,670 buried in our brain. 1405 01:22:44,670 --> 01:22:46,830 But I would be surprised if that turned out 1406 01:22:46,830 --> 01:22:48,960 to be exactly the case. 1407 01:22:48,960 --> 01:22:52,480 But perhaps there is something like that going on. 1408 01:22:52,480 --> 01:22:54,930 And so this leaves an interesting scientific 1409 01:22:54,930 --> 01:22:57,180 conundrum of, exactly what have we 1410 01:22:57,180 --> 01:23:02,040 learned from this type of very, very successful model building? 1411 01:23:02,040 --> 01:23:02,760 OK. 1412 01:23:02,760 --> 01:23:03,540 Thank you. 1413 01:23:03,540 --> 01:23:06,590 [APPLAUSE]