PROFESSOR: So today we'll be continuing along the theme of risk stratification. I'll spend the first half to two-thirds of today's lecture continuing where we left off last week, before the discussion. I'll talk about how one derives the labels that one uses within a supervised machine learning approach. I'll continue talking about how one evaluates risk stratification models. And then I'll talk about some of the subtleties that arise when you want to use machine learning for health care, specifically for risk stratification. I think that's going to be one of the most interesting parts of today's lecture.

In the last third of today's lecture, I'll be talking about how one can rethink the supervised machine learning problem, not as a classification problem, but as something closer to a regression problem. One then asks not, for example, will someone develop diabetes within one to three years from now, but when precisely will they develop diabetes-- the time to event. Then one has to start to think very carefully about the censoring issues that I alluded to last week. And so I'll formalize those notions in the language of survival modeling, and I'll talk about how one can do maximum likelihood estimation in that setting, and how one should do evaluation in that setting.

So in our lecture last week, I gave you this example of risk stratification for type 2 diabetes. The goal, just to remind you, was as follows. 25% of people in the United States have undiagnosed type 2 diabetes. If we could take health insurance claims data, which is available for everyone who has health insurance, and use it to predict who, in the near term-- the next one to three years-- is likely to be newly diagnosed with type 2 diabetes, then we could use it to risk-stratify the patient population.
We could then use that to figure out who is most at risk, and do interventions for those patients to try to get them diagnosed and started on treatment, if relevant. But what I didn't talk much about was where those labels come from. How do we know that someone had a diabetes onset in that window that I show up there on the top?

So what are the answers? All of you should have read the paper by Razavian, and hopefully you have some ideas. Thoughts? A hint-- it was in the supplementary material. How did we define a positive case in that paper? Yep.

AUDIENCE: Drugs they were on.

PROFESSOR: Drugs they were on. OK, yeah, so for example, metformin, glucose-- sorry, insulin.

AUDIENCE: I think they did include metformin, actually.

PROFESSOR: Metformin is a tricky case, because metformin is often used for alternative indications. But there are many medications, such as insulin, which are used pretty exclusively for treating diabetes. And so you can look to see, does the patient have a record of taking one of these diabetic medications in the window that we're using to define the outcome? If you see a record of a medication, you might conjecture that this patient probably has diabetes. But what if they don't have any medication listed in that time window? What could you conclude then? Any ideas? Yeah.

AUDIENCE: If you look at the HbA1c value, and you know the normal range, and you see the [INAUDIBLE] above, like, 7.5 or 7.

PROFESSOR: So you're giving me an alternative approach-- not looking at medications, but looking at laboratory test results. Look at their HbA1c results, which measure approximately the average glucose value over the past three months. If that's out of range, then they're diabetic. And that is, in fact, usually used as a definition of diabetes. But that didn't answer my original question. Why is just looking at diabetic medications not enough?
AUDIENCE: Some of the diabetic medications can be used to treat other conditions.

PROFESSOR: Sometimes there's ambiguity in diabetic medications. But we've sort of dealt with that already by trying to choose an unambiguous set. What are other reasons?

AUDIENCE: You're starting with the medicine at the onset of diabetes [INAUDIBLE].

PROFESSOR: Oh, that's a really interesting point-- not the one I was thinking about, but I like it-- which is that a patient might have been diagnosed with type 2 diabetes, but, for whatever reason, in that communication between provider and patient, they decided not to start treatment yet. So they might not yet be on treatment for diabetes, yet the whole health care system might be very well aware that the patient is diabetic, in which case doing these interventions for that patient might be irrelevant. Yep, another reason?

AUDIENCE: A lot of people are just not diagnosed with diabetes, but they have it. So one label means that they have diabetes, and the other label is a combination of people who have and don't have diabetes.

PROFESSOR: So the point was, often you just might not be diagnosed with diabetes. That, unfortunately, is not something we're going to be able to solve here. It is an issue, but we have no solution for it. No, rather, there's a different point I want to get at, which is that this data has biases in it. Even if a patient is on a diabetes medication, they might, for whatever reason, be paying cash for those medications. And if they're paying cash, there won't be any record of the patient taking those medications in the health insurance claims, because the health insurer didn't have to pay for them. But the reason you gave is also a very interesting one, and both of them are valid. So for all of these reasons, just looking at the medications alone is going to be insufficient.
And as was just suggested a moment ago, looking at other indicators-- like, for example, does the patient have an abnormal blood glucose or HbA1c value-- would also provide information. So it's non-trivial, right? And part of what you're going to be doing in your next problem set, problem set 2, is thinking through how one actually does this cohort construction-- not just what your inclusion/exclusion criteria are, but also how you really derive those labels from the data set.

Now, the traditional answer to this has two steps. Step 1 is to manually label some patients. So you take a few hundred patients, and you go through their data. You actually look at their data and decide, is this patient diabetic or not? The reason you have to do that is because what you might think of as obvious-- like, oh, if they're on diabetes medication, they're diabetic-- often has flaws to it. And until you really dig down and look at the data, you might not recognize that that criterion has a flaw in it. So that chart review is really an essential part of this process.

The second step is, how do you generalize to get that label for everyone in your population? And there, there are usually two different types of approaches. The first is to come up with some simple rule and extrapolate it to everyone-- for example, do they have, A, a diabetes medication, or, B, an abnormal lab test result? You could then apply that rule to everyone. But even those rules can be really tricky to derive, and I'll show you an example of that in just a moment. And as we know, machine learning is sometimes a good alternative to coming up with a rule. So there's a second approach, now more and more commonly used in the literature, which is to use machine learning itself to derive the labels.
And this is a bit subtle, because it's machine learning for machine learning, so I want to break that down for one second. When you're trying to derive the labels, what you want to know is not, at time T, what's going to happen at time T plus W and onwards-- that's the original machine learning task that we set out to solve-- but rather, given everything you know about the patient, including the future data, is this patient newly diagnosed with diabetes in the window that I show in black there, from T plus W onward? OK? So, for example, this new machine learning problem could take as input lab test results, medications, and a whole bunch of other data. You then use the few examples you labeled in step 1 to try to predict, is this patient currently diabetic or not? You then use that model to extrapolate to the whole population, and now you have your outcome label. It might be a little bit imperfect, but hopefully it's much better than what you could have gotten with a rule. And then, using those outcome labels, you solve your original machine learning problem. Is that clear? Any questions?

AUDIENCE: I have one.

PROFESSOR: Yep.

AUDIENCE: How do you evaluate yourself, then, if you have these labels that were produced with machine learning, which are probabilistic?

PROFESSOR: So that's where this first step is really important. You've got to get ground truth somehow. And of course, once you have that ground truth, you split it into a train set and a validate set. You fit your machine learning algorithm on the train set, and you look at its performance metrics on the validate set for the label prediction problem. That's how you get confidence in it.
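As a minimal sketch of this two-stage idea, consider the following scikit-learn pipeline: a label model is fit on a small chart-reviewed set using features from the entire record (past and future), it is applied to the full population to produce derived labels, and the actual risk model is then fit using only past data. The data below are synthetic stand-ins, and the 0.5 cutoff and feature split are illustrative assumptions, not the choices made in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, purely illustrative): full-record features
# (past AND future data) for 300 chart-reviewed patients, plus the 0/1
# labels a human assigned during chart review.
X_full_reviewed = rng.normal(size=(300, 20))
y_reviewed = (X_full_reviewed[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Step 1: the label model ("machine learning for machine learning").
label_model = LogisticRegression(max_iter=1000)
label_model.fit(X_full_reviewed, y_reviewed)

# Extrapolate the label to the whole population (here 10,000 patients).
X_full_all = rng.normal(size=(10_000, 20))
p_current = label_model.predict_proba(X_full_all)[:, 1]
y_derived = (p_current >= 0.5).astype(int)      # assumed cutoff

# Step 2: the original risk task -- predict the derived label using only
# features available up to prediction time T (here a column subset stands
# in for "past-only" data).
X_past_all = X_full_all[:, :10]
risk_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
risk_model.fit(X_past_all, y_derived)
```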
But let's try to break this down a little bit. First of all, what does this chart review step look like? Well, if it's an electronic health record system, what you often do is pull up Epic, or Cerner, or whatever the commercial EHR system is, and you actually start looking at the patient's data. You read notes written by previous doctors about this patient, and you look at their blood test results across time and the medications they're on. From that, you can usually tell a pretty coherent story about what's going on with the patient. Of course, even better-- the best way to get data-- is to do a prospective study, where you actually have a research assistant standing in the room when a patient walks in to see a provider. They talk to the patient, and they take down very clear notes about what this patient has and what they don't have. But that's usually too expensive to do prospectively, so usually we do this retrospectively.

Now, if you're working with health insurance claims data, you usually don't have the luxury of looking at notes. And so what we typically do in my group is build a visualization tool. By the way, I'm a machine learning person-- I don't know anything about visualization, nor do I claim to be good at it. But you can't do the machine learning work unless you really understand your data. So we had to build this tool in order to look at the data, in order to do that first step of understanding: did we even characterize diabetes correctly? I'm not going to go deep into it-- by the way, you can download this; it's an open-source tool-- but roughly, what I'm showing you here is one patient's data. On the x-axis is time, going from April to December. On the y-axis, I'm showing events as they occurred. In orange are diagnosis codes that were recorded for the patient. In green are procedure codes. In blue are laboratory tests. And if you see multiple dots along the same line, it means that same lab test was performed multiple times.
You can click on a dot to see what the result was. In this way, you can start to tell a coherent story about what's going on with the patient. All right, so tools like this are what you're going to need to be able to do that first step from something like health insurance claims data.

Now, traditionally, that first step-- which leads you to label some data, and then, from there, to come up with these rules or run a machine learning algorithm to get the label-- is usually a paper in itself. Of course, it's not of interest to the computer science community, but it's of extreme interest to the health care community. So usually there's a first academic paper which evaluates this process for deriving the label, and then there are much later papers which talk about what you can do with that label, such as the machine learning problem we originally set out to solve.

So let's look at an example of one of those rules. Here is a rule for deriving, from health insurance claims data, whether a patient has type 2 diabetes. Now, this isn't quite the same one that we used in that paper, but it gets the idea across. First, you look to see, did the patient have a diagnosis code for type 1 diabetes? If the answer is no, you continue. If the answer is yes, they're ruled out, because you say, OK, this patient's abnormal blood test results are because they have type 1 diabetes, not type 2 diabetes. Type 1 diabetes-- which you can think of as juvenile diabetes-- is usually diagnosed much earlier, and there's a different mechanism behind it. Then you look at other things: OK, is there a diagnosis code for type 2 diabetes somewhere in the patient's data? If so, you go to the right, and you look to see, is there a medication, an Rx, for type 1 diabetes in the data? If the answer is no, you continue down this way. If the answer is yes, you go this way. A yes on a type 1 diabetes medication doesn't alone rule out the patient.
Because maybe the same medications are used for type 1 as for type 2, there are some other things you need to do there. And you can see that this starts to become complicated really quickly. These manual, rule-based approaches are usually designed to have pretty high positive predictive value, but they end up having pretty bad recall, in that they don't find all of the patients. And that's really why the machine-learning-based approaches end up being so important for this type of problem.

Now, this is just one example of what I call a phenotype-- that's just what the literature calls it. It's a phenotype for type 2 diabetes. And the word phenotype, in this context, means exactly the same thing as the label. Yep.

AUDIENCE: What does abnormal mean?

PROFESSOR: For example, if the HbA1c result is 6.5 or higher, you might say the patient has diabetes.

AUDIENCE: OK, so this is a lab result, not a medical--

PROFESSOR: Correct, yeah, thanks. Other questions?

AUDIENCE: What's the phenotype-- which part exactly is the phenotype, like, the whole thing?

PROFESSOR: The whole thing, yeah. So the construction where you follow this decision tree and get to a conclusion, which is "case," meaning yes, they're type 2 diabetic. And if you never reach this point, then the answer is no, they're not type 2 diabetic. That labeling is what we're calling the phenotype of type 2 diabetes. Now, later in the semester, people will use the word phenotype to mean something else-- it's an overloaded term-- but this is what it's called in this context as well.
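To make the decision-tree idea concrete, here is a toy sketch of a rule-based phenotype in the spirit of the diagram, not the actual published algorithm. The record fields, the specific branch logic, and the HbA1c cutoff of 6.5 are illustrative assumptions.

```python
def type2_diabetes_phenotype(patient):
    """Toy rule-based phenotype: returns True if the patient is labeled a
    type 2 diabetes case. `patient` is assumed to be a dict of pre-extracted
    claims features; the field names are made up for this sketch."""
    # A type 1 diagnosis code rules the patient out up front.
    if patient["has_t1d_dx_code"]:
        return False
    # With a type 2 diagnosis code, require either no type 1 medication or
    # some corroborating evidence (e.g., a type 2 medication as well).
    if patient["has_t2d_dx_code"]:
        if not patient["has_t1d_rx"]:
            return True
        return patient["has_t2d_rx"]
    # Without a diagnosis code, fall back to medication plus abnormal lab.
    return patient["has_t2d_rx"] and patient["max_hba1c"] >= 6.5


# Example: a type 2 diagnosis code and no type 1 medication labels a case.
print(type2_diabetes_phenotype({
    "has_t1d_dx_code": False, "has_t2d_dx_code": True,
    "has_t1d_rx": False, "has_t2d_rx": False, "max_hba1c": 5.6,
}))  # True
```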
Now, here's an example of a website-- it's from the PheKB project-- where you will find tens to close to a hundred of these phenotypes, arduously created for a whole range of different conditions. If you go to this website and click on any one of these conditions-- like appendicitis, autism, cataracts-- you'll see a different diagram of the sort I just showed you. So this is a real thing. This is something the medical community really needs to do in order to derive the label that we can then use in our machine learning task.

AUDIENCE: I'm just curious, is the lab value ground truth? Like, if somebody has diabetes, then they must have [INAUDIBLE]. It means they have been diagnosed, and they must have--

PROFESSOR: Well, so, for example, you might have an abnormal glucose value for a variety of reasons. One reason is that you might have what's called gestational diabetes, which is diabetes induced by pregnancy. And those patients-- although it's a predictive factor-- don't always go on to have long-term type 2 diabetes. So even the laboratory test alone doesn't tell the whole story.

AUDIENCE: Could you be diagnosed without having an abnormal lab value?

PROFESSOR: That's much less common. The story will change in the future, because there will be a whole range of new diagnostic techniques that might use new modalities, like gene expression, for example. But typically, today, the answer is yes to that. Yep.

AUDIENCE: If these are made by doctors, does that mean, for every single disease, there's one definitive phenotype?

PROFESSOR: These are usually made by health outcomes researchers, who usually have clinicians on their team. But the people who work on these often come from the field of epidemiology, for example. And so what was your question again?

AUDIENCE: Is there just one phenotype for every single disease?

PROFESSOR: Is there one phenotype for every different disease? In the ideal world, you'd have at least one phenotype for every single disease that could possibly exist.
Now, of course, you might be interested in different aspects. You might be interested in knowing not just does the patient have autism, but where they are on the autism spectrum. You might want to know not just, do they have it now, but also when did they get it. So there are a lot of subtleties that could go into this. But building these up is really slow. And validating them, to make sure they're going to work across multiple data sets, is really challenging, and usually gives a negative result. So it's been a very slow process to do this manually, which has led me and many others to start thinking about machine learning approaches for doing it automatically.

AUDIENCE: Just as a follow-up, is there any case where there are, like, five autism phenotypes, for example, or multiple competing ones?

PROFESSOR: Yes. There are often many different such rule-based systems that give you conflicting results. That happens all the time.

AUDIENCE: Can these rule-based systems provide an estimate of when the condition's onset was?

PROFESSOR: Right, so that's getting at one of the subtleties I just mentioned-- can these tell you when the onset happened? They're not typically designed to do that, but one can come up with a version that does. One way is to change the rules to have a time period associated with them, and then apply them in a sliding window over the patient data to see when they first trigger. That would be one way to get a sense of when onset was. But there are a lot of subtleties to that, too. So I'm going to move on now. I just wanted to give you some sense of what deriving the labels ends up looking like.

Let's now turn to evaluation. A very commonly used approach in this field is to compute what's known as the receiver operating characteristic curve, or ROC curve. And what this looks at is the following.
First of all, this is well defined for a binary classification problem in which you're using a model that outputs, let's say, a probability or some other continuous value. If you want to make a hard prediction, you usually threshold that continuous-valued output: if it's greater than 0.5, the prediction is 1; if it's less than 0.5, the prediction is 0. But here we might be interested not just in what minimizes, let's say, 0-1 loss; we might also be interested in trading off false positives against false negatives. And so you might choose different thresholds, and you might want to quantify what those trade-offs look like for different choices of threshold on this continuous-valued prediction. That's what the ROC curve shows you. As you move the threshold, you can compute, for every single threshold, what the true positive rate is and what the false positive rate is. That gives you a point, and trying all possible thresholds gives you a curve.

Then you can compare curves from different machine learning algorithms. For example, here I'm showing you, in the green line, the predictive model obtained using what we're calling the traditional risk factors-- something like eight or ten risk factors for type 2 diabetes that are very commonly used in the literature. Versus, in blue, what you'd get if you just used a naive L1-regularized logistic regression model with no domain knowledge-- just throw in the bag of features. And you want to be up there. You want to be in that top left corner; that's the goal here. So you would like that blue curve to be up there, and then all the way to the right.
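As a minimal sketch of how such a curve is traced out, here is one way to sweep the threshold over a vector of continuous scores and collect the (false positive rate, true positive rate) pairs; the label and score arrays are made-up examples, and in practice a library routine such as sklearn.metrics.roc_curve does the same thing.

```python
import numpy as np

def roc_curve_points(y_true, scores):
    """Sweep a threshold over the scores and return (fpr, tpr) pairs."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = (y_true == 1).sum()
    neg = (y_true == 0).sum()
    points = []
    # Include +inf so the curve starts at (0, 0).
    for t in np.concatenate(([np.inf], np.sort(scores)[::-1])):
        pred = scores >= t
        tpr = (pred & (y_true == 1)).sum() / pos   # true positive rate
        fpr = (pred & (y_true == 0)).sum() / neg   # false positive rate
        points.append((fpr, tpr))
    return points

# Tiny example: two positives, two negatives.
print(roc_curve_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```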
Now, one way to try to quantify, in a single number, how useful any one ROC curve is, is to look at what's called the area under the ROC curve, or AUC. Mathematically, this is exactly what you'd expect: it is the area under that curve, so you can just integrate the curve and get a number out. Now, remember, I told you that you want to be up in that corner, and so the goal is to get an area under the ROC curve of 1.

Now, what would a random prediction give you? Any idea? If you were to just flip a coin and guess-- what do you think?

AUDIENCE: 0.5.

PROFESSOR: 0.5?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Well, I was a little bit misleading when I said you just flip a coin. You have to flip coins with different noise rates, and each of those gets you a different place along this curve. If you look at the curve that you get from those random guesses, it's going to be the straight line from 0 to 1. And as you said, that will have an AUC of 0.5. So 0.5 is random guessing, 1 is perfect, and your algorithm is going to be somewhere in between.

Now, of relevance to the rest of today's lecture is an alternative way of computing the area under the ROC curve. One way to compute it is literally as I said: you create that curve, and you integrate to get the area under it. But one can show mathematically-- I'm not going to give you the derivation here, but you can look it up on Wikipedia-- that an equivalent way of computing the area under the ROC curve is to compute the probability that the algorithm ranks a positive-labeled patient above a negative-labeled patient. So mathematically, what I'm talking about is the following. Consider pairs of patients xi and xj, where xi is a patient with label yi = 1 and xj is a patient with label yj = 0.
You're going to sum over all choices of i and j such that yi and yj have different labels, that is, yi = 1 and yj = 0. Suppose you're using a linear model here, so the score your model assigns to a patient x is w · x. What you want is for w · xj to be smaller than w · xi. Remember, the j-th data point is the one that got the label 0, and the i-th data point is the one that got the label 1. So we want the score of the data point that should have been a 1 to be higher than the score of the data point that should have gotten the label 0. You just count up-- with an indicator function-- how many of those pairs were correctly ordered, and then normalize by the total number of comparisons you make:

AUC = (1 / number of positive-negative pairs) × Σ over {i : yi = 1} and {j : yj = 0} of 1[ w · xi > w · xj ].

It turns out that this is exactly equal to the area under the ROC curve. And it makes clear that this is a notion that really cares about ranking. Are you getting the ranking of patients correct? Are you ranking the ones who should have been given a 1 higher than the ones that should have gotten the label 0?

Importantly, this whole measure is invariant to label imbalance. So you might have a very imbalanced data set, but if you were to re-sample it into a balanced data set, the AUC of your predictive model wouldn't change. That's a nice property to have when it comes to evaluating settings where you might have artificially created a balanced data set for computational reasons: even though the true setting is imbalanced, at least you know the numbers are going to be the same in both settings.
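A minimal sketch of that pairwise computation, assuming a NumPy array of 0/1 labels and an array of model scores (the variable names are illustrative); on small data it agrees with the integrated-curve value from, say, sklearn.metrics.roc_auc_score.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties in the scores count as half, matching the usual convention."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos_scores = scores[y_true == 1]
    neg_scores = scores[y_true == 0]
    # Compare every positive score against every negative score.
    diffs = pos_scores[:, None] - neg_scores[None, :]
    correct = (diffs > 0).sum() + 0.5 * (diffs == 0).sum()
    return correct / (len(pos_scores) * len(neg_scores))

y = np.array([0, 0, 1, 1])
s = np.array([0.1, 0.4, 0.35, 0.8])
print(pairwise_auc(y, s))   # 0.75 for this tiny example
```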
On the other hand, it also has disadvantages, because often you don't care about performance along the entire curve; you care about particular parts of it. So, for example, in last week's lecture, I argued that what we often really care about is just the positive predictive value at a particular threshold. And we want that to be as high as possible for as few people as possible-- like, find the 100 most risky people, and look at what fraction of them actually developed type 2 diabetes. In that setting, what you're really looking at is this part of the curve. And so it turns out there are generalizations of the area under the curve that focus on parts of the curve, and that goes by the name of partial AUC. For example, if you just integrated the curve from a false positive rate of 0 up to, let's say, 0.1, you would still get a number with which to compare two different curves, but it would be focusing on the region of the curve that's actually relevant for your predictive purposes, for your task at hand.
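For instance, here is a sketch of a partial AUC restricted to false positive rates below 0.1, using scikit-learn's max_fpr option; the labels and scores are made-up placeholders.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
s = np.array([0.05, 0.20, 0.55, 0.90, 0.40, 0.60, 0.70, 0.95])

full_auc = roc_auc_score(y, s)
# Partial AUC over the low-false-positive-rate region only. Note that
# scikit-learn reports it standardized (McClish corrected), so 0.5 is
# still chance and 1.0 is still perfect.
partial = roc_auc_score(y, s, max_fpr=0.1)
print(full_auc, partial)
```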
So that's all I want to say about ROC curves. Any questions? Yep.

AUDIENCE: Could you talk more about what the drawbacks were of using this? Is the class imbalance, then, always a positive effect?

PROFESSOR: So the thing is, when you want to use this approach, depending on how you're using the [INAUDIBLE], you might not be able to tolerate a 0.8 false positive rate. In some sense, what's going on in this part of the curve might be completely irrelevant for your task. And so one of these curves might look like it's doing really, really well over here, and pretty poorly over here, but if you're looking at the full area under the ROC curve, you won't notice that. That's one of the big problems. Yeah.

AUDIENCE: And when would you use this versus precision-recall, or--

PROFESSOR: Yeah, so a lot of the community is interested in precision-recall curves. And precision-recall curves, as opposed to ROC curves, have the property that they are not invariant to class imbalance, which in many settings is of interest, because it allows you to capture these types of quantities. I'm not going to go into depth about the reasons for one versus the other, but that's something you can read up about, and I encourage you to post to Piazza about it so we can have a discussion there.

So the last evaluation quantity I want to talk about is known as calibration. Calibration, as I've defined it here, has to do with binary classification problems. Now, before you dig into this figure, which I'll explain in a moment, let me just give you the gist of what I mean by calibration. Suppose your model outputs a probability-- you do logistic regression, and you get a probability out. And your model says, for these 10 patients, that their likelihood of dying in the next 48 hours is 0.7. Suppose that's what your model output. If you were on the receiving end of that result-- you heard that 0.7-- what should you expect about those 10 people? What fraction of them should actually die in the next 48 hours? Everyone can scream out loud.

[INTERPOSING VOICES]

PROFESSOR: So, seven of them. Seven of the 10 you would expect to die in the next 48 hours, if the probability output for all of them was 0.7. All right, that's what I mean by calibration. If, on the other hand, you found that only one of them died, then the number you're outputting would be very strange. And the reason this notion of calibration, which I'll define formally in a second, is so important is that you're outputting a probability without really knowing how that probability is going to be used. If you had some task loss in mind, and you knew that all that mattered was the actual prediction, 1 or 0, then that would be fine.
But often, predictions in machine learning are used in a much more subtle way. For example, your doctor might have more information than your computer has, and so they might want to take the result the computer predicts and weigh it against other evidence. Or, in some settings, it's not just about weighing it against other evidence; maybe it's also about making a decision. And that decision might take into account a utility-- for example, a patient's preference regarding suffering versus getting a treatment that could have big, adverse consequences. That's something Pete is going to talk about much more later in the semester, I think-- how to formalize that notion. At this point, I just want to get across that the probabilities themselves can be important, and that having the probabilities be meaningful is something one can quantify.

So how do we quantify it? Well, one way to try to quantify it is to create the following plot-- call it a histogram. On the x-axis is the predicted probability; that's what I mean by p-hat. On the y-axis is the true probability-- what I mean when I say the fraction of individuals with that predicted probability who actually got the positive outcome. What we would like to see is a straight line, meaning these two should always be equal. In the example I gave, remember, there were a bunch of people with a predicted probability of 0.7, but only one of them actually got the positive outcome. So that would have been something like over here, whereas you would have expected it to be over there. So you might ask, how do I create such a plot from finite data? Well, a common way to do so is to bin your data. You'll create intervals: this bin is the bin from 0 to 0.1.
This bin is the bin from 0.1 to 0.2, and so on. Then you look to see: of the people whose predicted probability was between 0 and 0.1, how many actually died? You get a number out. And now here's where I can go to this plot-- that's exactly what I'm showing you here. For now, ignore the bar charts at the bottom and just look at the lines. Let's focus on the green line. Here I'm showing you several different models, but for now, just focus on the green one. The green line, by the way, looks pretty good-- it's almost a straight line. So how did I compute it? Well, first of all, notice the number of tick marks: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. There are 10 points along this line, and each of them corresponds to one of these bins. The first point is the 0 to 0.1 bin, the second point is the 0.1 to 0.2 bin, and so on. That's how I computed this.

The next thing to notice is that I have confidence intervals. The reason I compute these confidence intervals is that sometimes you just might not have much data in one of these bins. For example, suppose your algorithm almost never says that someone has a predicted probability of 0.99. Then, until you get a ton of data, you're not going to know what fraction of those individuals actually went on to develop the event, and you should be looking at the confidence interval of this line, which takes that into consideration. A different way to understand that notion, now looking at the numbers, is what I'm showing you in the bar charts at the bottom. There, I'm showing you the number-- or the fraction-- of individuals who actually got each predicted probability.
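A minimal sketch of that binning computation, assuming arrays of 0/1 outcomes and predicted probabilities (the names and the ten equal-width bins are illustrative choices); sklearn.calibration.calibration_curve computes essentially the same summary.

```python
import numpy as np

def calibration_bins(y_true, p_hat, n_bins=10):
    """For each probability bin, return (mean predicted probability,
    observed fraction of positives, number of patients in the bin)."""
    y_true = np.asarray(y_true, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    inner_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    which = np.digitize(p_hat, inner_edges)   # bin index 0 .. n_bins-1
    rows = []
    for b in range(n_bins):
        mask = which == b
        if mask.any():
            rows.append((p_hat[mask].mean(), y_true[mask].mean(), int(mask.sum())))
        else:
            rows.append((np.nan, np.nan, 0))  # empty bin: no estimate
    return rows

# Toy example: by construction these predictions are perfectly calibrated,
# so observed rates track the predicted probabilities bin by bin. A model
# that says 0.7 when only 1 of 10 such patients dies would instead show a
# point far below the diagonal in that bin.
rng = np.random.default_rng(0)
p = rng.uniform(size=1000)
y = rng.uniform(size=1000) < p
for mean_p, frac, n in calibration_bins(y, p):
    print(f"{mean_p:.2f}  {frac:.2f}  n={n}")
```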
786 00:33:47,220 --> 00:33:49,512 It's a slightly different problem than the diabetes one 787 00:33:49,512 --> 00:33:50,520 we looked at earlier. 788 00:33:50,520 --> 00:33:54,960 And it's using a bag of words model from clinical text. 789 00:33:54,960 --> 00:34:01,020 The red line is using just chief complaint. 790 00:34:01,020 --> 00:34:03,567 So it's using one piece of structured data 791 00:34:03,567 --> 00:34:05,400 that you get at one point of time in the ER. 792 00:34:05,400 --> 00:34:10,960 So it's using very little information. 793 00:34:10,960 --> 00:34:17,199 And you can see that both models are somewhat well calibrated. 794 00:34:17,199 --> 00:34:19,800 But the intervals-- the confidence 795 00:34:19,800 --> 00:34:22,679 intervals of both the red and the purple lines 796 00:34:22,679 --> 00:34:25,389 gets really big towards the end. 797 00:34:25,389 --> 00:34:26,969 And if you look at these bar charts, 798 00:34:26,969 --> 00:34:29,760 it explains why, because the models 799 00:34:29,760 --> 00:34:35,190 that use less information end up being much more risk-averse. 800 00:34:35,190 --> 00:34:38,010 So they will never predict a very high probability. 801 00:34:38,010 --> 00:34:40,502 They will always sort of stay in this lower regime. 802 00:34:40,502 --> 00:34:42,960 And that's why we have very big confidence intervals there. 803 00:34:46,340 --> 00:34:50,159 OK, so that's all I want to say about evaluation. 804 00:34:50,159 --> 00:34:52,020 And I won't take any questions on this right 805 00:34:52,020 --> 00:34:53,395 now, because I really want to get 806 00:34:53,395 --> 00:34:55,560 on to the rest of the lecture. 807 00:34:55,560 --> 00:34:57,852 But again, if you have any questions, post to Piazza, 808 00:34:57,852 --> 00:34:59,810 and I'm happy to discuss them with you offline. 809 00:35:03,210 --> 00:35:06,990 So, in summary, we've talked about how 810 00:35:06,990 --> 00:35:11,610 to reduce risk stratification to binary classification. 811 00:35:11,610 --> 00:35:13,470 I've told you how to derive the labels. 812 00:35:13,470 --> 00:35:15,880 I've given you one example of machine learning algorithm 813 00:35:15,880 --> 00:35:19,440 you can use, and I talked to you about how to evaluate it. 814 00:35:19,440 --> 00:35:20,890 What could possibly go wrong? 815 00:35:23,570 --> 00:35:26,335 So let's look at some examples. 816 00:35:26,335 --> 00:35:28,960 And these are a small number of examples of what could possibly 817 00:35:28,960 --> 00:35:29,780 go wrong. 818 00:35:29,780 --> 00:35:31,680 There are many more. 819 00:35:31,680 --> 00:35:33,340 So here's some data. 820 00:35:33,340 --> 00:35:35,950 I'm showing you-- for the same problem 821 00:35:35,950 --> 00:35:38,260 we looked at before, diabetes onset, I'm 822 00:35:38,260 --> 00:35:44,050 showing you the prevalence of type 2 diabetes as recorded by, 823 00:35:44,050 --> 00:35:47,926 let's say, diagnosis codes across time. 824 00:35:47,926 --> 00:35:49,450 All right, so over here is 1980. 825 00:35:49,450 --> 00:35:53,290 Over here is 2012. 826 00:35:53,290 --> 00:35:54,340 Look at that. 827 00:35:54,340 --> 00:35:56,088 It is not a flat line. 828 00:35:56,088 --> 00:35:57,130 Now, what does that mean? 829 00:35:57,130 --> 00:36:01,720 Does that mean that the population is eating much more 830 00:36:01,720 --> 00:36:06,810 unhealthy from 1980 to 2012, and so more people 831 00:36:06,810 --> 00:36:08,890 are becoming diabetic? 832 00:36:08,890 --> 00:36:11,230 That would be one plausible answer. 
833 00:36:11,230 --> 00:36:17,660 Another plausible explanation is that something has changed. 834 00:36:17,660 --> 00:36:21,670 So in fact I'm showing you with these blue lines, well, 835 00:36:21,670 --> 00:36:25,240 in fact, there was a change in the diagnostic criteria 836 00:36:25,240 --> 00:36:27,790 for diabetes. 837 00:36:27,790 --> 00:36:29,740 And so now the patient population actually 838 00:36:29,740 --> 00:36:31,390 didn't change much between, let's say, 839 00:36:31,390 --> 00:36:33,130 this time point at that time point. 840 00:36:33,130 --> 00:36:37,390 But what really led it to this big uptick, 841 00:36:37,390 --> 00:36:40,300 according to one theory, is because the diagnostic criteria 842 00:36:40,300 --> 00:36:41,460 changed. 843 00:36:41,460 --> 00:36:43,240 So who we're calling diabetic has changed. 844 00:36:43,240 --> 00:36:46,460 Because diseases are, at the end of the day, 845 00:36:46,460 --> 00:36:51,760 a human-made concept, you know, what do we call some disease. 846 00:36:51,760 --> 00:36:55,747 And so the data is changing, as you see here. 847 00:36:55,747 --> 00:36:57,080 Let me show you another example. 848 00:36:57,080 --> 00:37:00,070 Oh, by the way, so the consequence of that is that 849 00:37:00,070 --> 00:37:01,720 automatically-derived labels-- 850 00:37:01,720 --> 00:37:04,125 for example, if you use one of those phenotyping 851 00:37:04,125 --> 00:37:05,960 algorithms I showed you earlier, the rules-- 852 00:37:08,770 --> 00:37:11,680 what the label is derived for over here 853 00:37:11,680 --> 00:37:13,960 might be very different from the label that's 854 00:37:13,960 --> 00:37:15,460 derived from over here, particularly 855 00:37:15,460 --> 00:37:18,880 if it's using data such as diagnosis codes that 856 00:37:18,880 --> 00:37:20,947 have changed in meaning over the years. 857 00:37:20,947 --> 00:37:22,030 So that's one consequence. 858 00:37:22,030 --> 00:37:24,762 There'll be other consequences I'll tell you about later. 859 00:37:24,762 --> 00:37:25,720 Here's another example. 860 00:37:25,720 --> 00:37:28,012 And by the way, this notion is called non-stationarity, 861 00:37:28,012 --> 00:37:30,080 that the data is changing across time. 862 00:37:30,080 --> 00:37:32,170 It's not stationary. 863 00:37:32,170 --> 00:37:34,650 Here's another example. 864 00:37:34,650 --> 00:37:38,490 On the x-axis again I'm showing you time. 865 00:37:38,490 --> 00:37:44,800 Here each column is a month, from 2005 to 2014. 866 00:37:44,800 --> 00:37:49,930 And on the y-axis, for every sort of row of this table, 867 00:37:49,930 --> 00:37:51,625 I'm showing you a laboratory test. 868 00:37:54,023 --> 00:37:56,440 And here we're not looking at the results of the lab test, 869 00:37:56,440 --> 00:37:59,080 we're only looking at what fraction 870 00:37:59,080 --> 00:38:02,110 of-- at how many lab tests of that type 871 00:38:02,110 --> 00:38:06,426 were performed at this point in time. 872 00:38:06,426 --> 00:38:10,510 And now you might expect that, broadly speaking, 873 00:38:10,510 --> 00:38:13,150 the number of glucose tests, the number of white blood cell 874 00:38:13,150 --> 00:38:21,040 count tests, the number of neutrophil tests and so on 875 00:38:21,040 --> 00:38:23,860 might be pretty constant across time, on average, 876 00:38:23,860 --> 00:38:26,200 because you're averaging over lots of people. 877 00:38:26,200 --> 00:38:29,090 But indeed what you see here is that, in fact, 878 00:38:29,090 --> 00:38:31,210 there is a huge amount of non-stationarity. 
879 00:38:31,210 --> 00:38:34,360 Which tests are ordered dramatically 880 00:38:34,360 --> 00:38:36,230 changes across time. 881 00:38:36,230 --> 00:38:39,310 So for example you see this one line over here, 882 00:38:39,310 --> 00:38:43,240 where it's all blue, meaning no one is ordering the test, 883 00:38:43,240 --> 00:38:46,360 until this point in time, when people start using it. 884 00:38:46,360 --> 00:38:47,550 What could that be? 885 00:38:47,550 --> 00:38:49,970 Any ideas? 886 00:38:49,970 --> 00:38:50,818 Yeah. 887 00:38:50,818 --> 00:38:54,067 AUDIENCE: [INAUDIBLE] 888 00:38:54,067 --> 00:38:56,650 PROFESSOR: So the test was used less, or really, in this case, 889 00:38:56,650 --> 00:38:57,320 not used at all. 890 00:38:57,320 --> 00:38:58,330 And then suddenly it was used. 891 00:38:58,330 --> 00:38:59,320 Why might that happen? 892 00:38:59,320 --> 00:38:59,940 In the back. 893 00:38:59,940 --> 00:39:01,690 AUDIENCE: A new test. 894 00:39:01,690 --> 00:39:05,090 PROFESSOR: A new test, right, because technology changes. 895 00:39:05,090 --> 00:39:07,660 Suddenly we come up with a new diagnostic test, a new lab 896 00:39:07,660 --> 00:39:08,770 test. 897 00:39:08,770 --> 00:39:11,177 And we can start using it, where it didn't exist before. 898 00:39:11,177 --> 00:39:13,010 So obviously there was no data on it before. 899 00:39:13,010 --> 00:39:17,014 What's another reason why it might have suddenly showed up? 900 00:39:17,014 --> 00:39:17,926 Yep. 901 00:39:17,926 --> 00:39:21,406 AUDIENCE: It could be like annual check-ups become 902 00:39:21,406 --> 00:39:26,510 mandatory, or that it's part of the test admission at hospital. 903 00:39:26,510 --> 00:39:28,800 Like, it's an additional test. 904 00:39:28,800 --> 00:39:31,020 PROFESSOR: I'll stick with your first example. 905 00:39:31,020 --> 00:39:33,420 Maybe that test becomes mandatory. 906 00:39:33,420 --> 00:39:35,880 OK, so maybe there's a clinical guideline 907 00:39:35,880 --> 00:39:41,490 that is created at this point in time, right there. 908 00:39:41,490 --> 00:39:44,490 And health insurers decide we're going 909 00:39:44,490 --> 00:39:47,647 to reimburse for this test at this point in time. 910 00:39:47,647 --> 00:39:49,480 And the test might've been really expensive. 911 00:39:49,480 --> 00:39:51,670 So no one would have done it beforehand. 912 00:39:51,670 --> 00:39:52,830 And now that the health insurance companies 913 00:39:52,830 --> 00:39:54,480 are going to pay for it, now people start doing it. 914 00:39:54,480 --> 00:39:56,190 So it might have existed beforehand. 915 00:39:56,190 --> 00:39:59,790 But if no one would pay for it, no one would use it. 916 00:39:59,790 --> 00:40:02,460 What's another reason why you might see something like this, 917 00:40:02,460 --> 00:40:03,762 or maybe even a gap like this? 918 00:40:03,762 --> 00:40:05,220 Notice, here in the middle, there's 919 00:40:05,220 --> 00:40:06,387 this huge gap in the middle. 920 00:40:06,387 --> 00:40:07,770 What might have explained that? 921 00:40:16,195 --> 00:40:17,070 AUDIENCE: [INAUDIBLE] 922 00:40:17,070 --> 00:40:17,862 PROFESSOR: Hold on. 923 00:40:17,862 --> 00:40:19,865 Yep, over here. 924 00:40:19,865 --> 00:40:21,490 AUDIENCE: Maybe your patient population 925 00:40:21,490 --> 00:40:25,206 is mostly of a certain age, and coverage for something 926 00:40:25,206 --> 00:40:28,870 changes once your age crosses a threshold. 
927 00:40:28,870 --> 00:40:30,540 PROFESSOR: Yeah, so one explanation-- 928 00:40:30,540 --> 00:40:32,610 I think it's not plausible in this data set, 929 00:40:32,610 --> 00:40:34,410 but it is plausible for some data sets-- 930 00:40:34,410 --> 00:40:40,380 is that maybe your patients at time 0 931 00:40:40,380 --> 00:40:42,860 were all of exactly the same age. 932 00:40:42,860 --> 00:40:44,610 So maybe there's some amount of alignment. 933 00:40:44,610 --> 00:40:49,740 And suddenly, at this point in time, let's say, 934 00:40:49,740 --> 00:40:52,492 women only get, let's say, their annual mammography 935 00:40:52,492 --> 00:40:53,700 once they turn a certain age. 936 00:40:53,700 --> 00:40:57,420 And so that might be one reason why you would see nothing 937 00:40:57,420 --> 00:40:58,720 until one point in time. 938 00:40:58,720 --> 00:41:00,720 And maybe that would change across time as well. 939 00:41:00,720 --> 00:41:03,838 Maybe they'll stop getting it at some point after menopause. 940 00:41:03,838 --> 00:41:05,130 That's not true, but let's say. 941 00:41:07,527 --> 00:41:08,610 So that's one explanation. 942 00:41:08,610 --> 00:41:10,110 In this case, it doesn't make sense, 943 00:41:10,110 --> 00:41:12,518 because the patient population is very mixed. 944 00:41:12,518 --> 00:41:15,060 So you could think about it as being roughly at steady state. 945 00:41:15,060 --> 00:41:18,060 So they're not-- you'll have patients of all ages here. 946 00:41:18,060 --> 00:41:19,280 What's another reason? 947 00:41:19,280 --> 00:41:20,990 Someone raised their hand over here. 948 00:41:20,990 --> 00:41:21,520 Yep. 949 00:41:21,520 --> 00:41:23,600 AUDIENCE: Yeah, I was just going to say, 950 00:41:23,600 --> 00:41:25,610 maybe the EMR shut down for a while, 951 00:41:25,610 --> 00:41:27,660 and so they were only doing stuff on paper, 952 00:41:27,660 --> 00:41:29,710 and they only were able to record 4 things. 953 00:41:29,710 --> 00:41:31,210 PROFESSOR: Ding ding ding ding ding. 954 00:41:31,210 --> 00:41:32,340 Yes, that's right. 955 00:41:32,340 --> 00:41:36,740 So maybe the EMR shut down. 956 00:41:36,740 --> 00:41:40,100 Or in this case, we had data issues. 957 00:41:40,100 --> 00:41:43,830 So this data was acquired somehow. 958 00:41:43,830 --> 00:41:45,930 For example, maybe it was acquired 959 00:41:45,930 --> 00:41:47,460 through a contract with something 960 00:41:47,460 --> 00:41:50,460 like Quest or LabCorp. 961 00:41:50,460 --> 00:41:54,510 And maybe, during that four-month interval, 962 00:41:54,510 --> 00:41:56,202 there was contract negotiation. 963 00:41:56,202 --> 00:41:57,660 And so suddenly we couldn't get the 964 00:41:57,660 --> 00:41:59,100 data for that time period. 965 00:41:59,100 --> 00:42:01,470 Or maybe our databases crashed, and we suddenly 966 00:42:01,470 --> 00:42:03,480 lost all the data for that time period. 967 00:42:03,480 --> 00:42:05,567 This happens, and this happens all the time, 968 00:42:05,567 --> 00:42:07,150 and not just in the health care industry, 969 00:42:07,150 --> 00:42:09,060 but in other industries as well. 970 00:42:09,060 --> 00:42:12,210 And as a result of those systemic-type changes, 971 00:42:12,210 --> 00:42:16,170 your data is also going to be non-stationary across time. 972 00:42:16,170 --> 00:42:18,420 So now we've seen three or four different explanations 973 00:42:18,420 --> 00:42:19,540 for why this happens. 974 00:42:19,540 --> 00:42:23,720 And the reality is really a mixture of all of these.
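One way to surface this kind of non-stationarity in practice is simply to count, per month, how often each lab test is ordered and flag tests whose ordering rate changes sharply. A minimal sketch, assuming a pandas DataFrame of lab orders with hypothetical columns test_name and order_date (not the actual schema used in the lecture's data):

```python
import pandas as pd

def monthly_order_rates(labs: pd.DataFrame) -> pd.DataFrame:
    """Fraction of all lab orders in each month accounted for by each test.
    Expects one row per lab order, with columns test_name and order_date."""
    month = pd.to_datetime(labs["order_date"]).dt.to_period("M")
    counts = labs.groupby([labs["test_name"], month]).size().unstack(fill_value=0)
    return counts.div(counts.sum(axis=0), axis=1)   # normalize within each month

def flag_shifted_tests(rates: pd.DataFrame, split: str, min_ratio: float = 5.0) -> pd.Series:
    """Flag tests whose average ordering rate differs by more than min_ratio
    between the months before and after `split` (e.g. "2012-01")."""
    cutoff = pd.Period(split, freq="M")
    eps = 1e-6   # avoid dividing by zero for tests absent in one period
    before = rates.loc[:, rates.columns < cutoff].mean(axis=1)
    after = rates.loc[:, rates.columns >= cutoff].mean(axis=1)
    ratio = (after + eps) / (before + eps)
    return ratio[(ratio > min_ratio) | (ratio < 1.0 / min_ratio)].sort_values()
```

Tests that flip from essentially zero to heavily used (new tests, new reimbursement rules) or that vanish for a stretch (data feed outages) show up immediately in the flagged list.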
975 00:42:23,720 --> 00:42:25,037 And just as in the previous-- 976 00:42:25,037 --> 00:42:27,120 so in the previous example, notice how what really 977 00:42:27,120 --> 00:42:29,010 changed here is that the derived labels might 978 00:42:29,010 --> 00:42:30,830 change meaning across time. 979 00:42:30,830 --> 00:42:34,930 Now the significance of the features 980 00:42:34,930 --> 00:42:36,690 used in the machine learning models 981 00:42:36,690 --> 00:42:38,048 would really change across time. 982 00:42:38,048 --> 00:42:39,840 And that's one of the consequences of this, 983 00:42:39,840 --> 00:42:44,090 particularly if you're deriving features from lab test values. 984 00:42:44,090 --> 00:42:47,790 Here's one last example. 985 00:42:47,790 --> 00:42:50,430 Again, on the x-axis here, I have time. 986 00:42:50,430 --> 00:42:53,460 On the y-axis here, I'm showing the number of times 987 00:42:53,460 --> 00:42:58,780 that you observed some diagnosis code of some kind. 988 00:42:58,780 --> 00:43:01,530 This cyan line is ICD-9 codes. 989 00:43:01,530 --> 00:43:05,090 And this red line is ICD-10 codes. 990 00:43:05,090 --> 00:43:07,590 You might remember that Pete mentioned in an earlier lecture 991 00:43:07,590 --> 00:43:11,340 that there was a big shift from ICD-9 coding to ICD-10 coding 992 00:43:11,340 --> 00:43:12,048 at some point. 993 00:43:12,048 --> 00:43:12,840 When was that time? 994 00:43:12,840 --> 00:43:15,212 It was precisely this time. 995 00:43:15,212 --> 00:43:17,670 And so if you think about the feature vector that you would 996 00:43:17,670 --> 00:43:20,010 derive for your machine learning problem, 997 00:43:20,010 --> 00:43:23,740 you would have one feature for all ICD-9 codes, and one-- 998 00:43:23,740 --> 00:43:26,190 a whole set of features for all ICD-10 codes. 999 00:43:26,190 --> 00:43:27,930 And those ICD-9-based features are 1000 00:43:27,930 --> 00:43:30,120 going to be-- they're going to be used quite a bit 1001 00:43:30,120 --> 00:43:31,000 in this time period. 1002 00:43:31,000 --> 00:43:33,000 And then suddenly they're going to be completely 1003 00:43:33,000 --> 00:43:34,690 sparse in this time period. 1004 00:43:34,690 --> 00:43:37,740 And ICD-10 features start to become used. 1005 00:43:37,740 --> 00:43:39,990 And you could imagine that if you did machine learning 1006 00:43:39,990 --> 00:43:44,340 using just ICD-9 data, and then you 1007 00:43:44,340 --> 00:43:47,173 tried to apply your model at this point in time, 1008 00:43:47,173 --> 00:43:49,590 it's going to do horribly, because it's expecting features 1009 00:43:49,590 --> 00:43:51,780 that it no longer has access to. 1010 00:43:51,780 --> 00:43:53,358 And this happens all the time. 1011 00:43:53,358 --> 00:43:54,900 And in fact, what I'm describing here 1012 00:43:54,900 --> 00:43:58,020 is actually a major problem for the whole health care industry. 1013 00:43:58,020 --> 00:43:59,407 For the next five years, everyone 1014 00:43:59,407 --> 00:44:00,990 is going to grapple with this problem, 1015 00:44:00,990 --> 00:44:03,240 because they want to use their historical data for machine 1016 00:44:03,240 --> 00:44:04,698 learning, but their historical data 1017 00:44:04,698 --> 00:44:08,270 is very different from their recent data. 1018 00:44:08,270 --> 00:44:13,390 So now, in the face of all of this non-stationarity that I 1019 00:44:13,390 --> 00:44:17,560 just described, did we do anything wrong in the diabetes 1020 00:44:17,560 --> 00:44:22,030 risk stratification problem that I told you about earlier?
1021 00:44:22,030 --> 00:44:22,530 Thoughts. 1022 00:44:25,050 --> 00:44:26,300 That was my paper, by the way. 1023 00:44:26,300 --> 00:44:29,000 Did I make an error? 1024 00:44:29,000 --> 00:44:29,500 Thoughts. 1025 00:44:36,990 --> 00:44:37,850 Don't be afraid. 1026 00:44:37,850 --> 00:44:38,940 I'm often wrong. 1027 00:44:45,960 --> 00:44:47,710 I'm just asking specifically about the way 1028 00:44:47,710 --> 00:44:48,835 I evaluated the models. 1029 00:44:51,200 --> 00:44:51,700 Yep. 1030 00:44:51,700 --> 00:44:54,551 AUDIENCE: This wasn't an error, but one thing, 1031 00:44:54,551 --> 00:44:56,920 like if I was a doctor I would like to see 1032 00:44:56,920 --> 00:44:59,054 is the sensitivity to-- 1033 00:44:59,054 --> 00:45:01,434 like, the inclusion criteria if I 1034 00:45:01,434 --> 00:45:04,710 remove the HBA1C for instance. 1035 00:45:04,710 --> 00:45:08,456 Like most people, they have compared to having either Rx 1036 00:45:08,456 --> 00:45:11,970 or [INAUDIBLE] then kind of evaluating the-- 1037 00:45:11,970 --> 00:45:13,720 PROFESSOR: So understanding the robustness 1038 00:45:13,720 --> 00:45:15,730 to changing the data a bit is something that 1039 00:45:15,730 --> 00:45:17,350 would be of a lot of interest. 1040 00:45:17,350 --> 00:45:18,460 I agree. 1041 00:45:18,460 --> 00:45:19,960 But that's not immediately suggested 1042 00:45:19,960 --> 00:45:21,720 by the non-stationarity results. 1043 00:45:21,720 --> 00:45:25,330 Not something that's suggested by non-stationarity results. 1044 00:45:25,330 --> 00:45:26,830 Our TA in the front row has an idea. 1045 00:45:26,830 --> 00:45:27,830 Yeah, let's hear it. 1046 00:45:27,830 --> 00:45:29,625 AUDIENCE: The train and test distributions 1047 00:45:29,625 --> 00:45:31,250 were drawn from the same-- or the train 1048 00:45:31,250 --> 00:45:33,503 and tests were drawn from the same distribution. 1049 00:45:33,503 --> 00:45:35,920 PROFESSOR: So in the way that we did our evaluation there, 1050 00:45:35,920 --> 00:45:42,760 we said, OK, we're going to set it up such that on January 1, 1051 00:45:42,760 --> 00:45:44,710 2009, we're predicting what's going to happen 1052 00:45:44,710 --> 00:45:47,350 in the following three years. 1053 00:45:47,350 --> 00:45:50,140 And we segmented our patient population 1054 00:45:50,140 --> 00:45:53,800 into train, validate, and test, but at all times, 1055 00:45:53,800 --> 00:46:00,040 using that same setup, January 1 2009, as the prediction time. 1056 00:46:00,040 --> 00:46:04,570 Now, we learned this model, and it's now 2018. 1057 00:46:04,570 --> 00:46:07,000 We want to apply this model today. 1058 00:46:07,000 --> 00:46:09,430 And I computed an area under the ROC curve. 1059 00:46:09,430 --> 00:46:11,650 I computed positive predictive values 1060 00:46:11,650 --> 00:46:13,690 using that retrospective data. 1061 00:46:13,690 --> 00:46:17,650 And I handed those off to my partners. 1062 00:46:17,650 --> 00:46:20,530 And they might hope that those numbers are reflective of what 1063 00:46:20,530 --> 00:46:23,390 their models would do today. 1064 00:46:23,390 --> 00:46:26,090 But because of these issues I just told you about-- 1065 00:46:26,090 --> 00:46:27,940 for example, that the number of people 1066 00:46:27,940 --> 00:46:30,232 who have type 2 diabetes, and even the definition of it 1067 00:46:30,232 --> 00:46:31,480 has changed. 1068 00:46:31,480 --> 00:46:33,550 Because of the fact that the laboratory-- ignore 1069 00:46:33,550 --> 00:46:34,180 this part over here. 
1070 00:46:34,180 --> 00:46:35,013 That's just a fluke. 1071 00:46:35,013 --> 00:46:36,940 But because the laboratory 1072 00:46:36,940 --> 00:46:38,860 tests that were available during training 1073 00:46:38,860 --> 00:46:41,940 might be different from the ones that are available now, 1074 00:46:41,940 --> 00:46:45,850 and because of the fact that we have only ICD-10 data now, 1075 00:46:45,850 --> 00:46:48,172 and not ICD-9, for all of those reasons, 1076 00:46:48,172 --> 00:46:49,630 our predictive performance is going 1077 00:46:49,630 --> 00:46:52,870 to be really horrible now, particularly 1078 00:46:52,870 --> 00:46:55,663 because of this last issue of not having ICD-9s. 1079 00:46:55,663 --> 00:46:57,580 Our predictive model is going to work horribly 1080 00:46:57,580 --> 00:47:02,170 now if it was trained on data from 2008 or 2009. 1081 00:47:02,170 --> 00:47:05,020 And so we would have never ever even recognized 1082 00:47:05,020 --> 00:47:07,840 that if we used the validation setup that we had done there. 1083 00:47:07,840 --> 00:47:12,107 So I wrote that paper when I was young and naive. 1084 00:47:12,107 --> 00:47:13,480 [AUDIENCE CHUCKLING] 1085 00:47:13,480 --> 00:47:16,540 I'm a little bit more gray-haired now. 1086 00:47:16,540 --> 00:47:18,640 And so in our more recent work-- for example, 1087 00:47:18,640 --> 00:47:22,510 this is a paper which we're working on right now, 1088 00:47:22,510 --> 00:47:24,670 done by a master's student of mine, Helen Zhou, 1089 00:47:24,670 --> 00:47:27,160 and is looking at predicting antibiotic resistance, 1090 00:47:27,160 --> 00:47:29,950 now we're a little bit smarter about our evaluation setup. 1091 00:47:29,950 --> 00:47:32,357 And we decided to set it up a little bit differently. 1092 00:47:32,357 --> 00:47:33,940 So what I'm showing you now is the way 1093 00:47:33,940 --> 00:47:35,650 that we chose train, validate, 1094 00:47:35,650 --> 00:47:38,960 and test for our population. 1095 00:47:38,960 --> 00:47:41,240 So we segmented our data. 1096 00:47:41,240 --> 00:47:47,230 So the x-axis here is time, and the y-axis here is people. 1097 00:47:47,230 --> 00:47:49,732 So you can think of each person as being a different row. 1098 00:47:49,732 --> 00:47:51,940 And you can imagine that we randomly sorted the rows. 1099 00:47:54,490 --> 00:47:59,150 What we did is we segmented our data into these four quadrants. 1100 00:47:59,150 --> 00:48:03,980 The first two quadrants, we used for train and validate. 1101 00:48:03,980 --> 00:48:09,910 Notice, by the way, that we have different people 1102 00:48:09,910 --> 00:48:12,498 in the training set than we do in the validate set. 1103 00:48:12,498 --> 00:48:14,290 That's important for another quantity which 1104 00:48:14,290 --> 00:48:16,010 I'll talk about in a minute. 1105 00:48:16,010 --> 00:48:18,040 So we used this data for train and validate. 1106 00:48:18,040 --> 00:48:19,870 And that's, again, very similar to the way 1107 00:48:19,870 --> 00:48:22,030 we did it in the diabetes paper. 1108 00:48:22,030 --> 00:48:26,356 But now, for testing, we use this future data. 1109 00:48:26,356 --> 00:48:29,287 So we used data from 2014 to 2016. 1110 00:48:29,287 --> 00:48:31,120 And one can imagine two different quadrants.
1111 00:48:31,120 --> 00:48:32,710 You might be interested in knowing, 1112 00:48:32,710 --> 00:48:35,260 for the same patients for whom you made predictions 1113 00:48:35,260 --> 00:48:40,030 on during training, how would your predictions do 1114 00:48:40,030 --> 00:48:44,743 for those same people at test time in the future data. 1115 00:48:44,743 --> 00:48:46,660 And that's assuming that what we're predicting 1116 00:48:46,660 --> 00:48:48,670 is something that's much more myopic in nature. 1117 00:48:48,670 --> 00:48:50,830 In this case it was predicting, are they 1118 00:48:50,830 --> 00:48:52,973 going to be resistant to some antibiotic? 1119 00:48:52,973 --> 00:48:55,390 But you can also look at it for a completely different set 1120 00:48:55,390 --> 00:48:57,190 of patients, for patients who are not 1121 00:48:57,190 --> 00:48:58,660 used during training at all. 1122 00:48:58,660 --> 00:49:02,680 And suppose that this 2 bucket isn't used at all, 1123 00:49:02,680 --> 00:49:04,630 for those patients, how do we do, again, 1124 00:49:04,630 --> 00:49:06,063 using the future data for that. 1125 00:49:06,063 --> 00:49:07,480 And the advantage of this setup is 1126 00:49:07,480 --> 00:49:10,900 that it can really help you assess non-stationarity. 1127 00:49:10,900 --> 00:49:14,050 So if your model really took advantage 1128 00:49:14,050 --> 00:49:17,860 of features that were available in 2007, 2008, 2009, 1129 00:49:17,860 --> 00:49:19,422 but weren't available in 2014, you 1130 00:49:19,422 --> 00:49:21,130 would see a big drop in your performance. 1131 00:49:21,130 --> 00:49:22,547 Looking at the drop in performance 1132 00:49:22,547 --> 00:49:24,550 from your validate set in this time period, 1133 00:49:24,550 --> 00:49:26,740 to your test set from that time period, 1134 00:49:26,740 --> 00:49:29,650 that drop in performance will be uniquely attributed 1135 00:49:29,650 --> 00:49:31,760 to the non-stationarity. 1136 00:49:31,760 --> 00:49:33,190 So it's a good way to diagnose it. 1137 00:49:33,190 --> 00:49:33,690 Yep. 1138 00:49:33,690 --> 00:49:35,065 AUDIENCE: Just some clarification 1139 00:49:35,065 --> 00:49:38,013 on non-stationarity-- is it the fact that certain data is just 1140 00:49:38,013 --> 00:49:39,430 lost altogether, or is it the fact 1141 00:49:39,430 --> 00:49:41,240 that it's just encoded differently, 1142 00:49:41,240 --> 00:49:43,698 and so then it's difficult to get that mapping correct? 1143 00:49:43,698 --> 00:49:44,365 PROFESSOR: Both. 1144 00:49:44,365 --> 00:49:45,790 Both of these happen. 1145 00:49:45,790 --> 00:49:47,980 So I have a big research program now 1146 00:49:47,980 --> 00:49:50,115 which is asking not just how-- 1147 00:49:50,115 --> 00:49:51,990 so this is how you can evaluate and recognize 1148 00:49:51,990 --> 00:49:52,510 there's a problem. 1149 00:49:52,510 --> 00:49:55,052 But of course there's a really interesting research question, 1150 00:49:55,052 --> 00:49:57,450 which is, how can you make use of the non-stationarity. 1151 00:49:57,450 --> 00:50:01,870 Right, so for example, you had ICD-9/ICD-10 data. 1152 00:50:01,870 --> 00:50:05,020 You don't want to just throw away the ICD-9 data. 1153 00:50:05,020 --> 00:50:06,640 Is there a way to use it? 1154 00:50:06,640 --> 00:50:09,510 So the naive answer, which is what the community is largely 1155 00:50:09,510 --> 00:50:12,990 using today, is come up with a mapping. 
1156 00:50:12,990 --> 00:50:15,700 Come up with a manual mapping from ICD-9 to ICD-10 1157 00:50:15,700 --> 00:50:18,870 so that you can sort of manually transform your data 1158 00:50:18,870 --> 00:50:20,820 into this new format such that the models you 1159 00:50:20,820 --> 00:50:24,300 learn from this older time is useful in the future time. 1160 00:50:24,300 --> 00:50:27,520 That's the boring and simple answer. 1161 00:50:27,520 --> 00:50:29,020 But I think we could do much better. 1162 00:50:29,020 --> 00:50:31,437 For example, we can learn new representations of the data. 1163 00:50:31,437 --> 00:50:33,780 We can learn that mapping directly 1164 00:50:33,780 --> 00:50:37,290 in order to optimize for your sort of most 1165 00:50:37,290 --> 00:50:38,082 recent performance. 1166 00:50:38,082 --> 00:50:40,582 And there's a whole bunch more that we can talk about later. 1167 00:50:40,582 --> 00:50:41,422 Yep. 1168 00:50:41,422 --> 00:50:44,040 AUDIENCE: [INAUDIBLE] non-stationary change, 1169 00:50:44,040 --> 00:50:49,970 this will [INAUDIBLE] does not ensure robustness 1170 00:50:49,970 --> 00:50:50,820 to the future. 1171 00:50:50,820 --> 00:50:51,970 PROFESSOR: Correct. 1172 00:50:51,970 --> 00:50:54,360 So this allows you to detect that a non-stationarity has 1173 00:50:54,360 --> 00:50:55,800 happened. 1174 00:50:55,800 --> 00:50:58,950 And it allows you to say that your model is going 1175 00:50:58,950 --> 00:51:00,202 to generalize to 2014-2016. 1176 00:51:00,202 --> 00:51:02,535 But of course, that doesn't mean that your model's going 1177 00:51:02,535 --> 00:51:06,397 to generalize to 2016-2018. 1178 00:51:06,397 --> 00:51:07,480 And so how do you do that? 1179 00:51:07,480 --> 00:51:08,310 How do you have confidence in that? 1180 00:51:08,310 --> 00:51:10,477 Well, that's a really interesting research question. 1181 00:51:10,477 --> 00:51:12,610 We don't have good answers to that today. 1182 00:51:12,610 --> 00:51:19,020 From a practical perspective, the best I can offer you today 1183 00:51:19,020 --> 00:51:22,590 is, build in these checks and balances all the time. 1184 00:51:22,590 --> 00:51:25,380 So continuously sort of evaluate how you're 1185 00:51:25,380 --> 00:51:26,780 doing on the most recent data. 1186 00:51:26,780 --> 00:51:30,150 And if you see big changes, throw a red flag. 1187 00:51:30,150 --> 00:51:33,510 Build more checks and balances into your deployment process. 1188 00:51:33,510 --> 00:51:35,790 If you see a bunch of patients who are getting 1189 00:51:35,790 --> 00:51:38,610 predicted probabilities of 1, and in the past, 1190 00:51:38,610 --> 00:51:40,110 you'd never predicted probability 1, 1191 00:51:40,110 --> 00:51:42,003 that might tell you something. 1192 00:51:42,003 --> 00:51:44,670 Then much later in the semester, we'll talk about robust machine 1193 00:51:44,670 --> 00:51:45,690 learning approaches, for example, 1194 00:51:45,690 --> 00:51:47,357 approaches that have been designed to be 1195 00:51:47,357 --> 00:51:49,290 robust against adversaries. 1196 00:51:49,290 --> 00:51:50,930 And those type of approaches as well 1197 00:51:50,930 --> 00:51:53,370 will allow you to be much more robust to particular types 1198 00:51:53,370 --> 00:51:55,410 of data set shift, of which non-stationarity 1199 00:51:55,410 --> 00:51:56,400 is one example. 1200 00:51:56,400 --> 00:51:58,400 But it's a big, open research field. 1201 00:51:58,400 --> 00:51:58,900 Yep. 
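To make the evaluation setup above concrete, here is a minimal sketch of the four-quadrant split (column names such as patient_id and prediction_date are assumptions for illustration, not the actual schema used in that work):

```python
import numpy as np
import pandas as pd

def temporal_split(df: pd.DataFrame, cutoff: str = "2014-01-01",
                   holdout_frac: float = 0.5, seed: int = 0):
    """Split (patients) x (time) into four quadrants:
    group A before the cutoff -> train;
    group B before the cutoff -> validate (different people, same era);
    group A after the cutoff  -> test on the same patients, future data;
    group B after the cutoff  -> test on held-out patients, future data.
    A drop from validate performance to either future test set points at
    non-stationarity / data set shift rather than overfitting to people."""
    rng = np.random.default_rng(seed)
    patients = df["patient_id"].unique()
    group_b = set(rng.choice(patients, size=int(holdout_frac * len(patients)),
                             replace=False))
    in_b = df["patient_id"].isin(group_b)
    in_future = pd.to_datetime(df["prediction_date"]) >= pd.Timestamp(cutoff)

    train = df[~in_b & ~in_future]
    validate = df[in_b & ~in_future]
    test_same_patients = df[~in_b & in_future]
    test_new_patients = df[in_b & in_future]
    return train, validate, test_same_patients, test_new_patients
```

Evaluating the same model on all four pieces separates "new people" effects from "new era" effects.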
1202 00:51:58,900 --> 00:52:01,610 AUDIENCE: So just to make sure I have the understanding correct, 1203 00:52:01,610 --> 00:52:03,360 theoretically, if you could map everything 1204 00:52:03,360 --> 00:52:07,500 from the old data set to the new data set, like the encodings, 1205 00:52:07,500 --> 00:52:09,456 would it still be OK, like the results 1206 00:52:09,456 --> 00:52:12,165 you get on the future data set? 1207 00:52:12,165 --> 00:52:14,040 PROFESSOR: If you could do a perfect mapping, 1208 00:52:14,040 --> 00:52:16,457 and it's one to one, and the distributions of those things 1209 00:52:16,457 --> 00:52:18,750 also didn't change, then yeah. 1210 00:52:18,750 --> 00:52:21,660 Really what you need to assess is, is there data set shift? 1211 00:52:21,660 --> 00:52:23,970 Is your training distribution, after mapping, 1212 00:52:23,970 --> 00:52:26,147 the same as your testing distribution? 1213 00:52:26,147 --> 00:52:27,730 If the answer is yes, you're all good. 1214 00:52:27,730 --> 00:52:29,110 If you're not, you're in trouble. 1215 00:52:29,110 --> 00:52:29,610 Yep. 1216 00:52:29,610 --> 00:52:32,068 AUDIENCE: What is the test set and the train set here? 1217 00:52:32,068 --> 00:52:35,010 Or what [INAUDIBLE]? 1218 00:52:35,010 --> 00:52:38,530 PROFESSOR: So 1 is using data only from 2007-2013, 1219 00:52:38,530 --> 00:52:40,950 3 is using data only from 2014-2016. 1220 00:52:40,950 --> 00:52:44,611 AUDIENCE: But in the case, like, the output we care about 1221 00:52:44,611 --> 00:52:47,016 happened in, like, 2007-2013, then 1222 00:52:47,016 --> 00:52:49,580 that observation would be not-- it wouldn't be useful. 1223 00:52:49,580 --> 00:52:51,570 PROFESSOR: Yeah, so for the diabetes problem, 1224 00:52:51,570 --> 00:52:54,090 there's also just inclusion/exclusion criteria 1225 00:52:54,090 --> 00:52:55,310 that you have to deal with. 1226 00:52:55,310 --> 00:52:57,727 For what I'm showing you here, I'm talking about a setting 1227 00:52:57,727 --> 00:53:00,840 where you might be making multiple predictions 1228 00:53:00,840 --> 00:53:02,230 for patients across time. 1229 00:53:02,230 --> 00:53:04,338 So it's a much more myopic prediction task. 1230 00:53:04,338 --> 00:53:05,880 But one could come up with an analogy 1231 00:53:05,880 --> 00:53:07,720 to this for the diabetes setting. 1232 00:53:07,720 --> 00:53:15,000 Like, for example, just hold out half of the patients at random. 1233 00:53:15,000 --> 00:53:21,290 And then for your training set, use data up to 2009, 1234 00:53:21,290 --> 00:53:23,760 and evaluate on data only up to 2013. 1235 00:53:23,760 --> 00:53:30,610 And for your test set, pretend as if it was January 1, 2013, 1236 00:53:30,610 --> 00:53:35,390 and look at performance up to 2017. 1237 00:53:35,390 --> 00:53:36,600 And so that would be-- 1238 00:53:36,600 --> 00:53:39,510 you're changing your prediction time to use more recent data. 1239 00:53:43,330 --> 00:53:47,727 So the next subtlety is-- 1240 00:53:47,727 --> 00:53:49,060 it's a name that I put on to it. 1241 00:53:49,060 --> 00:53:50,220 This isn't a standard name. 1242 00:53:50,220 --> 00:53:53,200 This is what I'm calling intervention-tainted outcomes. 1243 00:53:56,130 --> 00:54:01,210 And so the example here came from your reading for today. 1244 00:54:01,210 --> 00:54:03,772 The reading was this paper on intelligible models 1245 00:54:03,772 --> 00:54:05,980 for health care, predicting pneumonia risk and hospital 1246 00:54:05,980 --> 00:54:08,350 30-day readmission, from KDD 2015.
1247 00:54:08,350 --> 00:54:10,040 So in that paper, they give an example-- 1248 00:54:10,040 --> 00:54:12,070 it's a very old example-- 1249 00:54:12,070 --> 00:54:13,840 of trying to use a predictive model 1250 00:54:13,840 --> 00:54:17,920 to understand a patient's risk of mortality 1251 00:54:17,920 --> 00:54:21,100 when they come into the hospital. 1252 00:54:21,100 --> 00:54:24,010 And what they learned-- and they used a rule-based learning 1253 00:54:24,010 --> 00:54:25,510 algorithm-- and what they discovered 1254 00:54:25,510 --> 00:54:29,740 was a rule that said if the patient has asthma, 1255 00:54:29,740 --> 00:54:33,445 then they have low risk of dying. 1256 00:54:33,445 --> 00:54:35,320 So these are all patients who have pneumonia. 1257 00:54:35,320 --> 00:54:38,140 So a patient who comes in with pneumonia and asthma 1258 00:54:38,140 --> 00:54:40,270 has a lower risk of dying than a patient who 1259 00:54:40,270 --> 00:54:45,400 comes in with pneumonia and does not have a history of asthma. 1260 00:54:45,400 --> 00:54:47,830 OK, that's what this rule says. 1261 00:54:47,830 --> 00:54:51,550 And this paper argued that there's something 1262 00:54:51,550 --> 00:54:54,440 wrong with that learned model. 1263 00:54:54,440 --> 00:54:56,110 Any of you remember what that was? 1264 00:54:56,110 --> 00:54:58,390 Someone who hasn't talked today, please. 1265 00:54:58,390 --> 00:54:59,250 Yeah, in the back. 1266 00:54:59,250 --> 00:55:00,875 AUDIENCE: It was that those with asthma 1267 00:55:00,875 --> 00:55:02,204 had more aggressive treatment. 1268 00:55:02,204 --> 00:55:04,930 So that means that they had a higher chance of survival. 1269 00:55:04,930 --> 00:55:07,540 PROFESSOR: Patients with asthma had more aggressive treatment. 1270 00:55:07,540 --> 00:55:08,998 In particular, they might have been 1271 00:55:08,998 --> 00:55:10,600 admitted to the intensive care unit 1272 00:55:10,600 --> 00:55:13,080 for more careful vigilance. 1273 00:55:13,080 --> 00:55:14,830 And as a result, they had better outcomes. 1274 00:55:14,830 --> 00:55:17,080 Yes, that's exactly right. 1275 00:55:17,080 --> 00:55:21,370 So the real story behind this is that risk stratification, 1276 00:55:21,370 --> 00:55:23,140 as we talked about the last couple weeks, 1277 00:55:23,140 --> 00:55:25,180 it's used to drive interventions. 1278 00:55:25,180 --> 00:55:28,360 And those interventions, if they happened in the past data, 1279 00:55:28,360 --> 00:55:30,350 would change the outcomes. 1280 00:55:30,350 --> 00:55:33,550 So in this case, you might imagine 1281 00:55:33,550 --> 00:55:35,530 using the learned predictive model to say, 1282 00:55:35,530 --> 00:55:38,218 a new patient comes in, this new patient has asthma, 1283 00:55:38,218 --> 00:55:40,010 and so we're going to say they're low risk. 1284 00:55:40,010 --> 00:55:42,340 And if we took a naive action based on that prediction, 1285 00:55:42,340 --> 00:55:44,800 we might say, OK, let's send them home. 1286 00:55:44,800 --> 00:55:46,742 They're at low risk of dying. 1287 00:55:46,742 --> 00:55:48,700 But if we did that, we could be killing people. 1288 00:55:48,700 --> 00:55:50,710 Because the reason why they were low 1289 00:55:50,710 --> 00:55:53,950 risk is because they had those interventions in the past. 1290 00:55:56,650 --> 00:55:59,800 So here's what's going on in that picture. 1291 00:55:59,800 --> 00:56:02,028 You have your data, X. 
And you're 1292 00:56:02,028 --> 00:56:04,570 trying to make a prediction at some point in time, let's say, 1293 00:56:04,570 --> 00:56:06,070 emergency department triage. 1294 00:56:06,070 --> 00:56:07,630 You want to predict some outcome Y, 1295 00:56:07,630 --> 00:56:10,480 let's say, whether the patient dies at some defined point 1296 00:56:10,480 --> 00:56:12,710 in the future. 1297 00:56:12,710 --> 00:56:16,960 Now, the challenge is that, as stated in the machine learning 1298 00:56:16,960 --> 00:56:19,940 tasks that you saw there, all you had access to 1299 00:56:19,940 --> 00:56:25,420 was X and Y, the covariates-- the features-- and the outcome. 1300 00:56:25,420 --> 00:56:28,150 And so you're predicting Y from X, 1301 00:56:28,150 --> 00:56:30,670 but you're marginalizing over everything 1302 00:56:30,670 --> 00:56:33,490 that happens in between, in this case, the treatment. 1303 00:56:33,490 --> 00:56:36,777 So the good outcomes, people surviving, 1304 00:56:36,777 --> 00:56:38,860 might have been due to what's going on in between. 1305 00:56:38,860 --> 00:56:40,402 But what's going on in between is not 1306 00:56:40,402 --> 00:56:43,780 even observed in the data necessarily. 1307 00:56:43,780 --> 00:56:46,202 So how do we address this problem? 1308 00:56:46,202 --> 00:56:48,160 Well, the first thing I want you to think about 1309 00:56:48,160 --> 00:56:51,030 is, can we even recognize that this is a problem? 1310 00:56:51,030 --> 00:56:53,260 And that's where that article really 1311 00:56:53,260 --> 00:56:55,630 suggests that using an intelligible model, a model 1312 00:56:55,630 --> 00:56:58,510 that you can introspect and try to understand a little bit, 1313 00:56:58,510 --> 00:57:01,270 is actually really important for even recognizing 1314 00:57:01,270 --> 00:57:04,400 that weird things are happening. 1315 00:57:04,400 --> 00:57:05,860 And this is a topic which we will 1316 00:57:05,860 --> 00:57:08,570 talk about in a lecture towards the end of the semester in much 1317 00:57:08,570 --> 00:57:09,070 more-- 1318 00:57:09,070 --> 00:57:11,200 Jack will talk about algorithms for interpreting 1319 00:57:11,200 --> 00:57:13,247 machine learning models. 1320 00:57:13,247 --> 00:57:14,080 So that's important. 1321 00:57:14,080 --> 00:57:16,090 You've got to recognize what's going on. 1322 00:57:16,090 --> 00:57:17,780 But what do you do about it? 1323 00:57:17,780 --> 00:57:20,820 So here are some hacks. 1324 00:57:20,820 --> 00:57:23,390 Hack number 1-- modify the model. 1325 00:57:23,390 --> 00:57:26,120 This is the solution that is proposed in the paper you read. 1326 00:57:26,120 --> 00:57:29,740 They said, OK, if it's a simple rule-based prediction 1327 00:57:29,740 --> 00:57:32,360 that the learning algorithm outputs to you, 1328 00:57:32,360 --> 00:57:35,180 you could see the rule that doesn't make sense, 1329 00:57:35,180 --> 00:57:36,800 you could use your clinical insight 1330 00:57:36,800 --> 00:57:37,850 to recognize it doesn't make sense. 1331 00:57:37,850 --> 00:57:39,933 You might even be able to explain why it happened. 1332 00:57:39,933 --> 00:57:41,780 And then you just remove that rule. 1333 00:57:41,780 --> 00:57:47,570 So you manually modify the model to push it towards something 1334 00:57:47,570 --> 00:57:48,883 that's more sensible. 1335 00:57:48,883 --> 00:57:50,550 All right, so that's what was suggested. 1336 00:57:50,550 --> 00:57:52,020 And I think it's nonsense.
1337 00:57:52,020 --> 00:57:56,060 I don't think that's ever going to work in today's world. 1338 00:57:56,060 --> 00:57:58,940 In today's world of high-dimensional models, 1339 00:57:58,940 --> 00:58:01,915 there's always going to be surrogates which are somehow 1340 00:58:01,915 --> 00:58:03,290 picked up by a learning algorithm 1341 00:58:03,290 --> 00:58:05,510 that you will not even recognize. 1342 00:58:05,510 --> 00:58:07,910 And it will be really hard to modify it in the way 1343 00:58:07,910 --> 00:58:09,040 that you want. 1344 00:58:09,040 --> 00:58:11,540 Maybe it's impossible using the simple approach, by the way. 1345 00:58:11,540 --> 00:58:12,920 Another interesting research question-- 1346 00:58:12,920 --> 00:58:14,480 how do you actually make this work 1347 00:58:14,480 --> 00:58:16,218 in a high-dimensional setting? 1348 00:58:16,218 --> 00:58:18,260 But for now, let's say we don't know how to do it 1349 00:58:18,260 --> 00:58:19,080 in a high-dimensional setting. 1350 00:58:19,080 --> 00:58:20,480 So what are your other choices? 1351 00:58:20,480 --> 00:58:24,080 Hack number 2 is to redefine the outcome altogether, 1352 00:58:24,080 --> 00:58:26,180 to change what you're predicting. 1353 00:58:26,180 --> 00:58:29,570 So for example, if you go back to this picture, 1354 00:58:29,570 --> 00:58:31,490 and instead of trying to predict Y, 1355 00:58:31,490 --> 00:58:34,490 death, if you could try to find some surrogate for the thing 1356 00:58:34,490 --> 00:58:37,410 you care about, which is pre-treatment, 1357 00:58:37,410 --> 00:58:40,160 and you predict that thing instead, 1358 00:58:40,160 --> 00:58:43,070 then you'll be back in business. 1359 00:58:43,070 --> 00:58:46,215 And so, for example, in one of the optional readings for-- 1360 00:58:46,215 --> 00:58:49,310 or actually I think in the second required reading 1361 00:58:49,310 --> 00:58:51,380 for today's class, it was a paper 1362 00:58:51,380 --> 00:58:53,990 about risk stratification for sepsis, which 1363 00:58:53,990 --> 00:58:56,850 is often caused by infection. 1364 00:58:56,850 --> 00:58:58,640 And what they show in that article 1365 00:58:58,640 --> 00:59:01,850 is that there are laboratory test results, such as lactate, 1366 00:59:01,850 --> 00:59:03,980 and there are others, which can give you 1367 00:59:03,980 --> 00:59:06,500 a hint that this patient might be on a path 1368 00:59:06,500 --> 00:59:08,960 to clinical deterioration. 1369 00:59:08,960 --> 00:59:12,590 And that test might precede the interventions to try 1370 00:59:12,590 --> 00:59:15,140 to take care of that condition. 1371 00:59:15,140 --> 00:59:17,720 And so if you instead change your outcome 1372 00:59:17,720 --> 00:59:21,230 to be predicting that surrogate, then you're 1373 00:59:21,230 --> 00:59:26,470 getting around this problem that I just pointed out. 1374 00:59:26,470 --> 00:59:31,450 Now, a third hack is from one of the optional readings 1375 00:59:31,450 --> 00:59:33,170 from today's lecture, this paper by Suchi 1376 00:59:33,170 --> 00:59:35,380 Saria and her students, from Science Translational Medicine 1377 00:59:35,380 --> 00:59:36,080 2015. 1378 00:59:36,080 --> 00:59:37,455 It's a really well-written paper. 1379 00:59:37,455 --> 00:59:38,960 I highly recommend reading it.
1380 00:59:38,960 --> 00:59:42,370 In that paper, they suggest formalizing the problem 1381 00:59:42,370 --> 00:59:43,990 as one of censoring, which is what 1382 00:59:43,990 --> 00:59:46,365 we'll be talking about for the very last third of today's 1383 00:59:46,365 --> 00:59:47,110 lecture. 1384 00:59:47,110 --> 00:59:50,830 In particular, what they say is suppose 1385 00:59:50,830 --> 00:59:53,210 you see that a patient is treated for the condition. 1386 00:59:53,210 --> 00:59:56,620 Let's say they're treated for sepsis. 1387 00:59:56,620 --> 00:59:58,810 Then if the patient is treated for that condition, 1388 00:59:58,810 --> 01:00:01,390 then we don't know what would have happened to them had they 1389 01:00:01,390 --> 01:00:02,570 not been treated. 1390 01:00:02,570 --> 01:00:07,990 So we don't observe the outcome, death given no treatment. 1391 01:00:07,990 --> 01:00:11,070 And so we're going to treat it as an unknown outcome. 1392 01:00:11,070 --> 01:00:14,500 And for patients who were not treated, but ended up 1393 01:00:14,500 --> 01:00:17,462 dying due to sepsis, then they're not censored. 1394 01:00:17,462 --> 01:00:19,670 And what I'll show you in the later part of the class 1395 01:00:19,670 --> 01:00:21,390 is how to learn from censored data. 1396 01:00:21,390 --> 01:00:23,620 So this is another formalization which 1397 01:00:23,620 --> 01:00:27,170 tries to address this problem that we pointed out. 1398 01:00:27,170 --> 01:00:29,740 Now, I call these hacks because, really, I 1399 01:00:29,740 --> 01:00:32,320 think what we should be doing is formalizing it using 1400 01:00:32,320 --> 01:00:35,200 the language of causality. 1401 01:00:35,200 --> 01:00:36,820 Once you do this introspection and you 1402 01:00:36,820 --> 01:00:39,290 realize that there is treatment, in fact, 1403 01:00:39,290 --> 01:00:41,350 you should be rethinking about the problem as one 1404 01:00:41,350 --> 01:00:43,777 of now having three quantities of interest. 1405 01:00:43,777 --> 01:00:46,360 There's the patient, everything you know about them at triage. 1406 01:00:46,360 --> 01:00:48,430 That's the X-variable I showed you before. 1407 01:00:48,430 --> 01:00:50,440 There's the outcome, let's say, Y. 1408 01:00:50,440 --> 01:00:52,023 And then there's that everything that 1409 01:00:52,023 --> 01:00:54,190 happened in between, in particular the interventions 1410 01:00:54,190 --> 01:00:55,270 that happened in between. 1411 01:00:55,270 --> 01:00:58,120 We'll call that T, for treatment. 1412 01:00:58,120 --> 01:01:00,850 And the question that one would like 1413 01:01:00,850 --> 01:01:04,030 to ask in order to figure out how to optimally care 1414 01:01:04,030 --> 01:01:08,440 for the patient is one of, will admission to the ICU, 1415 01:01:08,440 --> 01:01:10,690 which is the intervention that we're considering here, 1416 01:01:10,690 --> 01:01:15,550 will that lower the likelihood of death for the patient? 1417 01:01:15,550 --> 01:01:18,610 And now when I say lower, I don't mean correlation, 1418 01:01:18,610 --> 01:01:19,660 I mean causation. 1419 01:01:19,660 --> 01:01:23,620 Will it actually lower the patient's risk of dying? 1420 01:01:23,620 --> 01:01:25,900 I think we need to hit these questions on the head 1421 01:01:25,900 --> 01:01:28,990 with actually thinking about causality to try 1422 01:01:28,990 --> 01:01:30,580 to formalize this properly. 
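Stated slightly more formally (a sketch in potential-outcomes / do-notation, which is not how it was written on the slide), the decision-relevant quantity for a patient with triage covariates X = x is the causal effect of the treatment T (say, admission to the ICU) on the outcome Y (death):

\[
\mathbb{E}\big[\,Y \mid \mathrm{do}(T=1),\; X=x\,\big] \;-\; \mathbb{E}\big[\,Y \mid \mathrm{do}(T=0),\; X=x\,\big],
\]

whereas the classifier described above estimates only the associational quantity \(\mathbb{E}[Y \mid X=x]\), marginalizing over whatever treatments happened to be given in the historical data.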
1423 01:01:30,580 --> 01:01:32,770 And if you do that, this will be a solution 1424 01:01:32,770 --> 01:01:35,110 which will generalize to the high-dimensional settings 1425 01:01:35,110 --> 01:01:37,450 that we care about in machine learning. 1426 01:01:37,450 --> 01:01:40,870 And this will be a topic that we'll talk about really in-depth 1427 01:01:40,870 --> 01:01:41,960 after spring break. 1428 01:01:41,960 --> 01:01:44,447 But I wanted to give you this as one motivation for why 1429 01:01:44,447 --> 01:01:46,530 it's so important-- there are many other reasons-- 1430 01:01:46,530 --> 01:01:50,700 to really think about it from a causal perspective. 1431 01:01:50,700 --> 01:01:55,570 OK, so subtlety number 3-- 1432 01:01:55,570 --> 01:01:58,510 there's been a ton of hype in the media about deep learning 1433 01:01:58,510 --> 01:01:59,590 and health care. 1434 01:01:59,590 --> 01:02:01,570 A lot of it is very well warranted. 1435 01:02:01,570 --> 01:02:03,340 For example, the advances we're seeing 1436 01:02:03,340 --> 01:02:07,390 in areas ranging from radiology and pathology 1437 01:02:07,390 --> 01:02:12,970 to interpretation of EKGs are all really 1438 01:02:12,970 --> 01:02:16,187 being transformed by deep learning algorithms. 1439 01:02:16,187 --> 01:02:17,770 But the problems I've been telling you 1440 01:02:17,770 --> 01:02:20,110 about for the last couple of weeks, 1441 01:02:20,110 --> 01:02:23,180 of doing risk stratification on electronic health record data, 1442 01:02:23,180 --> 01:02:26,920 such as text notes, such as lab test 1443 01:02:26,920 --> 01:02:32,230 results and vital signs, diagnosis codes, that's 1444 01:02:32,230 --> 01:02:33,110 a different story. 1445 01:02:33,110 --> 01:02:35,735 And in fact, if you look closely at all of the papers, 1446 01:02:35,735 --> 01:02:37,360 all the papers that have been published 1447 01:02:37,360 --> 01:02:40,058 in the last few years that have been trying 1448 01:02:40,058 --> 01:02:42,100 to apply the gamut of deep learning algorithms 1449 01:02:42,100 --> 01:02:46,923 to those problems, in fact, the gains are very small. 1450 01:02:46,923 --> 01:02:49,090 And so what I'm showing you here is just one example 1451 01:02:49,090 --> 01:02:50,210 of such a paper. 1452 01:02:50,210 --> 01:02:52,510 This is a paper that received a lot of media attention. 1453 01:02:52,510 --> 01:02:54,852 It's a Google paper called "Scalable 1454 01:02:54,852 --> 01:02:57,310 and Accurate Deep Learning with Electronic Health Records." 1455 01:02:57,310 --> 01:02:59,230 And if you go across the United States, 1456 01:02:59,230 --> 01:03:00,700 if you go internationally, you talk 1457 01:03:00,700 --> 01:03:02,610 to chief medical information officers, 1458 01:03:02,610 --> 01:03:04,120 they're all going to be telling you about this paper. 1459 01:03:04,120 --> 01:03:06,120 They've all read it, they've all heard about it, 1460 01:03:06,120 --> 01:03:08,217 and they all want to use it. 1461 01:03:08,217 --> 01:03:09,550 But what is this actually doing? 1462 01:03:09,550 --> 01:03:11,030 What's going on behind the scenes? 1463 01:03:11,030 --> 01:03:14,230 Well, this paper uses the same sorts 1464 01:03:14,230 --> 01:03:15,970 of data we've been talking about. 1465 01:03:15,970 --> 01:03:19,530 It takes vitals, notes, orders, medications, 1466 01:03:19,530 --> 01:03:22,417 thinks about it as a timeline, summarizes it, then 1467 01:03:22,417 --> 01:03:23,750 uses a recurrent neural network.
1468 01:03:23,750 --> 01:03:25,870 It also uses attentional architectures. 1469 01:03:25,870 --> 01:03:28,046 And there are some pretty smart people on this paper-- 1470 01:03:28,046 --> 01:03:30,670 you know, Greg Corrado, Jeff Dean, 1471 01:03:30,670 --> 01:03:33,137 are all co-authors of this paper. 1472 01:03:33,137 --> 01:03:34,345 They know what they're doing. 1473 01:03:34,345 --> 01:03:36,580 All right, so they use these algorithms to predict 1474 01:03:36,580 --> 01:03:39,808 a number of downstream problems-- readmission risk, 1475 01:03:39,808 --> 01:03:41,350 for example, 30-day readmission, like 1476 01:03:41,350 --> 01:03:44,710 you read about in your readings for this week. 1477 01:03:44,710 --> 01:03:49,150 And they see they get pretty good predictions. 1478 01:03:49,150 --> 01:03:53,513 But if you go to the supplementary material, which 1479 01:03:53,513 --> 01:03:55,930 is a bit hard to find, but here's the link for all of you, 1480 01:03:55,930 --> 01:03:58,390 and I'll post it to my slides. 1481 01:03:58,390 --> 01:04:00,790 And if you look at the very last figure 1482 01:04:00,790 --> 01:04:02,740 in that supplementary material, you'll 1483 01:04:02,740 --> 01:04:04,670 see something interesting. 1484 01:04:04,670 --> 01:04:06,490 So here are those three different tasks 1485 01:04:06,490 --> 01:04:08,115 that they studied-- inpatient mortality 1486 01:04:08,115 --> 01:04:11,720 prediction, 30-day readmission, length-of-stay prediction. 1487 01:04:11,720 --> 01:04:13,240 The first line in each of these buckets 1488 01:04:13,240 --> 01:04:16,330 is what your deep learning algorithm does. 1489 01:04:16,330 --> 01:04:18,230 Over here, they have two different hospitals. 1490 01:04:18,230 --> 01:04:19,772 I think it might have been University 1491 01:04:19,772 --> 01:04:21,700 of Chicago and Stanford. 1492 01:04:21,700 --> 01:04:24,855 And they're showing the area under the ROC curve, which 1493 01:04:24,855 --> 01:04:27,550 we've talked about, performance for each 1494 01:04:27,550 --> 01:04:29,997 of these tasks for their best models. 1495 01:04:29,997 --> 01:04:32,330 And in the parentheses, they give confidence intervals-- 1496 01:04:32,330 --> 01:04:34,850 let's say something like 95% confidence intervals-- for area 1497 01:04:34,850 --> 01:04:36,640 under the ROC curve. 1498 01:04:36,640 --> 01:04:38,560 Now, the second line that you see 1499 01:04:38,560 --> 01:04:42,900 is called full-feature enhanced baseline. 1500 01:04:42,900 --> 01:04:44,890 It's using the same data, but it's 1501 01:04:44,890 --> 01:04:48,190 using something very close to the feature representation 1502 01:04:48,190 --> 01:04:50,530 that you saw in the paper by Narges Razavian, 1503 01:04:50,530 --> 01:04:52,030 so that paper on diabetes prediction 1504 01:04:52,030 --> 01:04:54,430 that I told you about and we've been criticizing. 1505 01:04:54,430 --> 01:04:56,470 So it's using that L1-regularized logistic 1506 01:04:56,470 --> 01:05:00,400 regression with a smart set of features. 1507 01:05:00,400 --> 01:05:04,210 And what you see across all three settings 1508 01:05:04,210 --> 01:05:07,090 is that the results are not statistically significantly 1509 01:05:07,090 --> 01:05:09,460 different. 1510 01:05:09,460 --> 01:05:12,700 So let's look at the first one, hospital A, deep learning, 1511 01:05:12,700 --> 01:05:14,920 0.95 AUC. 1512 01:05:14,920 --> 01:05:18,400 This L1-regularized logistic regression, 0.93. 1513 01:05:18,400 --> 01:05:22,570 30-day readmission, 0.77 versus 0.75; length-of-stay, 0.86 versus 0.85.
1514 01:05:22,570 --> 01:05:26,730 And the confidence intervals are all overlapping. 1515 01:05:26,730 --> 01:05:30,988 So what's going on? 1516 01:05:30,988 --> 01:05:33,030 So I think what you're seeing here, first of all, 1517 01:05:33,030 --> 01:05:37,680 is a recognition by the machine learning community that-- 1518 01:05:37,680 --> 01:05:40,110 in this case, a late recognition that simpler approaches 1519 01:05:40,110 --> 01:05:41,940 tend to work well with this type of data. 1520 01:05:41,940 --> 01:05:43,740 I don't think this was the first thing that they tried. 1521 01:05:43,740 --> 01:05:46,032 They probably tried the deep learning algorithms first. 1522 01:05:49,200 --> 01:05:51,150 Second, we're all grasping at this, 1523 01:05:51,150 --> 01:05:53,910 and we all want to come up with these better algorithms, 1524 01:05:53,910 --> 01:05:57,330 but so far we're not doing that well. 1525 01:05:57,330 --> 01:05:59,802 And I'll tell you more about that in just a second. 1526 01:05:59,802 --> 01:06:01,260 But before I finish with the slide, 1527 01:06:01,260 --> 01:06:04,247 I want to give you a punch line I think is really important. 1528 01:06:04,247 --> 01:06:05,830 You might come home from this and say, 1529 01:06:05,830 --> 01:06:07,260 you know what, it's not that much better, 1530 01:06:07,260 --> 01:06:08,510 but it's a little bit better-- 1531 01:06:08,510 --> 01:06:09,900 0.95 to 0.93. 1532 01:06:09,900 --> 01:06:12,030 Suppose those were tight confidence intervals, 1533 01:06:12,030 --> 01:06:13,738 there might be a few patients whose lives 1534 01:06:13,738 --> 01:06:15,200 you could save with that. 1535 01:06:15,200 --> 01:06:18,120 But because of all the issues I've told you about up until now, 1536 01:06:18,120 --> 01:06:22,440 non-stationarity, for example, those gains disappear. 1537 01:06:22,440 --> 01:06:25,770 In many cases, they even reverse when you actually 1538 01:06:25,770 --> 01:06:28,850 go to deploy these models, because of that data set shift 1539 01:06:28,850 --> 01:06:30,000 or non-stationarity. 1540 01:06:30,000 --> 01:06:31,920 It so happens that the simpler models 1541 01:06:31,920 --> 01:06:35,590 tend to generalize better when your data changes on you. 1542 01:06:35,590 --> 01:06:37,920 And this is nicely explored in this paper 1543 01:06:37,920 --> 01:06:41,730 from Kenneth Jung and Nigam Shah in the Journal of Biomedical 1544 01:06:41,730 --> 01:06:44,040 Informatics, 2015. 1545 01:06:44,040 --> 01:06:46,420 So this is something that I want you to think about. 1546 01:06:46,420 --> 01:06:48,540 Now let's try to answer why. 1547 01:06:48,540 --> 01:06:50,610 Well, the areas where we've been seeing 1548 01:06:50,610 --> 01:06:52,560 recurrent neural networks doing really well-- 1549 01:06:52,560 --> 01:06:54,960 in, for example, speech recognition, 1550 01:06:54,960 --> 01:06:59,742 natural language processing, are areas where, often-- 1551 01:06:59,742 --> 01:07:01,200 for example, you're predicting what 1552 01:07:01,200 --> 01:07:02,880 is the next word in a sequence of words, 1553 01:07:02,880 --> 01:07:05,760 the previous few words are pretty predictive. 1554 01:07:05,760 --> 01:07:08,250 Like, what is the next [PAUSES] that I'm going to say? 1555 01:07:08,250 --> 01:07:08,780 What is it? 1556 01:07:08,780 --> 01:07:09,630 AUDIENCE: Word. 1557 01:07:09,630 --> 01:07:11,130 PROFESSOR: Word, right, and you knew 1558 01:07:11,130 --> 01:07:15,225 that, right, because it was pretty obvious to predict that.
And so the fact that a model is good at predicting for that type of data doesn't mean it should be good at predicting for a different type of sequential data -- sequential data which, by the way, lives on many different time scales. For patients who are hospitalized, you get tons of data at once, and then you might go months without any data on them. Data with lots of missingness. Data with multivariate observations at each point in time, not just a single word at that point in time. So it's a different setting, and we shouldn't expect that the same architectures that were developed for other problems will generalize immediately to these problems.

Now, I do conjecture that there are lots of nonlinear interactions that deep neural networks could be very powerful at capturing. But I think they're subtle. And I don't think we currently have enough data to deal with the fact that the data is messy and that the nonlinear interactions are subtle. We just can't find them right now. That doesn't mean we won't find them a few years from now; I think this deservedly is a very interesting research direction to work on.

A final reason to point out is that the features going into these types of models are actually really cleverly chosen. A laboratory test result, like your A1C -- what is A1C? It's something that was developed over decades and decades of research, where people recognized that looking at a particular protein is actually informative about a patient's health. So the features that go into these models were designed, first, for humans to look at, and second, to really help with decision-making -- and they are largely independent of the other information that you have about a patient.
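One common way to turn the kind of irregular, multi-time-scale record described above into the fixed-length vectors that a baseline like the earlier logistic regression consumes is to count events in backward-looking time windows ending at the prediction date. The sketch below is hypothetical -- the event codes, window lengths, and toy records are illustrative assumptions, not the construction used in any of the papers discussed.

```python
# Sketch: aggregate an irregular, timestamped event stream into
# fixed-length count features over backward-looking windows.
# Event codes, window choices, and the toy records are hypothetical.
from collections import defaultdict
from datetime import date, timedelta

# Toy record: (patient_id, event_date, code) -- e.g., diagnosis codes,
# medications, abnormal-lab flags, at whatever times they happen to occur.
events = [
    ("p1", date(2013, 1, 5), "dx:hypertension"),
    ("p1", date(2013, 1, 5), "lab:glucose_high"),
    ("p1", date(2014, 11, 2), "rx:statin"),
    ("p2", date(2014, 6, 20), "dx:obesity"),
]

prediction_date = date(2015, 1, 1)
windows = {"90d": timedelta(days=90), "2y": timedelta(days=730)}
vocab = sorted({code for _, _, code in events})

def featurize(patient_id):
    """Count each event code within each backward-looking window."""
    counts = defaultdict(int)
    for pid, when, code in events:
        if pid != patient_id or when > prediction_date:
            continue  # never look past the prediction date
        for name, length in windows.items():
            if when >= prediction_date - length:
                counts[(name, code)] += 1
    # Flatten into a fixed-length vector ordered by (window, code).
    return [counts[(w, c)] for w in windows for c in vocab]

print(featurize("p1"))
print(featurize("p2"))
```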
All of those factors, I think, are really the reasons why we're observing these subtleties.

OK, so for the last 10 minutes of class -- I'm going to have to hold questions, because I want to get through all the material, but please post them to Piazza -- I want to change gears a little bit and talk about survival modeling.

So often we want to talk about predicting the time to some event. This black line here is what I mean by an event. That event might be, for example, a patient dying. It might be a married couple getting divorced. It might be the day that you graduate from MIT. And the red dot here denotes a censored event. For whatever reason, we don't have data on this patient, patient S3, after time step 4. They were censored. So we do know that the event didn't occur prior to time step 4, but we don't know if and when it's going to occur after time step 4, because we have missing data there. This is what I mean by right-censored data.

So you might ask, why not just use classification -- binary classification -- in this setting? That's exactly what we did earlier. We formalized the diabetes risk stratification problem as looking to see what happens in years 1 to 3 after the time of prediction, with a gap of one year. And there are a couple of reasons why that's perhaps not what you really want to do.

First, you have less data to use during training. Put differently, if you have patients who were censored during that time window, you're throwing them out, so you have fewer data points. That was part of our inclusion/exclusion criteria.

Also, when you go to deploy these models, your model might say, yes, this patient is going to develop type 2 diabetes between one and three years from now.
But what actually happens is that they develop type 2 diabetes 3.1 years from now. So in your evaluation that patient counts as a negative, and the prediction counts as a false positive. In reality, though, your model wasn't that bad. It did pretty well. It didn't quite get the right range, but the patient did get diagnosed with diabetes just outside that time window. So your measures of performance are going to be pessimistic; you might be doing better than you thought.

Now, you can try to address these two challenges in many ways. You can imagine a multi-task learning framework where you try to predict what's going to happen one to two years from now, two to three years from now, three to four, and so on. Each of those is a different binary classification model, and you might tie together the parameters of those models via a multi-task learning formulation. That will get you closer to what you care about. But what I'll tell you about in the last five minutes is a much more elegant approach to dealing with that, and it's akin to regression.

So that leads to my second point: why not just treat this as a regression problem? Predict the time to the event. You have some continuous-valued outcome, the time until the diagnosis of diabetes. Just try to minimize your squared error in predicting that continuous value.

Well, the first challenge is to remember where that mean squared error loss function came from. It came from thinking of your data as coming from a Gaussian distribution; if you do maximum likelihood estimation of that Gaussian, it turns out to look like minimizing a squared loss. So it's making a lot of assumptions about the outcome. For one, it's assuming the outcome could be negative or positive -- a Gaussian random variable doesn't have to be positive.
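To spell out that connection, here is the standard derivation, with notation chosen for illustration: if the outcome is modeled as Gaussian around a prediction mu_theta(x) with fixed variance, maximizing the likelihood is exactly minimizing squared error.

```latex
% Gaussian MLE reduces to least squares when the variance is held fixed.
\begin{aligned}
p(t_i \mid x_i) &= \mathcal{N}\!\left(t_i ;\, \mu_\theta(x_i),\, \sigma^2\right) \\
\log \prod_{i=1}^{n} p(t_i \mid x_i)
  &= -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(t_i - \mu_\theta(x_i)\bigr)^2
     \;-\; \frac{n}{2}\log\!\left(2\pi\sigma^2\right) \\
\hat{\theta}_{\mathrm{MLE}}
  &= \arg\max_\theta \log \prod_{i} p(t_i \mid x_i)
   \;=\; \arg\min_\theta \sum_{i=1}^{n} \bigl(t_i - \mu_\theta(x_i)\bigr)^2 .
\end{aligned}
```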
But here we know that T is always non-negative. In addition, there might be long tails: we might not know exactly when the patient is going to develop diabetes, but we know it's not going to be now -- it's going to be at some point in the far future. And that may also look very non-Gaussian. So typical regression approaches aren't quite what you want.

But there's another really important problem, which is what happens if you naively remove the censored points -- the individuals for whom you never observe the time, who never get diabetes in your data because they were censored. If you just remove those from your learning algorithm, then you're biasing your results. For example, if you think about the average age of diabetes onset, and you only look at people who were actually observed to get diabetes, your estimate is going to be much earlier than the truth, because the people who were censored are exactly the people who would have gotten it later, after the censoring time. So that's another serious problem.

So the way we're going to formalize this mathematically is as follows. We should think about having data which has, again, features x, and an outcome -- what we usually call Y in regression, but here I'll call it capital T, because it's the time to the event. And now we have an additional variable, so it's no longer a pair, it's a triple: b. And b is a binary variable saying whether this individual was censored -- was the time T denoting a censoring event, or was it denoting the actual event happening? So it's distinguishing between the red and the black. Black is b equal to 0; red is b equal to 1.

OK, so now we can talk about learning a density, P of t, which I'll also call f of t, which is the probability of death at time t. And associated with any density, of course, is the cumulative distribution function, which is the integral of the density from 0 up to any point.
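Collecting that notation in one place -- the symbols follow the lecture, and the conditioning on x is written out only to emphasize that all of these quantities can depend on an individual's covariates:

```latex
% One triple per patient; b_i = 1 marks a censoring time, b_i = 0 an observed event.
\mathcal{D} = \{(x_i,\, T_i,\, b_i)\}_{i=1}^{n}, \qquad T_i \ge 0, \quad b_i \in \{0, 1\}

% Density of the event time, its CDF, and the survival function.
f(t \mid x), \qquad
F(t \mid x) = \int_{0}^{t} f(u \mid x)\, du, \qquad
S(t \mid x) = 1 - F(t \mid x) = \int_{t}^{\infty} f(u \mid x)\, du
```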
Here we'll actually look at 1 minus the CDF, which is called the survival function. It's the probability of capital T, the actual time of the event, being larger than some quantity, little t. And that's, of course, just the integral of the density from little t to infinity. So this is the survival function, and it's of a lot of interest. You want to know, is the patient going to be diagnosed with diabetes two or more years from now?

So pictorially, what you're interested in is something like this. You want to estimate these conditional distributions -- I call them conditional because you want to condition on the covariates of the individual, x. This black line that I'm showing you is your density, little f of t. And this white area here, the integral from little t to infinity, is capital S of t: the probability of surviving longer than time little t.

OK, so the first thing you might do is say, we get these data, these tuples, and we want to try to estimate that function, little f, the probability of death at each time. Or, equivalently, you might want to estimate the survival function, capital S of t, the version based on the CDF. And these two are related to one another just by some calculus.

So a method called the Kaplan-Meier estimator is a non-parametric method for estimating that survival probability, capital S of t -- the probability that an individual lives for more than some time period. First I'll explain this plot, and then I'll tell you how to compute it. The x-axis of this plot is time. The y-axis is the survival probability, capital S of t, the probability that an individual lives more than that amount of time. I think the x-axis is in days, so 500, 1,000, 1,500, 2,000. This figure, by the way, was created by one of my students who's studying a multiple myeloma data set.

So you could then ask, well, for what covariates do you want to compute this survival curve?
Now, the method I'll tell you about is really for when you don't have any features -- all you want to do is estimate that distribution by itself. Of course, you can apply the method to multiple populations, and what I'm showing you here is applying it to two different populations. Suppose there's just a single binary feature, and we apply the estimator separately to the x equals 0 patients and to the x equals 1 patients. That gets you two different curves, but the estimator works independently for each of the two populations.

What you see on this red line is the x equals 0 population. At time 0, everyone is alive, as you would expect. At time 1,000, roughly 60% of individuals are still alive, and that more or less stays constant. For the other subgroup, the x equals 1 subgroup, again at time 0 everyone is alive, as you would expect, but they survive much longer: at time 1,000, over 75% of them are still alive. Of interest here, of course, are also confidence bands. I'm not going to tell you how to compute those, but it's in some of the optional readings -- and by the way, there are more optional readings given at the bottom of these slides. So you can see that there is a statistically significant difference between x equals 1 and x equals 0; these people seem to be surviving longer than those people, and you get that immediately from this curve.

So how do we compute it? Well, we take those observed times, those capital Ts, and here I'm going to call them y. I'm going to sort them, so these are sorted times. And I don't care whether they were censored or not: y is just all of the times for all of the patients, censored or not. d sub K you can think of as 1 -- it's the number of events that occurred at that time.
So if everyone had a unique time of censoring or death, then d sub K is always 1. K indexes one of these sorted times. And n sub K is the number of individuals still alive and uncensored at the K-th time point.

Then what this estimator says is that S of t -- the estimate at any point in time -- is given by the product over all K such that y sub K is less than or equal to t. So it runs over the observed times up to little t, taking the product of 1 minus d sub K over n sub K -- thinking of d sub K as 1, that's 1 minus 1 over the number of people who are alive and uncensored at that time. That has a very intuitive interpretation. And one can prove that this gives you a consistent estimate of the survival probability at any point in time for censored data. And that's critical -- this works for censored data.

So I'm past time today. I'll finish the last few slides in Tuesday's lecture. That's all for today. Thanks.
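For reference, here is a minimal sketch of the Kaplan-Meier estimator just described, written to mirror the lecture's notation: sorted times y_K, event counts d_K, numbers at risk n_K, and S(t) as the product over K with y_K <= t of (1 - d_K / n_K). The toy data and the `kaplan_meier` function name are illustrative assumptions, not taken from the lecture's multiple myeloma example.

```python
# Sketch: Kaplan-Meier estimator for right-censored data.
# Each observation is (time, observed): observed=True means the event happened
# at that time; observed=False means the patient was censored at that time.
from collections import Counter

def kaplan_meier(times, observed):
    """Return (event_times, survival_probabilities) as parallel lists."""
    n = len(times)
    # d_K: number of observed events at each distinct time y_K.
    events = Counter(t for t, obs in zip(times, observed) if obs)
    # Number of patients (events or censorings) leaving the risk set at each time.
    leaving = Counter(times)

    surv, curve_t, curve_s = 1.0, [], []
    at_risk = n                                # n_K: patients still at risk
    for y_k in sorted(leaving):                # sweep the sorted distinct times
        d_k = events.get(y_k, 0)
        if d_k > 0:                            # the curve only drops at event times
            surv *= 1.0 - d_k / at_risk        # multiply in (1 - d_K / n_K)
            curve_t.append(y_k)
            curve_s.append(surv)
        at_risk -= leaving[y_k]                # events and censorings leave the risk set
    return curve_t, curve_s

# Toy example: 6 patients, two of them censored (observed=False).
times    = [2, 3, 3, 5, 8, 10]
observed = [True, True, False, True, False, True]
print(kaplan_meier(times, observed))
```

Censored patients contribute to the risk set up to their censoring time and then drop out without forcing the curve down, which is exactly why this estimator remains consistent for censored data.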