DAVID SONTAG: So I'll begin today's lecture by giving a brief recap of risk stratification. We didn't get to finish talking about survival modeling on Thursday, so I'll go a little bit more into that, and I'll answer some of the questions that arose during our discussions and on Piazza since. And then for the vast majority of today's lecture we'll be talking about a new topic-- in particular, physiological time series modeling. I'll give two examples of physiological time series modeling-- the first one coming from monitoring patients in intensive care units, and the second one asking a very different type of question: that of diagnosing patients' heart conditions using EKGs. Both of these correspond to readings that you had for today's lecture, and we'll go into much more depth into those papers today, and I'll provide much more color around them.

So just to briefly remind you where we were on Thursday, we talked about how one could formalize risk stratification not as a classification problem of what would happen in, let's say, some predefined time period, but rather as a regression question, or regression task: given what you know about a patient at time zero, predict the time to event. So, for example, here the event might be death, divorce, college graduation. For patient one, that event happened at time step nine. For patient two, that event happened at time step 12. And for patient four, we don't know when that event happened, because it was censored. In particular, after time step seven, we no longer get to view any of the patients' data, and so we don't know when that red dot would be-- some time in the future, or never. So this is what we mean by right-censored data, which is precisely what survival modeling is aiming to solve. Are there questions about this setup first?

AUDIENCE: You flipped the x on--

DAVID SONTAG: Yeah, I realized that. I flipped the x and the o in today's presentation, but that's not relevant.
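To make the setup concrete, here is a minimal sketch (in Python, with made-up numbers that only loosely mirror the slide) of how right-censored data is typically represented: one observed time per patient, plus an indicator saying whether that time is the event time or just the time at which we stopped observing.

```python
import numpy as np

# Toy time-to-event data, loosely mirroring the slide: patient 1's event is
# observed at time 9, patient 2's at time 12, and patient 4 is censored at
# time 7 -- all we know is that its event happens after 7, or possibly never.
observed_time = np.array([9.0, 12.0, 5.0, 7.0])   # patient 3's value is made up
censored = np.array([0, 0, 0, 1])                 # 1 = censored, 0 = event observed
```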
DAVID SONTAG: So f of t is the probability of death, or of the event occurring, at time t. And although on this slide I'm showing it as an unconditional model, in general you should think about this as a conditional density. So you might be conditioning on some covariates or features that you have for that patient at baseline. And very important for survival modeling, and for the next things I'll tell you, is the survival function, denoted as capital S of t. That's simply 1 minus the cumulative density function. So it's the probability that the event time, which is denoted here as capital T, is greater than some little t. So it's this function, which is simply given to you by the integral from t to infinity of the density.

So in pictures, this is the density. On the x-axis is time. The y-axis is the density function. And this black curve is what I'm denoting as f of t. And this white area is capital S of t, the survival probability, or survival function. Yes?

AUDIENCE: So I just want to be clear. So if you were to integrate the entire curve, [INAUDIBLE] by infinity you're going to be [INAUDIBLE]..

DAVID SONTAG: In the way that I've described it up to here, yes, because we're talking about the time to event. But often we might be in scenarios where the event may never occur, and you can formalize that in a couple of different ways. You could put a point mass at infinity, or you could simply say that the integral from 0 to infinity is some quantity less than 1. And the readings that I'm referencing at the very bottom of these slides show you how you can very easily modify all of the frameworks I'm telling you about here to deal with that scenario where the event may never occur. But for the purposes of my presentation, you can assume that the event will always occur at some point.
It's a very minor modification where you, in essence, divide the densities by a constant, which accounts for the fact that they wouldn't integrate to one otherwise.

Now, a key question that has to be solved when trying to use a parametric approach to survival modeling is, what should that f of t look like? What should that density function look like? What I'm showing you here is a table of some very commonly used density functions. In the right-hand column is the density function f of t itself. Lambda denotes some parameter of the model; t is the time. And in the middle column is the survival function. For these particular parametric forms, it's obtained by analytically solving that integral from t to infinity-- this is the analytic solution for it. These go by the common names of exponential, Weibull, log-normal, and so on. And critically, all of these have support only on the positive real numbers, because the event can never occur at a negative time.
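As a small numeric illustration of two rows of that table, here is a sketch of the exponential and Weibull survival functions under one common parameterization (the table on the slide may parameterize them differently):

```python
import numpy as np

# One common parameterization (the slide's table may differ):
#   Exponential: f(t) = lam * exp(-lam * t),   S(t) = exp(-lam * t)
#   Weibull:     S(t) = exp(-(lam * t) ** k),  which reduces to the exponential when k = 1
def exponential_survival(t, lam):
    return np.exp(-lam * t)

def weibull_survival(t, lam, k):
    return np.exp(-(lam * t) ** k)

t = np.linspace(0.0, 10.0, 6)
print(exponential_survival(t, lam=0.3))
print(weibull_survival(t, lam=0.3, k=1.5))
```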
Now, we live in a day and age where we no longer have to make standard parametric assumptions for densities. We could, for example, try to formalize the density as the output of some deep neural network. There are two ways to try to do that. One way would be to say that we're going to model the distribution f of t as one of these parametric forms, where lambda, or whatever the parameters of the distribution are, is given by the output of, let's say, a deep neural network on the covariates x. So that would be one approach. A very different approach would be a non-parametric distribution, where you say, OK, I'm going to define f of t extremely flexibly, not as one of these forms.

With the non-parametric route, one runs into a slightly different challenge, because, as I'll show you on the next slide, to do maximum likelihood estimation of these distributions from censored data, one needs to make use of this survival function, S of t. And so if your f of t is complex, and you don't have a nice analytic solution for S of t, then you're going to have to somehow use a numerical approximation of S of t during learning. So it's definitely possible, but it's going to be a little bit more effort.

So now here's where I'm going to get into maximum likelihood estimation of these distributions, and to define for you the likelihood function, I'm going to break it down into two different settings. The first setting is an observation which is uncensored, meaning we do observe when the event-- death, for example-- occurs. And in that case, the probability of the event is very simple. It's just the probability of the event occurring at capital T-- the random variable T equals little t-- which is just f of t. Done.

However, what happens if, for this data point, you don't observe when the event occurred because of censoring? Well, of course, you could just throw away that data point and not use it in your estimation, but that's precisely what we mentioned at the very beginning of last week's lecture-- the goal of survival modeling is to not do that, because if we did, it would introduce bias into our estimation procedure. So we would like to be able to use the observation that this data point was censored, but the only information we can get from that observation is that capital T, the event time, must have occurred some time larger than the observed time of censoring, which is little t here. So we don't know precisely when capital T was, but we know it's something larger than the observed censoring time, little t. And that, remember, is precisely what the survival function is capturing.

So for a censored observation, we're going to use capital S of t within the likelihood. We can then combine these two, for censored and uncensored data, and what we get is the following likelihood objective-- I'm showing you here the log likelihood objective. Recall from last week that little b sub i simply denotes whether this observation is censored or not. So if bi is 1, it means the time that you're given is the time of the censoring event. And if bi is 0, it means the time you're given is the time that the event occurs. So what we're going to do is sum over all of the data points in your data set, from little i equals 1 to little n, of bi times the log of the probability under the censored model, plus 1 minus bi times the log of the probability under the uncensored model. And so this bi is just going to switch which of these two you're going to use for that given data point.
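Here is a minimal numerical sketch of that objective for the exponential model from the table above; the data and variable names are illustrative, and b = 1 marks a censored observation, matching the convention just described.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy data: observed times and censoring indicators (b = 1 means censored).
t = np.array([9.0, 12.0, 3.0, 7.0, 5.0])
b = np.array([0, 0, 0, 1, 0])

def neg_log_likelihood(lam):
    # Uncensored points contribute log f(t) = log(lam) - lam * t;
    # censored points contribute log S(t) = -lam * t.
    log_f = np.log(lam) - lam * t
    log_s = -lam * t
    return -np.sum(b * log_s + (1 - b) * log_f)

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
print(res.x)                    # numerical maximum likelihood estimate of lambda
print((1 - b).sum() / t.sum())  # closed-form MLE for the exponential case, as a check
```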
So the learning objective for maximum likelihood estimation here is very similar to what you're used to in learning distributions, with the big difference that, for censored data, we're going to use the survival function to estimate its probability. Are there any questions?

And this, of course, could then be optimized via your favorite algorithm, whether it be stochastic gradient descent, a second-order method, and so on. Yep?

AUDIENCE: I have a question about the, kind of, side point. You mentioned that we could use [INAUDIBLE]..

DAVID SONTAG: Yes.

AUDIENCE: And then combine it with the parametric approach.

DAVID SONTAG: Yes.

AUDIENCE: So is it true that we still have the parametric assumption, and we just map the input to the parameters?

DAVID SONTAG: Exactly. That's exactly right. So consider the following picture, where-- this is time, t, and this is f of t. You can imagine that for any one patient you might have a different function, but they might all be of the same parametric form.
So they might be like that, or maybe they're shifted a little bit. So you should think about each of these three curves as being from the same parametric family of distributions, but with different means. And in this case, the mean is given by the output of the deep neural network. So that would be the way it would be used, and then one could just backpropagate in the usual way to do learning. Yep?

AUDIENCE: Can you repeat what b sub i is?

DAVID SONTAG: Excuse me?

AUDIENCE: Could you repeat what b sub i is?

DAVID SONTAG: b sub i is just an indicator of whether the i-th data point was censored or not censored. Yes?

AUDIENCE: So [INAUDIBLE] equal it's more a probability density function [INAUDIBLE].

DAVID SONTAG: Cumulative density function.

AUDIENCE: Yeah, but [INAUDIBLE] probability. No, for the [INAUDIBLE] it's a probability density function.

DAVID SONTAG: Yes, so just to--

AUDIENCE: [INAUDIBLE]

DAVID SONTAG: Excuse me?

AUDIENCE: Will it be any problem to combine those two types there?

DAVID SONTAG: That's a very good question. So the observation was that you have two different types of probabilities used here. In this case, we're using something like the cumulative density, whereas here we're using the probability density function. The question was, are these two on different scales? Does it make sense to combine them in this type of linear fashion with the same weighting? And I think it does make sense. So think about a setting where you have a very small time range. You're not exactly sure when this event occurs-- it's somewhere in this time range. In the setting of the censored data, where that time range could potentially be very large, the log probability your model is providing is somehow going to be much more flat, because you're covering much more probability mass.
And so that observation, I think, is intuitively likely to have a bit of a smaller effect on the overall learning algorithm. For these observations, you know precisely where they are, and so as you deviate from that, you incur the corresponding log loss penalty. But I do think that it makes sense to have them on the same scale. If anyone in the room has done work with [INAUDIBLE] modeling and has a different answer to that, I'd love to hear it. Not today, but maybe someone in the future will answer this question differently. I'm going to move on for now.

So the remaining question that I want to talk about today is how one evaluates survival models. We talked about binary classification a lot in the context of risk stratification at the beginning, and we talked about how area under the ROC curve is one measure of classification performance, but here we're doing something more akin to regression, not classification. A standard measure that's used to quantify performance is known as the C-statistic, or concordance index-- those are one and the same-- and it's defined as follows. It has a very intuitive definition. It sums over pairs of data points that can be compared to one another, and it asks: for the event that actually occurs earlier, does the model assign a larger likelihood of it having happened by that time than it assigns to the event that occurs later? I'm going to first illustrate it with this picture, and then I'll work through the math.

So here's the picture, and then we'll talk about the math. What I'm showing you here is every single observation in your data set, sorted by either the censoring time or the event time. In black, I'm illustrating uncensored data points, and in red, I'm denoting censored data points. Now, here we see that for this data point, the event happened before this data point's censoring event.
Now, since this data point was censored, you could think of its true event time as being some time in the far future. So what we would want is for the model to say that the probability that this event happens by this time is larger than the probability that this event happens by this time, because this one actually occurred first. And these two are comparable to each other. On the other hand, it wouldn't make sense to compare y2 and y4, because both of those were censored data points, and we don't know precisely when they occurred. For example, it could very well have happened that event 2 happened after event 4.

So what I'm showing you here with each of these lines are the pairwise comparisons that are actually possible to make. You can make pairwise comparisons, of course, between any pair of events that actually did occur, and you can make pairwise comparisons between censored events and events that occurred before them. Now, if you look at this formula, it's looking at an indicator involving the survival functions of pairs of data points. And which pairs of data points? Precisely those pairs whose comparisons I'm showing with these blue lines here. So we're going to sum over i such that bi is equal to 0-- remember, that means it is an uncensored data point-- and then we compare yi to all other yj that have a value greater than it, both censored and uncensored.

Now, if your data had no censored data points in it, then you can verify that, in fact, this corresponds-- well, there's one other assumption one has to make, which is this: suppose that your outcome is binary. You might wonder how you get a binary outcome from this. Imagine that your density function looked a little bit like this, where the event could occur either at time 1 or at time 2-- so something like that. If the event can occur at only two times, not a whole range of times, then this is analogous to a binary outcome. And if you have a binary outcome like this and no censoring, then, in fact, the C-statistic is exactly equal to the area under the ROC curve. So that just connects it a little bit back to things we're used to.
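Here is a minimal sketch of that concordance computation, using a generic per-patient risk score in place of the survival-function comparison on the slide; ties and other edge cases are ignored for brevity, and mature implementations (for example, the concordance_index utility in the lifelines package) handle more detail.

```python
import numpy as np

def c_statistic(times, censored, risk_scores):
    # A pair (i, j) is comparable when i's event was observed (censored[i] == 0)
    # and times[i] < times[j]; it counts as concordant when the model assigns
    # the earlier event, i, the higher risk.
    num, den = 0, 0
    n = len(times)
    for i in range(n):
        if censored[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                den += 1
                if risk_scores[i] > risk_scores[j]:
                    num += 1
    return num / den

times = np.array([2.0, 5.0, 7.0, 9.0])
censored = np.array([0, 0, 1, 0])
risk = np.array([0.9, 0.6, 0.4, 0.2])      # hypothetical model outputs
print(c_statistic(times, censored, risk))  # 1.0 here: every comparable pair is concordant
```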
Yep?

AUDIENCE: Just to make sure that I understand. So y1 is going to be, we observed an event, and y2 is going to be, we know that no event occurred until that day?

DAVID SONTAG: Every dot corresponds to one observation, either censored or not.

AUDIENCE: Thank you.

DAVID SONTAG: And they're sorted. In this figure, they're sorted by the time of either the censoring or the event occurring.

So the C-statistic is one way to measure the performance of your survival modeling, but you might remember that when we talked about binary classification, we said that area under the ROC curve by itself is very limiting, and so we should think through other performance metrics of relevance. So here are a few other things that you could do. One thing you could do is use the mean squared error-- so, again, thinking about this as a regression problem. But of course, that only makes sense for uncensored data points. So focus on just the uncensored data points, and look to see how well we're doing at predicting when the event occurs. The second thing one could do, since you have the ability to define the likelihood of an observation, censored or not censored, is hold out data and look at the held-out likelihood, or log likelihood, of that held-out data. And the third thing you could do is, after learning using this survival modeling framework, turn it into a binary classification problem by, for example, artificially choosing time ranges-- like, greater than three months is 1, less than three months is 0.
That would be one crude definition. And then once you've done a reduction to a binary classification problem, you could use all of the existing performance metrics you're used to thinking about for binary classification to evaluate the performance there-- things like positive predictive value, for example. And you could, of course, choose different reductions and get different performance statistics out. So this is just a small subset of ways to try to evaluate survival modeling, but it's a very, very rich literature. And again, at the bottom of these slides I've pointed you to several references that you can go to to learn more.

The final comment I wanted to make is that I've only told you about one estimator in today's lecture, which is known as the likelihood-based estimator. But there is a whole other estimation approach for survival modeling, which is very important to know about, called partial likelihood estimators. And for those of you who have heard of Cox proportional hazards models-- and I know they were discussed in Friday's recitation-- that's an example of a class of model that's commonly used with this partial likelihood estimator.

Now, at a very intuitive level, what this partial likelihood estimator is doing is working with something like the C-statistic. Notice how the C-statistic only looks at the relative orderings of the event occurrences. It doesn't care about exactly when an event occurred. In some sense, there's a constant in this survival function which could be divided out from both sides of this inequality, and it wouldn't affect anything about the statistic. And so one could think about other ways of learning these models by saying, well, we want to learn a survival function such that it gets the ordering correct between data points.
Now, such a survival function wouldn't do a very good job of getting the precise time at which an event occurs-- there's no reason it would. But if your goal were just to figure out the sorted order of patients by risk, so that you're going to do an intervention on the 10 most risky people, then getting that order correct is going to be enough. And that's precisely the intuition behind these partial likelihood estimators-- they focus on something which is a little bit less than the original goal, but in doing so they can have much better statistical complexity, meaning the amount of data they need in order to fit these models well. And again, this is a very rich topic. All I wanted to do is give you a pointer to it so that you can go read more about it if it's of interest to you.

So now, moving on in the recap, one of the most important points that we discussed last week was about non-stationarity. And there was a really interesting question posted to Piazza, which is, how do you actually deal with non-stationarity? I spoke a lot about it existing, and I talked about how to test for it, but I didn't say what to do if you have it. So I thought this was such an interesting question that I would also talk about it a bit during lecture.

The short answer is, if you have to have a solution that you deploy tomorrow, then here's the hack that sometimes works. You take your most recent data, like the last three months' data, and you hope that there's not much non-stationarity within the last three months. You throw out all the historical data, and you just train using the most recent data. That's a bit unsatisfying, because you might now have extremely little data left to learn with, but if you have enough volume, it might be good enough. The really interesting question from a research perspective, though, is how you could optimally use that historical data. So here are three different ways. One way has to do with imputation.
Imagine that the way in which your data was non-stationary was that there were, let's say, periods of time when certain features were just unavailable. I gave you this example last week of laboratory test results across time, and I showed you how there are sometimes really big blocks of time where no lab tests are available, or very few are. Well, luckily we live in a world with high-dimensional data, and what that means is that there's often a lot of redundancy in the data. So what you could imagine doing is imputing the features that you observe to be missing, such that the missingness properties, in fact, aren't changing as much across time after imputation. And if you do that as a pre-processing step, it may allow you to make use of much more of the historical data.

A different approach, which is intimately tied to that, has to do with transforming the data-- instead of imputing it, transforming it into another representation altogether, such that that representation is invariant across time. And here I'm giving you a reference to this paper by Ganin et al. from the Journal of Machine Learning Research 2016, which talks about how to do domain-invariant learning of neural networks, and that's one approach to doing so. And I view those two as being very similar-- imputation and transformation.

A second approach is to re-weight the data to look like the current data. So imagine that you go back in time, and you say, you know what? ICD-10 codes, for some very weird reason-- this is not true, by the way-- ICD-10 codes in this untrue world happened to be used between March and April of 2003. And then they weren't used again until 2015. So instead of throwing away all of the previous data, we're going to recognize that that three-month interval 10 years ago was actually drawn from a very similar distribution as what we're going to be testing on today. So we're going to weight those data points up very much, and down-weight the data points that are less like the ones from today. That's the intuition behind these re-weighting approaches, and we're going to talk much more about that in the context of causal inference-- not because these two have much to do with each other, but because they end up using a very similar technique for dealing with data set shift, or covariate shift.
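The lecture gives the intuition rather than a specific recipe, but one standard way to get such weights is a density-ratio estimate from a classifier trained to distinguish recent from historical examples; here is a minimal sketch with synthetic data and illustrative names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for historical and recent feature matrices.
rng = np.random.default_rng(0)
X_hist = rng.normal(0.0, 1.0, size=(500, 10))
X_recent = rng.normal(0.5, 1.0, size=(100, 10))

# Train a classifier to tell "recent" (1) from "historical" (0) examples.
X = np.vstack([X_hist, X_recent])
z = np.concatenate([np.zeros(len(X_hist)), np.ones(len(X_recent))])
clf = LogisticRegression(max_iter=1000).fit(X, z)

# Importance weights for the historical points: up-weight the ones that look
# like today's data, down-weight the ones that don't.
p_recent = clf.predict_proba(X_hist)[:, 1]
weights = p_recent / (1.0 - p_recent)
weights *= len(weights) / weights.sum()   # normalize to mean 1
# These weights could then be passed as sample_weight when fitting the risk model.
```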
And the final technique that I'll mention is based on online learning algorithms. The idea there is that there might be cut points-- change points-- across time. So maybe the data looks one way up until this change point, then suddenly the data looks really different until this next change point, and then suddenly the data looks very different on into the future. Here I'm showing you a case where there are two change points at which data set shift happens. What these online learning algorithms do is say, OK, suppose we were forced to make predictions throughout this time period, using only the historical data to make predictions at each point in time. Well, if we could somehow recognize that there might be these shifts, we could design algorithms that are going to be robust to those shifts. And then one could try to mathematically analyze those algorithms based on the amount of regret they would have relative to, for example, an algorithm that knew exactly when those changes occurred. And of course, we don't know precisely when those changes occurred. So there's a whole field of algorithms trying to do that, and here I'm just giving one citation to a recent work.

So, to conclude risk stratification-- this is the last slide here. (Maybe ask your question after class.) We've talked about two approaches for formalizing risk stratification-- first as binary classification, and second as regression. And in the regression framework, one has to think about censoring, which is why we call it survival modeling.
Second, in our examples, and again in your homework assignment that's coming up next week, we'll see that often the variables-- the features that are most predictive-- make a lot of sense. In the diabetes case, we saw how patients having comorbidities of diabetes, like hypertension, or patients being obese, were very predictive of patients getting diabetes. So you might ask yourself, is there something causal there? Are those features that are very predictive in fact causing the patient to develop type 2 diabetes-- like, for example, obesity causing diabetes? And this is where I want to caution you. You shouldn't interpret these very predictive features in a causal fashion, particularly not when one starts to work with high-dimensional data, as we do in this course. The reason for that is very subtle, and we'll talk about it in the causal inference lectures, but I just wanted to give you a pointer now that you shouldn't think about it in that way. And you'll understand why in just a few weeks.

And finally, we talked about ways of dealing with missing data. I gave you one feature representation for the diabetes case which was designed to deal with missing data. It asked, was there any diagnosis code 250.01 in the last three months? If there was, you have a 1; if there wasn't, a 0. So it's designed to recognize that you might not have information for some large chunk of time in that window. But that missing data could also be dangerous if the missingness itself leads to non-stationarity, which is then going to result in your test distribution looking different from your training distribution. And that's where approaches based on imputation could actually be very valuable-- not because they improve your predictive accuracy when everything goes right, but because they might improve your predictive accuracy when things go wrong.
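As a small illustration of that combination-- an explicit missingness indicator alongside a simple imputation-- here is a sketch with made-up lab values; the column names and the fill rule are purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up weekly lab values; NaN means the lab was not measured that week.
labs = pd.DataFrame({"a1c": [6.1, np.nan, np.nan, 7.0, np.nan]})

# Keep an explicit indicator of missingness (in the spirit of the
# "was code 250.01 observed in the window?" style of feature), and also
# impute, so downstream models see a complete matrix whose missingness
# pattern depends less on any one era's recording practices.
labs["a1c_missing"] = labs["a1c"].isna().astype(int)
labs["a1c_imputed"] = labs["a1c"].ffill().fillna(labs["a1c"].mean())
print(labs)
```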
And one of your readings for last week's lecture was actually an example of that, where they used a Gaussian process model to impute much of the missing data in a patient's continuous vital signs, and then used a recurrent neural network to predict based on that imputed data. So in that case, there are really two things going on. The first is this robustness to data set shift. But there's a second thing going on as well, which has to do with the trade-off between the amount of data you have and the complexity of the prediction problem. By doing imputation, sometimes you make your problem look a bit simpler, and simpler algorithms might succeed where otherwise they would fail because of not having enough data. And that's something you saw in last week's reading.

So I'm done with risk stratification. I'll take a one-minute breather for everyone in the room, and then we'll start with the main topic of this lecture, which is physiological time-series modeling.

Let's get started. So here's a baby that's not doing very well. This baby is in the intensive care unit. Maybe it was a premature infant. Maybe it's a baby who has some chronic disease. And, of course, the parents are very worried. This baby is getting very close monitoring. It's connected to lots of different probes.
Number one here is illustrating a three-lead ECG, which we'll be talking about much more, and which is measuring how the baby's heart is doing. Over here, number three is something attached to the baby's foot-- it's a pulse oximeter, which is measuring the baby's oxygen saturation, the amount of oxygen in the blood. Number four is a probe which is measuring the baby's temperature, and so on. So we're really taking very close measurements of this baby, because we want to understand how this baby is doing.

We recognize that there might be really sudden changes in the baby's state of health that we want to be able to recognize as early as possible. And so behind the scenes, next to this baby, you'll of course have a huge number of monitors, each of them showing the readout from one of these different signals. This type of data is really prevalent in intensive care units, but you'll also see in today's lecture how some aspects of this data are now starting to make their way into the home as well. For example, EKGs are now available on Apple and Samsung watches to help with the diagnosis of arrhythmias, even for people at home.

And so with this type of data, there are a number of really important use cases to think about. The first one is to recognize that often we're getting really noisy data, and we want to try to infer the true signal. So imagine, for example, the temperature probe. The baby's true temperature might be 98.5, but for whatever reason-- we'll see a few reasons here today-- maybe you're getting an observation of 93. And you don't know: is that actually the baby's true temperature, in which case the baby would be in a lot of trouble, or is that an anomalous reading? We'd like to be able to distinguish between those two things.

In other cases, we're not necessarily interested in fully understanding what's going on with the baby along each of those axes; we just want to use that data for predictive purposes-- for risk stratification, for example. And so the type of machine learning approach that we'll take here will depend on the following three factors. First, do we have labeled data available? For example, do we know the ground truth of what the baby's true temperature was, at least for a few of the babies in the training set? Second, do we have a good mechanistic or statistical model of how this data might evolve across time? We know a lot about hearts, for example.
Cardiology is one of those fields of medicine that's really well studied. There are good simulators of hearts-- how they beat across time, and how that affects the electrical signal measured across the body. And if we have these good mechanistic or statistical models, that can often allow one to trade off not having much labeled data, or just not having much data, period. And it's really the extremes of these three points that I want to illustrate in today's lecture-- what you do when you don't have much data, and what you can do when you have a ton of data. I think that's going to be really informative for us as we go out into the world and have to tackle each of those two settings.

So here's an example of two different babies with very different trajectories. The x-axis here is time in seconds-- I think seconds, maybe minutes. The y-axis here is the baby's heart rate in beats per minute, and you see that in some cases it's really fluctuating a lot up and down, in some cases it's moving in one direction, and in all cases the short-term observations are very different from the long-range trajectories.

So the first problem that I want us to think about is one of trying to understand how we deconvolve the truth of what's going on with, for example, the patient's blood pressure or oxygen from the interventions that are happening to them. On the bottom here, I'm showing examples of interventions. Here, in this oxygen uptake signal, notice how between roughly 1,000 and 2,000 seconds there's suddenly no signal whatsoever. That's an example of what's called dropout. Over here, we see the effect of a different type of intervention, which is due to a probe recalibration. At that time, there was a dropout followed by a sudden change in the values, and that's really happening due to a recalibration step.
759 00:35:52,720 --> 00:35:55,710 And in both of these cases, what's 760 00:35:55,710 --> 00:35:58,132 going on with the individual might be relatively 761 00:35:58,132 --> 00:36:00,090 constant across time, but what's being observed 762 00:36:00,090 --> 00:36:04,240 is dramatically affected by those interventions. 763 00:36:04,240 --> 00:36:06,070 So we want to ask the question, can we 764 00:36:06,070 --> 00:36:08,788 identify those artifactual processes? 765 00:36:08,788 --> 00:36:11,080 Can we identify that these interventions were happening 766 00:36:11,080 --> 00:36:12,080 at those points in time? 767 00:36:15,680 --> 00:36:18,000 And then, if we could identify them, 768 00:36:18,000 --> 00:36:21,120 then we could potentially subtract their effect out. 769 00:36:21,120 --> 00:36:27,210 So we could impute the data, which we know-- now 770 00:36:27,210 --> 00:36:30,390 know to be missing, and then have this much higher quality 771 00:36:30,390 --> 00:36:33,130 signal used for some downstream predictive purpose, 772 00:36:33,130 --> 00:36:34,910 for example. 773 00:36:34,910 --> 00:36:37,510 And the second reason why this can be really important 774 00:36:37,510 --> 00:36:40,660 is to tackle this problem called alarm fatigue. 775 00:36:43,370 --> 00:36:47,030 Alarm fatigue is one of the most important challenges facing 776 00:36:47,030 --> 00:36:48,500 medicine today. 777 00:36:48,500 --> 00:36:52,370 As we get better and better in doing risk stratification, 778 00:36:52,370 --> 00:36:58,700 as we come up with more and more diagnostic tools and tests, 779 00:36:58,700 --> 00:37:02,090 that means these red flags are being raised more and more 780 00:37:02,090 --> 00:37:03,690 often. 781 00:37:03,690 --> 00:37:08,170 And each one of these has some associated false positive rate 782 00:37:08,170 --> 00:37:09,800 for it. 783 00:37:09,800 --> 00:37:13,510 And so the more tests you have-- 784 00:37:13,510 --> 00:37:15,250 suppose the false positive rate is 785 00:37:15,250 --> 00:37:18,160 kept constant-- the more tests you have, the more likely 786 00:37:18,160 --> 00:37:20,140 it is that the union of all of those 787 00:37:20,140 --> 00:37:24,568 is going to be some error. 788 00:37:24,568 --> 00:37:27,540 And so when you're in an intensive care unit, 789 00:37:27,540 --> 00:37:29,500 there are alarms going off all the time. 790 00:37:29,500 --> 00:37:31,630 And something that happens is that nurses end up 791 00:37:31,630 --> 00:37:35,110 starting to ignore those alarms, because so often 792 00:37:35,110 --> 00:37:37,480 those alarms are false positives, 793 00:37:37,480 --> 00:37:39,700 are due to, for example, artifacts 794 00:37:39,700 --> 00:37:41,835 like what I'm showing you here. 795 00:37:41,835 --> 00:37:43,960 And so if we had techniques, such as the ones we'll 796 00:37:43,960 --> 00:37:47,680 talk about right now, which could recognize when, 797 00:37:47,680 --> 00:37:50,470 for example, the sudden drop in a patient's heart rate 798 00:37:50,470 --> 00:37:54,940 is due to an artifact and not due to the patient's true heart 799 00:37:54,940 --> 00:37:56,958 rate dropping-- 800 00:37:56,958 --> 00:37:58,500 if we had enough confidence in that-- 801 00:37:58,500 --> 00:37:59,958 in distinguishing those two things, 802 00:37:59,958 --> 00:38:03,150 then we might not decide to raise that red flag. 803 00:38:03,150 --> 00:38:06,430 And that might reduce the amount of false alarms, 804 00:38:06,430 --> 00:38:09,150 and that then might reduce the amount of alarm fatigue. 
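To make the compounding-false-positives point concrete, under the simplifying assumption that the alarms fire independently with a common false positive rate \(\alpha\), the chance that at least one of \(k\) alarms is spurious is

\[ P(\text{at least one false alarm}) = 1 - (1 - \alpha)^k, \]

so with an illustrative \(\alpha = 0.05\) and \(k = 20\) independent alarm checks, that probability is already \(1 - 0.95^{20} \approx 0.64\).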
805 00:38:09,150 --> 00:38:11,850 And that could have a very big impact on health care. 806 00:38:15,980 --> 00:38:19,150 So the technique which we'll talk about today 807 00:38:19,150 --> 00:38:24,170 goes by the name of switching linear dynamical systems. 808 00:38:24,170 --> 00:38:25,820 Who here has seen a picture like this 809 00:38:25,820 --> 00:38:29,630 on-- this picture on the bottom before. 810 00:38:29,630 --> 00:38:32,173 About half of the room. 811 00:38:32,173 --> 00:38:33,590 So for the other half of the room, 812 00:38:33,590 --> 00:38:36,620 I'm going to give a bit of a recap 813 00:38:36,620 --> 00:38:38,960 into probabilistic modeling. 814 00:38:38,960 --> 00:38:43,830 All of you are now familiar with general probabilities. 815 00:38:43,830 --> 00:38:48,230 So you're used to thinking about, for example, 816 00:38:48,230 --> 00:38:51,230 univariate Gaussian distributions. 817 00:38:51,230 --> 00:38:54,050 We talked about how one could model survival, which 818 00:38:54,050 --> 00:38:57,440 was an example of such a distribution, 819 00:38:57,440 --> 00:38:59,088 but for today's lecture, we're going 820 00:38:59,088 --> 00:39:01,130 to be thinking now about multivariate probability 821 00:39:01,130 --> 00:39:01,820 distributions. 822 00:39:01,820 --> 00:39:05,870 In particular, we'll be thinking about how a patient's state-- 823 00:39:05,870 --> 00:39:08,120 let's say their true blood pressure-- 824 00:39:08,120 --> 00:39:09,990 evolves across time. 825 00:39:09,990 --> 00:39:14,570 And so now we're interested in not just the random variable 826 00:39:14,570 --> 00:39:16,740 at one point in time, but that same random variable 827 00:39:16,740 --> 00:39:18,782 at the second point in time, third point in time, 828 00:39:18,782 --> 00:39:21,488 fourth point in time, fifth point in time, and so on. 829 00:39:21,488 --> 00:39:23,030 So what I'm showing you here is known 830 00:39:23,030 --> 00:39:26,270 as a graphical model, also known as a Bayesian network. 831 00:39:26,270 --> 00:39:29,050 And it's one way of illustrating a multivariate probability 832 00:39:29,050 --> 00:39:31,460 distribution that has particular conditional independence 833 00:39:31,460 --> 00:39:33,490 properties. 834 00:39:33,490 --> 00:39:40,690 Specifically, in this model, one node 835 00:39:40,690 --> 00:39:42,260 corresponds to one random variable. 836 00:39:42,260 --> 00:39:46,840 So this is describing a joint distribution on x1 837 00:39:46,840 --> 00:39:55,117 through x6, y1 through y6. 838 00:39:55,117 --> 00:39:56,700 So it's this multivariate distribution 839 00:39:56,700 --> 00:40:00,570 on 12 random variables. 840 00:40:00,570 --> 00:40:03,600 The fact that this is shaded in simply 841 00:40:03,600 --> 00:40:07,110 denotes that, at test time, when we use these models, typically 842 00:40:07,110 --> 00:40:09,780 these y variables are observed. 843 00:40:09,780 --> 00:40:13,410 Whereas our goal is usually to infer the x variables. 844 00:40:13,410 --> 00:40:16,950 Those are typically unobserved, meaning that our typical task 845 00:40:16,950 --> 00:40:20,340 is one of doing posterior inference to infer 846 00:40:20,340 --> 00:40:22,725 the x's given the y's. 847 00:40:25,470 --> 00:40:28,860 Now, associated with this graph, I already 848 00:40:28,860 --> 00:40:31,740 told you the nodes correspond to random variables. 849 00:40:31,740 --> 00:40:36,330 The graph tells us how is this joint distribution factorized. 
850 00:40:36,330 --> 00:40:41,130 In particular, it's going to be factorized 851 00:40:41,130 --> 00:40:42,240 in the following way-- 852 00:40:42,240 --> 00:40:45,210 as the product over random variables 853 00:40:45,210 --> 00:40:49,000 of the probability of the i-th random variable. 854 00:40:49,000 --> 00:40:51,840 I'm going to use z to just denote a random variable. 855 00:40:51,840 --> 00:40:55,680 Think of z as the union of x and y. 856 00:40:55,680 --> 00:40:59,610 zi conditioned on the parents-- 857 00:40:59,610 --> 00:41:01,800 the values of the parents of zi. 858 00:41:05,820 --> 00:41:10,080 So I'm going to assume this factorization, 859 00:41:10,080 --> 00:41:13,800 and in particular for this graphical model, which 860 00:41:13,800 --> 00:41:15,870 goes by the name of a Markov model, 861 00:41:15,870 --> 00:41:18,810 it has a very specific factorization. 862 00:41:18,810 --> 00:41:22,180 And we're just going to read it off from this definition. 863 00:41:22,180 --> 00:41:26,340 So we're going to go in order-- first x1, then y1, 864 00:41:26,340 --> 00:41:28,410 then x2, then y2, and so on, which 865 00:41:28,410 --> 00:41:36,630 is going based on a root to children 866 00:41:36,630 --> 00:41:39,340 traversal of this graph. 867 00:41:39,340 --> 00:41:44,410 So the first random variable is x1. 868 00:41:44,410 --> 00:41:50,230 Second variable is y1, and what are the parents of y-- 869 00:41:50,230 --> 00:41:51,757 sorry, what are the parents of y1? 870 00:41:51,757 --> 00:41:52,840 Everyone can say out loud. 871 00:41:52,840 --> 00:41:54,070 AUDIENCE: x1. 872 00:41:54,070 --> 00:41:55,090 DAVID SONTAG: x1. 873 00:41:55,090 --> 00:42:01,450 So y1 in this factorization is only going to depend on x1. 874 00:42:01,450 --> 00:42:02,740 Next we have x2. 875 00:42:02,740 --> 00:42:03,940 What are the parents of x2? 876 00:42:03,940 --> 00:42:05,390 Everyone say out loud? 877 00:42:05,390 --> 00:42:06,370 AUDIENCE: x1. 878 00:42:06,370 --> 00:42:07,840 DAVID SONTAG: x1. 879 00:42:07,840 --> 00:42:09,790 Then we have y2. 880 00:42:09,790 --> 00:42:11,633 What are the parents of y2? 881 00:42:11,633 --> 00:42:12,550 Everyone say out loud. 882 00:42:12,550 --> 00:42:14,080 AUDIENCE: x2. 883 00:42:14,080 --> 00:42:16,960 DAVID SONTAG: x2 and so on. 884 00:42:16,960 --> 00:42:20,920 So this joint distribution is going 885 00:42:20,920 --> 00:42:23,560 to have a particularly simple form, which 886 00:42:23,560 --> 00:42:26,280 is given by this factorization shown here. 887 00:42:26,280 --> 00:42:28,420 And this factorization corresponds one to one 888 00:42:28,420 --> 00:42:32,400 with the particular graph in the way that I just told you. 889 00:42:32,400 --> 00:42:35,760 And in this way, we can define a very complex probability 890 00:42:35,760 --> 00:42:39,900 distribution by a number of much simpler conditional probability 891 00:42:39,900 --> 00:42:41,220 distributions. 892 00:42:41,220 --> 00:42:44,740 For example, if each of the random variables were binary, 893 00:42:44,740 --> 00:42:48,840 then to describe probability of y1 given x1, 894 00:42:48,840 --> 00:42:50,250 we only need two numbers. 895 00:42:50,250 --> 00:42:52,840 For each value of x1, either 0 or 1, 896 00:42:52,840 --> 00:42:55,290 we give the probability of y1 equals 1. 897 00:42:55,290 --> 00:42:59,530 And then, of course, the probability of y1 equals 0 is just 1 minus that.
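Written out, the factorization that this chain-structured graph encodes for the six time steps on the slide is

\[ p(x_{1:6}, y_{1:6}) \;=\; p(x_1)\, p(y_1 \mid x_1) \prod_{t=2}^{6} p(x_t \mid x_{t-1})\, p(y_t \mid x_t), \]

with each variable appearing exactly once, conditioned only on its parents in the graph.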
898 00:42:59,530 --> 00:43:02,290 So we can describe that very complicated joint distribution 899 00:43:02,290 --> 00:43:07,200 by a number of much smaller distributions. 900 00:43:07,200 --> 00:43:10,700 Now, the reason why I'm drawing it in this way 901 00:43:10,700 --> 00:43:13,940 is because we're making some really strong assumptions 902 00:43:13,940 --> 00:43:18,020 about the temporal dynamics in this problem. 903 00:43:18,020 --> 00:43:23,360 In particular, the fact that x3 only 904 00:43:23,360 --> 00:43:27,720 has an arrow from x2 and not from x1 905 00:43:27,720 --> 00:43:32,540 implies that x3 is conditionally independent of x1. 906 00:43:32,540 --> 00:43:34,400 If you knew x2's value. 907 00:43:34,400 --> 00:43:37,970 So in some sense, think about this as cutting. 908 00:43:37,970 --> 00:43:40,700 If you're to take x2 out of the model 909 00:43:40,700 --> 00:43:43,040 and remove all edges incident on it, 910 00:43:43,040 --> 00:43:46,490 then x1 and x3 are now separated from one another. 911 00:43:46,490 --> 00:43:48,110 They're independent. 912 00:43:48,110 --> 00:43:51,740 Now, for those of you who do know graphical models, 913 00:43:51,740 --> 00:43:54,770 you'll recognize that that type of independent statement that I 914 00:43:54,770 --> 00:43:56,480 made is only true for Markov models, 915 00:43:56,480 --> 00:43:58,605 and the semantics for Bayesian networks 916 00:43:58,605 --> 00:43:59,730 are a little bit different. 917 00:43:59,730 --> 00:44:02,058 But actually for this model, it's-- they're one 918 00:44:02,058 --> 00:44:02,600 and the same. 919 00:44:05,910 --> 00:44:08,990 So we're going to make the following assumptions 920 00:44:08,990 --> 00:44:12,890 for the conditional distributions shown here. 921 00:44:12,890 --> 00:44:16,850 First, we're going to suppose that xt is given to you 922 00:44:16,850 --> 00:44:19,490 by a Gaussian distribution. 923 00:44:19,490 --> 00:44:23,570 Remember xt-- t is denoting a time step. 924 00:44:23,570 --> 00:44:26,815 Let's say 3-- it only depends in this picture-- 925 00:44:26,815 --> 00:44:28,190 the conditional distribution only 926 00:44:28,190 --> 00:44:30,650 depends on the previous time step's value, x2, 927 00:44:30,650 --> 00:44:32,310 or xt minus 1. 928 00:44:32,310 --> 00:44:34,850 So you'll notice how I'm going to say here 929 00:44:34,850 --> 00:44:36,620 xt is going to distribute as something, 930 00:44:36,620 --> 00:44:38,690 but the only random variables in this something 931 00:44:38,690 --> 00:44:42,680 can be xt minus 1, according to these assumptions. 932 00:44:42,680 --> 00:44:44,180 In particular, we're going to assume 933 00:44:44,180 --> 00:44:47,930 that it's some Gaussian distribution, whose mean is 934 00:44:47,930 --> 00:44:51,020 some linear transformation of xt minus 1, 935 00:44:51,020 --> 00:44:55,240 and which has a fixed covariance matrix q. 936 00:44:55,240 --> 00:45:00,310 So at each step of this process, the next random variable 937 00:45:00,310 --> 00:45:03,700 is some random walk from the previous random variable 938 00:45:03,700 --> 00:45:07,833 where you're moving according to some Gaussian distribution. 939 00:45:07,833 --> 00:45:09,250 In a very similar way, we're going 940 00:45:09,250 --> 00:45:17,410 to assume that yt is drawn also as a Gaussian distribution, 941 00:45:17,410 --> 00:45:20,550 but now depending on xt. 942 00:45:20,550 --> 00:45:24,120 So I want you to think about xt as the true state 943 00:45:24,120 --> 00:45:25,410 of the patient. 
944 00:45:25,410 --> 00:45:28,590 It's a vector that's summarizing their blood 945 00:45:28,590 --> 00:45:31,200 pressure, their oxygen saturation, 946 00:45:31,200 --> 00:45:33,150 a whole bunch of other parameters, 947 00:45:33,150 --> 00:45:35,460 or maybe even just one of those. 948 00:45:35,460 --> 00:45:39,300 And y1 are the observations that you do observe. 949 00:45:39,300 --> 00:45:41,890 So let's say x1 is the patient's true blood pressure. 950 00:45:41,890 --> 00:45:43,980 y1 is the observed blood pressure, 951 00:45:43,980 --> 00:45:47,010 what comes from your monitor. 952 00:45:47,010 --> 00:45:48,660 So then a reasonable assumption would 953 00:45:48,660 --> 00:45:52,350 be that, well, if all this were equal, 954 00:45:52,350 --> 00:45:53,910 if it was a true observation, then 955 00:45:53,910 --> 00:45:55,750 y1 should be very close to x1. 956 00:45:55,750 --> 00:45:58,680 So you might assume that this covariance matrix is-- 957 00:45:58,680 --> 00:46:01,460 the covariance is-- the variance is very, very small. 958 00:46:01,460 --> 00:46:07,280 y1 should be very close to x1 if it's a good observation. 959 00:46:07,280 --> 00:46:10,100 And of course, if it's a noisy observation-- 960 00:46:10,100 --> 00:46:15,680 like, for example, if the probe was disconnected from the baby, 961 00:46:15,680 --> 00:46:19,790 then y1 should have no relationship to x1. 962 00:46:19,790 --> 00:46:23,460 And that dependence on the actual state of the world 963 00:46:23,460 --> 00:46:26,730 I'm denoting here by these superscripts, s of t. 964 00:46:26,730 --> 00:46:28,730 I'm ignoring that right now, and I'll bring that 965 00:46:28,730 --> 00:46:31,910 in in the next slide. 966 00:46:31,910 --> 00:46:36,230 Similarly, the relationship between x2 and x1 967 00:46:36,230 --> 00:46:38,510 should be one which captures some of the dynamics 968 00:46:38,510 --> 00:46:42,140 that I showed in the previous slides, where I showed over 969 00:46:42,140 --> 00:46:46,040 here now this is the patient's true heart rate evolving 970 00:46:46,040 --> 00:46:48,080 across time, let's say. 971 00:46:48,080 --> 00:46:51,800 Notice how, if you look very locally, 972 00:46:51,800 --> 00:46:56,720 it looks like there are some very, very big local dynamics. 973 00:46:56,720 --> 00:46:58,790 Whereas if you look more globally, 974 00:46:58,790 --> 00:47:01,340 again, there's some smoothness, but there are some-- again, 975 00:47:01,340 --> 00:47:03,590 it looks like some random changes across time. 976 00:47:03,590 --> 00:47:10,070 And so those-- that drift has to somehow 977 00:47:10,070 --> 00:47:13,550 be summarized in this model by that A random variable. 978 00:47:13,550 --> 00:47:16,130 And I'll get into more detail about that in just a moment. 979 00:47:18,750 --> 00:47:20,990 So what I just showed you was an example 980 00:47:20,990 --> 00:47:23,360 of a linear dynamical system, but it 981 00:47:23,360 --> 00:47:27,170 was assuming that there were none of these events happening, 982 00:47:27,170 --> 00:47:30,082 none of these artifacts happening. 983 00:47:30,082 --> 00:47:31,540 The actual model that we were going 984 00:47:31,540 --> 00:47:33,040 to want to be able to use then is 985 00:47:33,040 --> 00:47:34,330 going to also incorporate the fact 986 00:47:34,330 --> 00:47:35,320 that there might be artifacts. 
987 00:47:35,320 --> 00:47:36,640 And to model that, we need to introduce 988 00:47:36,640 --> 00:47:38,473 additional random variables corresponding to 989 00:47:38,473 --> 00:47:40,250 whether those artifacts occurred or not. 990 00:47:40,250 --> 00:47:42,290 And so that's now this model. 991 00:47:42,290 --> 00:47:45,370 So I'm going to let these S's-- 992 00:47:45,370 --> 00:47:47,850 these are other random variables, 993 00:47:47,850 --> 00:47:51,310 which are denoting artifactual events. 994 00:47:51,310 --> 00:47:52,970 They are also evolving with time. 995 00:47:52,970 --> 00:47:55,420 For example, if there's an artifactual event 996 00:47:55,420 --> 00:47:57,875 at three seconds, maybe there's also an artifactual event 997 00:47:57,875 --> 00:47:58,720 at four seconds. 998 00:47:58,720 --> 00:48:00,887 And we'd like to model the relationship between those. 999 00:48:00,887 --> 00:48:02,600 That's why you have these arrows. 1000 00:48:02,600 --> 00:48:08,180 And then the way that we interpret the observations 1001 00:48:08,180 --> 00:48:12,620 that we do get depends on both the true value 1002 00:48:12,620 --> 00:48:14,340 of what's going on with the patient 1003 00:48:14,340 --> 00:48:17,612 and whether there was an artifactual event or not. 1004 00:48:17,612 --> 00:48:19,070 And you'll notice that there's also 1005 00:48:19,070 --> 00:48:20,780 an edge going from the artifactual events 1006 00:48:20,780 --> 00:48:23,270 to the true values to note the fact 1007 00:48:23,270 --> 00:48:27,680 that those interventions might actually 1008 00:48:27,680 --> 00:48:29,030 be affecting the patient. 1009 00:48:29,030 --> 00:48:31,040 For example, if you give them a medication 1010 00:48:31,040 --> 00:48:36,800 to change their blood pressure, then that procedure 1011 00:48:36,800 --> 00:48:39,895 is going to affect the next time step's value of the patient's 1012 00:48:39,895 --> 00:48:40,520 blood pressure. 1013 00:48:44,360 --> 00:48:47,917 So when one wants to learn this model, 1014 00:48:47,917 --> 00:48:49,750 you have to ask yourself, what types of data 1015 00:48:49,750 --> 00:48:51,167 do you have available? 1016 00:48:54,370 --> 00:48:59,680 Unfortunately, it's very hard to get data on both the ground 1017 00:48:59,680 --> 00:49:02,210 truth, what's going on with the patient, 1018 00:49:02,210 --> 00:49:06,530 and whether these artifacts truly occurred or not. 1019 00:49:06,530 --> 00:49:09,530 Instead, what we actually have are just these observations. 1020 00:49:09,530 --> 00:49:13,450 We get these very noisy blood pressure draws across time. 1021 00:49:13,450 --> 00:49:16,500 So what this paper does is it uses a maximum likelihood 1022 00:49:16,500 --> 00:49:18,797 estimation approach, where it recognizes 1023 00:49:18,797 --> 00:49:20,880 that we're going to be learning from missing data. 1024 00:49:20,880 --> 00:49:23,940 We're going to explicitly think of these x's and the s's 1025 00:49:23,940 --> 00:49:25,875 as latent variables. 1026 00:49:25,875 --> 00:49:27,990 And we're going to maximize the likelihood 1027 00:49:27,990 --> 00:49:31,820 of the whole entire model, marginalizing over x and s. 1028 00:49:31,820 --> 00:49:34,485 So just maximizing the marginal likelihood over the y's. 1029 00:49:37,240 --> 00:49:39,740 Now, for those of you who have studied unsupervised learning 1030 00:49:39,740 --> 00:49:43,570 before, you might recognize that as a very hard learning 1031 00:49:43,570 --> 00:49:44,070 problem.
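Before getting into why that learning problem is hard, it may help to see the generative story written down concretely. Here is a minimal simulation sketch of a switching linear dynamical system in Python; all parameter values (the artifact-state transition probabilities, the drift, the noise scales) are made up for illustration and are not taken from the paper, and the single on/off artifact switch is a simplification of the paper's richer set of artifact types.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 200
# Discrete artifact state s_t: 0 = normal, 1 = artifact (e.g., probe disconnected).
# Illustrative Markov transition probabilities (not from the paper).
P_s = np.array([[0.97, 0.03],
                [0.20, 0.80]])

a, q = 1.0, 0.5             # x_t = a * x_{t-1} + noise: a slowly drifting random walk
r_good, r_bad = 0.2, 5.0    # observation noise scale when normal vs. during an artifact

s = np.zeros(T, dtype=int)
x = np.zeros(T)
y = np.zeros(T)
x[0] = 80.0                 # e.g., a "true" heart rate in beats per minute
y[0] = x[0] + rng.normal(0, r_good)

for t in range(1, T):
    s[t] = rng.choice(2, p=P_s[s[t - 1]])       # artifact state evolves as a Markov chain
    x[t] = a * x[t - 1] + rng.normal(0, q)      # true state: Gaussian random walk
    if s[t] == 0:
        y[t] = x[t] + rng.normal(0, r_good)     # good reading: y_t stays close to x_t
    else:
        y[t] = rng.normal(0, r_bad)             # artifact: reading unrelated to x_t

# Only y is observed at training time; EM has to maximize the marginal likelihood
# of y while treating x and s as latent variables to be inferred.
```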
1032 00:49:44,070 --> 00:49:47,780 In fact, it's-- that likelihood is non-convex. 1033 00:49:47,780 --> 00:49:51,990 And one could imagine all sorts of a heuristics for learning, 1034 00:49:51,990 --> 00:49:55,460 such as gradient descent, or, as this paper uses, 1035 00:49:55,460 --> 00:49:59,180 expectation maximization, and because of that non-convexity, 1036 00:49:59,180 --> 00:50:00,750 each of these algorithms typically 1037 00:50:00,750 --> 00:50:04,040 will only reach a local maxima of the likelihood. 1038 00:50:04,040 --> 00:50:08,420 So this paper uses EM, which intuitively iterates 1039 00:50:08,420 --> 00:50:14,420 between inferring those missing variables-- so imputing the x's 1040 00:50:14,420 --> 00:50:17,210 and the s's given the current model, 1041 00:50:17,210 --> 00:50:20,300 and doing posterior inference to infer the missing 1042 00:50:20,300 --> 00:50:22,760 variables given the observed variables, using 1043 00:50:22,760 --> 00:50:24,140 the current model. 1044 00:50:24,140 --> 00:50:27,020 And then, once you've imputed those variables, 1045 00:50:27,020 --> 00:50:28,910 attempting to refit the model. 1046 00:50:28,910 --> 00:50:30,920 So that's called the m-step for maximization, 1047 00:50:30,920 --> 00:50:32,900 which updates the model and just iterates between those two 1048 00:50:32,900 --> 00:50:33,400 things. 1049 00:50:33,400 --> 00:50:36,590 That's one learning algorithm which 1050 00:50:36,590 --> 00:50:39,650 is guaranteed to reach a local maxima of the likelihood 1051 00:50:39,650 --> 00:50:42,830 under some regularity assumptions. 1052 00:50:42,830 --> 00:50:44,690 And so this paper uses that algorithm, 1053 00:50:44,690 --> 00:50:46,520 but you need to be asking yourself, 1054 00:50:46,520 --> 00:50:50,270 if all you ever observe are the y's, 1055 00:50:50,270 --> 00:50:54,830 then will this algorithm ever recover anything 1056 00:50:54,830 --> 00:50:56,600 close to the true model? 1057 00:50:56,600 --> 00:50:58,310 For example, there might be large amounts 1058 00:50:58,310 --> 00:51:00,080 of non-identifiability here. 1059 00:51:00,080 --> 00:51:04,490 It could be that you could swap the meaning 1060 00:51:04,490 --> 00:51:10,170 of the s's, and you'd get a similar likelihood on the y's. 1061 00:51:10,170 --> 00:51:14,010 That's where bringing in domain knowledge becomes critical. 1062 00:51:14,010 --> 00:51:17,670 So this is going to be an example where we have no label 1063 00:51:17,670 --> 00:51:22,948 data or very little label data. 1064 00:51:22,948 --> 00:51:24,740 And we're going to do unsupervised learning 1065 00:51:24,740 --> 00:51:26,282 of this model, but we're going to use 1066 00:51:26,282 --> 00:51:28,790 a ton of domain knowledge in order to constrain 1067 00:51:28,790 --> 00:51:31,050 the model as much as possible. 1068 00:51:31,050 --> 00:51:33,490 So what is that domain knowledge? 1069 00:51:33,490 --> 00:51:37,730 Well, first we're going to use the fact 1070 00:51:37,730 --> 00:51:47,200 that we know that a true heart rate evolves in a fashion that 1071 00:51:47,200 --> 00:51:53,530 can be very well modeled by an autoregressive process. 1072 00:51:53,530 --> 00:51:56,260 So the autoregressive process that's used in this paper 1073 00:51:56,260 --> 00:51:58,630 is used to model the normal heart rate dynamics. 1074 00:51:58,630 --> 00:52:01,060 In a moment, I'll tell you how to model the abnormal heart 1075 00:52:01,060 --> 00:52:03,370 rate observations. 
1076 00:52:03,370 --> 00:52:05,530 And intuitively-- I'll first go over the intuition, 1077 00:52:05,530 --> 00:52:06,850 then I'll give you the math. 1078 00:52:06,850 --> 00:52:08,650 Intuitively what it does is it recognizes 1079 00:52:08,650 --> 00:52:14,060 that this complicated signal can be decomposed into two pieces. 1080 00:52:14,060 --> 00:52:18,020 The first piece shown here is called a baseline signal, 1081 00:52:18,020 --> 00:52:20,315 and that, if you squint your eyes 1082 00:52:20,315 --> 00:52:22,700 and you sort or ignore the very local fluctuations, 1083 00:52:22,700 --> 00:52:24,860 this is what you get out. 1084 00:52:24,860 --> 00:52:27,230 And then you can look at the residual 1085 00:52:27,230 --> 00:52:32,330 of subtracting this signal, subtracting this baseline 1086 00:52:32,330 --> 00:52:33,710 from the signal. 1087 00:52:33,710 --> 00:52:36,250 And what you get out looks like this. 1088 00:52:36,250 --> 00:52:39,770 Notice here it's around 0 mean. 1089 00:52:39,770 --> 00:52:42,585 So it's a 0 mean signal with some random fluctuations, 1090 00:52:42,585 --> 00:52:44,210 and the fluctuations are happening here 1091 00:52:44,210 --> 00:52:47,210 at a much faster rate than-- 1092 00:52:47,210 --> 00:52:49,830 and for the original baseline. 1093 00:52:49,830 --> 00:52:56,910 And so the sum of bt and this residual is a very-- 1094 00:52:56,910 --> 00:53:00,200 it looks-- is exactly equal to the true heart rate. 1095 00:53:00,200 --> 00:53:03,290 And each of these two things we can model very well. 1096 00:53:03,290 --> 00:53:08,210 This we can model by a random walk with-- 1097 00:53:08,210 --> 00:53:10,970 which goes very slowly, and this we 1098 00:53:10,970 --> 00:53:15,297 can model by a random walk which goes very quickly. 1099 00:53:15,297 --> 00:53:17,630 And that is exactly what I'm now going to show over here 1100 00:53:17,630 --> 00:53:19,180 on the left hand side. 1101 00:53:19,180 --> 00:53:22,880 bt, this baseline signal, we're going 1102 00:53:22,880 --> 00:53:26,540 to model as a Gaussian distribution, which 1103 00:53:26,540 --> 00:53:29,600 is parameterized as a function of not just bt minus 1, 1104 00:53:29,600 --> 00:53:32,480 but also bt minus 2, and bt minus 3. 1105 00:53:32,480 --> 00:53:34,940 And so we're going to be taking a weighted average 1106 00:53:34,940 --> 00:53:39,560 of the previous few time steps, where we're smoothing out, 1107 00:53:39,560 --> 00:53:45,220 in essence, the observation-- the previous few observations. 1108 00:53:45,220 --> 00:53:47,970 If you were to-- 1109 00:53:47,970 --> 00:53:50,310 if you're being a keen observer, you'll 1110 00:53:50,310 --> 00:53:53,790 notice that this is no longer a Markov model. 1111 00:54:04,870 --> 00:54:11,460 For example, if this p1 and p2 are equal to 2, 1112 00:54:11,460 --> 00:54:14,790 this then corresponds to a second order Markov model, 1113 00:54:14,790 --> 00:54:18,600 because each random variable depends on the previous two 1114 00:54:18,600 --> 00:54:24,530 time steps of the Markov chain. 1115 00:54:24,530 --> 00:54:31,790 And so after-- so you would model now bt by this process, 1116 00:54:31,790 --> 00:54:34,880 and you would probably be averaging 1117 00:54:34,880 --> 00:54:36,920 over a large number of previous time steps 1118 00:54:36,920 --> 00:54:39,020 to get this smooth property. 
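In symbols, a generic form for the baseline piece just described, with the weights and noise level left abstract, is the autoregressive (higher-order Markov) model

\[ b_t \;\sim\; \mathcal{N}\!\Big(\sum_{k=1}^{p_1} \beta_k\, b_{t-k},\; \sigma_b^2\Big), \]

and the fast-moving residual \(x_t - b_t\) gets an analogous autoregressive model with its own lag \(p_2\), as described next.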
1119 00:54:39,020 --> 00:54:45,620 And then you'd model xt minus bt by this autoregressive process, 1120 00:54:45,620 --> 00:54:47,780 where you might, for example, just 1121 00:54:47,780 --> 00:54:50,313 be looking at just the previous couple of time steps. 1122 00:54:50,313 --> 00:54:51,980 And you recognize that you're just doing 1123 00:54:51,980 --> 00:54:55,600 much more random fluctuations. 1124 00:54:55,600 --> 00:54:59,480 And then-- so that's how one would now model normal heart 1125 00:54:59,480 --> 00:55:00,650 rate dynamics. 1126 00:55:00,650 --> 00:55:02,900 And again, it's just-- 1127 00:55:02,900 --> 00:55:04,730 this is an example of a statistical model. 1128 00:55:04,730 --> 00:55:06,110 There is no mechanistic knowledge 1129 00:55:06,110 --> 00:55:08,540 of hearts being used here, but we 1130 00:55:08,540 --> 00:55:13,710 can fit the data of normal hearts pretty well using this. 1131 00:55:13,710 --> 00:55:15,960 But the next question and the most interesting one 1132 00:55:15,960 --> 00:55:20,510 is, how does one now model artifactual events? 1133 00:55:20,510 --> 00:55:26,120 So for that, that's where some mechanistic knowledge comes in. 1134 00:55:26,120 --> 00:55:30,180 So one models that the probe dropouts 1135 00:55:30,180 --> 00:55:35,120 are given by recognizing that, if a probe 1136 00:55:35,120 --> 00:55:39,020 is removed from the baby, then there should no longer be-- 1137 00:55:39,020 --> 00:55:41,253 or at least if you-- after a small amount of time, 1138 00:55:41,253 --> 00:55:42,920 there should no longer be any dependence 1139 00:55:42,920 --> 00:55:44,450 on the true value of the baby. 1140 00:55:44,450 --> 00:55:48,080 For example, the blood pressure, once the blood pressure probe 1141 00:55:48,080 --> 00:55:50,870 is removed, is no longer related to the baby's true blood 1142 00:55:50,870 --> 00:55:52,910 pressure. 1143 00:55:52,910 --> 00:55:57,130 But there might be some delay to that lack of dependence. 1144 00:55:57,130 --> 00:55:59,450 And so-- and that is going to be encoded in some domain 1145 00:55:59,450 --> 00:55:59,950 knowledge. 1146 00:55:59,950 --> 00:56:01,840 So for example, in the temperature probe, 1147 00:56:01,840 --> 00:56:04,480 when you remove the temperature probe from the baby, 1148 00:56:04,480 --> 00:56:07,682 it starts heating up again-- or it starts cooling, so 1149 00:56:07,682 --> 00:56:09,640 assuming that the ambient temperature is cooler 1150 00:56:09,640 --> 00:56:11,280 than the baby's temperature. 1151 00:56:11,280 --> 00:56:12,790 So you take it off the baby. 1152 00:56:12,790 --> 00:56:14,170 It starts cooling down. 1153 00:56:14,170 --> 00:56:15,692 How fast does it cool down? 1154 00:56:15,692 --> 00:56:17,400 Well, you could assume that it cools down 1155 00:56:17,400 --> 00:56:20,320 with some exponential decay from the baby's temperature. 1156 00:56:20,320 --> 00:56:22,750 And this is something that is very reasonable, 1157 00:56:22,750 --> 00:56:24,490 and you could imagine, maybe if you 1158 00:56:24,490 --> 00:56:26,530 had label data for just a few of the babies, 1159 00:56:26,530 --> 00:56:28,780 you could try to fit the parameters of the exponential 1160 00:56:28,780 --> 00:56:30,840 very quickly. 
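One natural way to write down that exponential decay, with the ambient temperature and the time constant as assumed symbols rather than anything taken from the paper, is: after the probe comes off at time \(t_0\),

\[ \mathbb{E}\big[y_t \mid \text{probe off at } t_0\big] \;=\; T_{\text{ambient}} + \big(x_{t_0} - T_{\text{ambient}}\big)\, e^{-(t - t_0)/\tau}, \]

where \(\tau\) is the decay constant one could fit from a handful of labeled traces.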
1161 00:56:30,840 --> 00:56:33,160 And in this way, now, we parameterize the conditional 1162 00:56:33,160 --> 00:56:39,040 distribution of the temperature probe, given both the state 1163 00:56:39,040 --> 00:56:42,220 and whether the artifact occurred or not, 1164 00:56:42,220 --> 00:56:45,710 using this very simple exponential decay. 1165 00:56:45,710 --> 00:56:49,957 And in this paper, they give a very similar type of-- 1166 00:56:49,957 --> 00:56:51,790 they make similar types of-- analogous types 1167 00:56:51,790 --> 00:56:54,588 of assumptions for all of the other artifactual probes. 1168 00:56:54,588 --> 00:56:56,380 You should think about this as constraining 1169 00:56:56,380 --> 00:56:58,757 these conditional distributions I showed you here. 1170 00:56:58,757 --> 00:57:01,090 They're no longer allowed to be arbitrary distributions, 1171 00:57:01,090 --> 00:57:03,910 and so that, when one does now expectation maximization 1172 00:57:03,910 --> 00:57:06,573 to try to maximize the marginal likelihood of the data, 1173 00:57:06,573 --> 00:57:07,990 you've now constrained it in a way 1174 00:57:07,990 --> 00:57:10,073 that hopefully moves you toward identifiability 1175 00:57:10,073 --> 00:57:11,310 of the learning problem. 1176 00:57:11,310 --> 00:57:13,330 It makes all of the difference in learning here. 1177 00:57:18,130 --> 00:57:21,730 So in this paper, their evaluation 1178 00:57:21,730 --> 00:57:23,830 did a little bit of fine tuning for each baby. 1179 00:57:23,830 --> 00:57:26,650 In particular, they assumed that the first 30 minutes 1180 00:57:26,650 --> 00:57:31,150 near the start consist of normal dynamics 1181 00:57:31,150 --> 00:57:33,190 so that there are no artifacts. 1182 00:57:33,190 --> 00:57:34,750 That's, of course, a big assumption, 1183 00:57:34,750 --> 00:57:39,100 but they use that to try to fine tune the dynamics model 1184 00:57:39,100 --> 00:57:43,540 separately for each baby. 1185 00:57:43,540 --> 00:57:45,070 And then they looked at the ability 1186 00:57:45,070 --> 00:57:47,357 to try to identify artifactual processes. 1187 00:57:47,357 --> 00:57:49,690 Now, I want to go a little bit slowly through this plot, 1188 00:57:49,690 --> 00:57:52,350 because it's quite interesting. 1189 00:57:52,350 --> 00:57:57,990 So what I'm showing you here is a ROC curve 1190 00:57:57,990 --> 00:58:00,292 of the ability to predict each of the four 1191 00:58:00,292 --> 00:58:01,500 different types of artifacts. 1192 00:58:01,500 --> 00:58:03,810 For example, at any one point in time, 1193 00:58:03,810 --> 00:58:05,990 was there a blood sample being taken or not? 1194 00:58:05,990 --> 00:58:07,890 At any one point in time, was there 1195 00:58:07,890 --> 00:58:12,270 a disconnect of the core temperature probe? 1196 00:58:12,270 --> 00:58:13,770 And to evaluate it, they're assuming 1197 00:58:13,770 --> 00:58:18,850 that they have some label data for evaluation purposes only. 1198 00:58:18,850 --> 00:58:22,110 And of course, you want to be at the very far top left corner 1199 00:58:22,110 --> 00:58:23,866 up here. 1200 00:58:23,866 --> 00:58:27,820 And what we're showing here are three different curves-- 1201 00:58:27,820 --> 00:58:31,120 the very faint dotted line, which 1202 00:58:31,120 --> 00:58:34,780 I'm going to trace out with my cursor, is the baseline. 1203 00:58:34,780 --> 00:58:39,068 Think of that as a much worse algorithm. 1204 00:58:41,640 --> 00:58:42,140 Sorry.
1205 00:58:42,140 --> 00:58:44,523 That's that line over there. 1206 00:58:44,523 --> 00:58:45,190 Everyone see it? 1207 00:58:49,030 --> 00:58:52,110 And this approach corresponds to the other two lines. 1208 00:58:52,110 --> 00:58:54,800 Now, what's differentiating those other two lines 1209 00:58:54,800 --> 00:58:57,940 corresponds to the particular type of approximate inference 1210 00:58:57,940 --> 00:59:00,120 algorithm that's used. 1211 00:59:00,120 --> 00:59:05,640 To do this posterior inference, to infer 1212 00:59:05,640 --> 00:59:10,290 the true value of the x's given your noisy observations 1213 00:59:10,290 --> 00:59:14,160 in the model given here is actually a very hard inference 1214 00:59:14,160 --> 00:59:15,920 problem. 1215 00:59:15,920 --> 00:59:18,330 Mathematically, I think one can show 1216 00:59:18,330 --> 00:59:21,692 that it's an NP-hard computational problem. 1217 00:59:21,692 --> 00:59:23,650 And so they have to approximate it in some way, 1218 00:59:23,650 --> 00:59:26,010 and they use two different approximations here. 1219 00:59:26,010 --> 00:59:28,400 The first approximation is based on what they're 1220 00:59:28,400 --> 00:59:31,110 calling a Gaussian sum approximation, 1221 00:59:31,110 --> 00:59:33,420 and it's a deterministic approximation. 1222 00:59:33,420 --> 00:59:37,240 The second approximation is based on a Monte Carlo method. 1223 00:59:37,240 --> 00:59:40,290 And what you see here is that the Gaussian sum approximation 1224 00:59:40,290 --> 00:59:41,970 is actually dramatically better. 1225 00:59:41,970 --> 00:59:43,920 So for example, in this blood sample one, 1226 00:59:43,920 --> 00:59:48,750 the ROC curve looks like this for the Gaussian sum 1227 00:59:48,750 --> 00:59:49,640 approximation. 1228 00:59:49,640 --> 00:59:51,390 Whereas for the Monte Carlo approximation, 1229 00:59:51,390 --> 00:59:54,510 it's actually significantly lower. 1230 00:59:54,510 --> 00:59:56,400 And this is just to point out that, even 1231 00:59:56,400 --> 01:00:03,660 in this setting, where we have very little data 1232 01:00:03,660 --> 01:00:06,780 and we're using a lot of domain knowledge, the actual details 1233 01:00:06,780 --> 01:00:09,053 of how one does the math-- in particular, 1234 01:00:09,053 --> 01:00:10,470 the approximate inference-- can make 1235 01:00:10,470 --> 01:00:13,047 a really big difference in the performance of this system. 1236 01:00:13,047 --> 01:00:14,880 And so it's something that one should really 1237 01:00:14,880 --> 01:00:16,047 think deeply about, as well. 1238 01:00:18,666 --> 01:00:21,700 I'm going to skip that slide, and then just mention 1239 01:00:21,700 --> 01:00:23,170 very briefly this one. 1240 01:00:23,170 --> 01:00:28,640 This is showing an inference of the events. 1241 01:00:28,640 --> 01:00:34,600 So here I'm showing you three different observations. 1242 01:00:34,600 --> 01:00:39,130 And on the bottom here, I'm showing the prediction 1243 01:00:39,130 --> 01:00:43,950 of when artifact-- two different artifactual events happened. 1244 01:00:43,950 --> 01:00:46,020 And these predictions were actually quite good, 1245 01:00:46,020 --> 01:00:48,180 using this model. 1246 01:00:48,180 --> 01:00:52,210 So I'm done with that first example, and-- 1247 01:00:52,210 --> 01:00:55,380 just to recap the important points 1248 01:00:55,380 --> 01:01:01,300 of that example, it was that we had almost no label data.
1249 01:01:01,300 --> 01:01:05,470 We're tackling this problem using a cleverly chosen 1250 01:01:05,470 --> 01:01:08,780 statistical model with some domain knowledge built in, 1251 01:01:08,780 --> 01:01:12,040 and that can go really far. 1252 01:01:12,040 --> 01:01:14,500 So now we'll shift gears to talk about a different type 1253 01:01:14,500 --> 01:01:18,340 of problem involving physiological data, 1254 01:01:18,340 --> 01:01:22,570 and that's of detecting atrial fibrillation. 1255 01:01:22,570 --> 01:01:26,280 So what I'm showing you here is an AliveCore device. 1256 01:01:26,280 --> 01:01:27,850 I own one of these. 1257 01:01:27,850 --> 01:01:30,540 So if you want to drop by my E25 545 office, 1258 01:01:30,540 --> 01:01:32,860 you can-- you can play around with it. 1259 01:01:32,860 --> 01:01:35,930 And if you attach it to your mobile phone, 1260 01:01:35,930 --> 01:01:43,800 it'll show you your electric conductance through your heart 1261 01:01:43,800 --> 01:01:46,710 as measured through your two fingers 1262 01:01:46,710 --> 01:01:48,670 touching this device shown over here. 1263 01:01:48,670 --> 01:01:51,270 And from that, one can try to detect whether the patient has 1264 01:01:51,270 --> 01:01:52,990 atrial fibrillation. 1265 01:01:52,990 --> 01:01:54,941 So what is atrial fibrillation? 1266 01:01:58,617 --> 01:01:59,200 Good question. 1267 01:01:59,200 --> 01:02:00,284 It's [INAUDIBLE]. 1268 01:02:04,240 --> 01:02:10,270 So this is from the American Heart Association. 1269 01:02:10,270 --> 01:02:13,810 They defined atrial fibrillation as a quivering or irregular 1270 01:02:13,810 --> 01:02:16,450 heartbeat, also known as arrhythmia. 1271 01:02:16,450 --> 01:02:18,220 And one of the big challenges is that it 1272 01:02:18,220 --> 01:02:21,030 could lead to blood clot, stroke, heart failure, and so 1273 01:02:21,030 --> 01:02:21,530 on. 1274 01:02:21,530 --> 01:02:23,980 So here is how a patient might describe 1275 01:02:23,980 --> 01:02:26,020 having atrial fibrillation. 1276 01:02:26,020 --> 01:02:28,180 My heart flip-flops, skips beats, 1277 01:02:28,180 --> 01:02:31,150 feels like it's banging against my chest wall, 1278 01:02:31,150 --> 01:02:33,790 particularly when I'm carrying stuff up my stairs 1279 01:02:33,790 --> 01:02:35,542 or bending down. 1280 01:02:35,542 --> 01:02:37,250 Now let's try to look at a picture of it. 1281 01:02:48,040 --> 01:02:55,330 So this is a normal heartbeat. 1282 01:02:55,330 --> 01:02:59,860 Hearts move-- pumping like this. 1283 01:02:59,860 --> 01:03:03,130 And if you were to look at the signal 1284 01:03:03,130 --> 01:03:04,810 output of the EKG of a normal heartbeat, 1285 01:03:04,810 --> 01:03:05,620 it would look like this. 1286 01:03:05,620 --> 01:03:07,735 And it's roughly corresponding to the different-- 1287 01:03:07,735 --> 01:03:09,840 the signal is corresponding to different cycles 1288 01:03:09,840 --> 01:03:12,420 of the heartbeat. 1289 01:03:12,420 --> 01:03:15,000 Now for a patient who has atrial fibrillation, 1290 01:03:15,000 --> 01:03:16,290 it looks more like this. 1291 01:03:21,650 --> 01:03:25,677 So much more obviously abnormal, at least in this figure. 1292 01:03:25,677 --> 01:03:27,510 And if you look at the corresponding signal, 1293 01:03:27,510 --> 01:03:29,382 it also looks very different. 1294 01:03:29,382 --> 01:03:31,590 So this is just to give you some intuition about what 1295 01:03:31,590 --> 01:03:33,577 I mean by atrial fibrillation. 
1296 01:03:36,990 --> 01:03:39,930 So what we're going to try to do now is to detect it. 1297 01:03:39,930 --> 01:03:44,090 So we're going to take data like that 1298 01:03:44,090 --> 01:03:48,580 and try to classify it into a number of different categories. 1299 01:03:48,580 --> 01:03:52,630 Now this is something which has been studied for decades, 1300 01:03:52,630 --> 01:03:57,430 and last year, 2017, there was a competition 1301 01:03:57,430 --> 01:04:01,450 run by Professor Roger Mark, who is here 1302 01:04:01,450 --> 01:04:04,390 at MIT, which is trying to see, well, how could-- 1303 01:04:04,390 --> 01:04:06,460 how good are we at trying to figure out 1304 01:04:06,460 --> 01:04:09,940 which patients have different types of heart rhythms 1305 01:04:09,940 --> 01:04:11,780 based on data that looks like this? 1306 01:04:11,780 --> 01:04:13,300 So this is a normal rhythm, which 1307 01:04:13,300 --> 01:04:16,700 is also called a sinus rhythm. 1308 01:04:16,700 --> 01:04:18,750 And over here it's atrial-- 1309 01:04:18,750 --> 01:04:22,120 this is an example one patient who has atrial fibrillation. 1310 01:04:22,120 --> 01:04:25,200 This is another type of rhythm that's not atrial fibrillation, 1311 01:04:25,200 --> 01:04:26,590 but is abnormal. 1312 01:04:26,590 --> 01:04:29,670 And this is a noisy recording-- for example, if a patient's-- 1313 01:04:29,670 --> 01:04:32,220 doesn't really have their two fingers very well put 1314 01:04:32,220 --> 01:04:35,180 on to the two leads of the device. 1315 01:04:35,180 --> 01:04:41,040 So given one of these categories, can we predict-- 1316 01:04:41,040 --> 01:04:42,760 one of these signals, could predict 1317 01:04:42,760 --> 01:04:45,355 which category it came from? 1318 01:04:45,355 --> 01:04:47,230 So if you looked at this, you might recognize 1319 01:04:47,230 --> 01:04:48,970 that they look a bit different. 1320 01:04:48,970 --> 01:04:53,380 So could some of you guess what might 1321 01:04:53,380 --> 01:04:55,780 be predictive features that differentiate 1322 01:04:55,780 --> 01:04:59,440 one of these signals from the other? 1323 01:04:59,440 --> 01:05:00,303 In the back? 1324 01:05:00,303 --> 01:05:01,720 AUDIENCE: The presence and absence 1325 01:05:01,720 --> 01:05:07,065 of one of the peaks the QRS complex are [INAUDIBLE].. 1326 01:05:07,065 --> 01:05:08,440 DAVID SONTAG: So speak in English 1327 01:05:08,440 --> 01:05:10,722 for people who don't know what these terms mean. 1328 01:05:10,722 --> 01:05:12,680 AUDIENCE: There is one large piece, which can-- 1329 01:05:12,680 --> 01:05:16,730 probably we can consider one mV and there is another peak, 1330 01:05:16,730 --> 01:05:18,520 which is sort of like-- 1331 01:05:18,520 --> 01:05:20,630 they have reverse polarity between normal rhythm 1332 01:05:20,630 --> 01:05:21,310 and [INAUDIBLE]. 1333 01:05:21,310 --> 01:05:22,102 DAVID SONTAG: Good. 1334 01:05:22,102 --> 01:05:23,820 So are you a cardiologist? 1335 01:05:23,820 --> 01:05:24,710 AUDIENCE: No. 1336 01:05:24,710 --> 01:05:26,440 DAVID SONTAG: No, OK. 1337 01:05:26,440 --> 01:05:29,050 So what the student suggested is one 1338 01:05:29,050 --> 01:05:31,660 could look for sort of these inversions 1339 01:05:31,660 --> 01:05:34,670 to try to describe it a little bit differently. 1340 01:05:34,670 --> 01:05:41,290 So here you're suggesting the lack of those inversions 1341 01:05:41,290 --> 01:05:45,430 is predictive of an abnormal rhythm. 1342 01:05:45,430 --> 01:05:47,655 What about another feature that could be predictive? 
1343 01:05:47,655 --> 01:05:48,155 Yep? 1344 01:05:48,155 --> 01:05:49,840 AUDIENCE: The spacing between the peaks 1345 01:05:49,840 --> 01:05:52,030 is more irregular with the AF. 1346 01:05:52,030 --> 01:05:53,740 DAVID SONTAG: The spacing between beats 1347 01:05:53,740 --> 01:05:56,853 is more irregular with the AF rhythm. 1348 01:05:56,853 --> 01:05:58,270 So you're sort of looking at this. 1349 01:05:58,270 --> 01:06:00,160 You see how here this spacing is very 1350 01:06:00,160 --> 01:06:01,538 different from this spacing. 1351 01:06:01,538 --> 01:06:03,580 Whereas in the normal rhythm, sort of the spacing 1352 01:06:03,580 --> 01:06:05,690 looks pretty darn regular. 1353 01:06:05,690 --> 01:06:07,060 All right, good. 1354 01:06:07,060 --> 01:06:11,050 So if I was to show you 40 examples of these 1355 01:06:11,050 --> 01:06:12,940 and then ask you to classify some new ones, 1356 01:06:12,940 --> 01:06:15,280 how well do you think you'll be able to do? 1357 01:06:15,280 --> 01:06:15,780 Pretty well? 1358 01:06:20,970 --> 01:06:23,550 I would be surprised if you couldn't do reasonably 1359 01:06:23,550 --> 01:06:26,250 well at least distinguishing between normal rhythm and AF 1360 01:06:26,250 --> 01:06:30,510 rhythm, because there seem to be some pretty clear signals here. 1361 01:06:30,510 --> 01:06:32,580 Of course, as you get into alternatives, 1362 01:06:32,580 --> 01:06:34,848 then the story gets much more complex. 1363 01:06:34,848 --> 01:06:36,390 But let me dig in a little bit deeper 1364 01:06:36,390 --> 01:06:37,980 into what I mean by this. 1365 01:06:37,980 --> 01:06:39,600 So let's define some of these terms. 1366 01:06:39,600 --> 01:06:44,430 Well, cardiologists have studied this for a really long time, 1367 01:06:44,430 --> 01:06:46,530 and they have-- so what I'm showing 1368 01:06:46,530 --> 01:06:49,380 you here is one heart cycle. 1369 01:06:49,380 --> 01:06:53,220 And they've-- you can put names to each of the peaks that you 1370 01:06:53,220 --> 01:06:55,860 would see in a regular heart cycle-- so that-- for example, 1371 01:06:55,860 --> 01:06:59,250 that very high peak is known as the R peak. 1372 01:06:59,250 --> 01:07:03,060 And you could look at, for example, the interval-- 1373 01:07:03,060 --> 01:07:06,720 so this is one beat. 1374 01:07:06,720 --> 01:07:10,320 You could look at the interval between the R peak of one beat 1375 01:07:10,320 --> 01:07:13,050 and the R peak of another peak, and define 1376 01:07:13,050 --> 01:07:15,440 that to be the RR interval. 1377 01:07:15,440 --> 01:07:18,050 In a similar way, one could take-- 1378 01:07:18,050 --> 01:07:21,060 one could find different distinctive elements 1379 01:07:21,060 --> 01:07:22,140 of the signal-- 1380 01:07:22,140 --> 01:07:23,032 by the way, each-- 1381 01:07:25,680 --> 01:07:28,110 each time step corresponds to the heart 1382 01:07:28,110 --> 01:07:30,410 being in a different position. 1383 01:07:30,410 --> 01:07:33,860 For a healthy heart, these are relatively deterministic. 1384 01:07:33,860 --> 01:07:36,330 And so you could look at other distances and derive 1385 01:07:36,330 --> 01:07:38,010 features from those distances, as well, 1386 01:07:38,010 --> 01:07:40,160 just like we were talking about, both within a beat 1387 01:07:40,160 --> 01:07:42,220 and across beats. 1388 01:07:42,220 --> 01:07:42,895 Yep? 1389 01:07:42,895 --> 01:07:44,312 AUDIENCE: So what's the difference 1390 01:07:44,312 --> 01:07:46,090 between a segment and an interval again? 
1391 01:07:48,333 --> 01:07:50,250 DAVID SONTAG: I don't know what the difference 1392 01:07:50,250 --> 01:07:51,420 between a segment and an interval is. 1393 01:07:51,420 --> 01:07:52,070 Does anyone else know? 1394 01:07:52,070 --> 01:07:54,070 I mean, I guess the interval is between probably 1395 01:07:54,070 --> 01:07:56,490 the heads of peaks, whereas segments might refer to 1396 01:07:56,490 --> 01:07:59,193 within an interval. 1397 01:07:59,193 --> 01:07:59,860 That's my guess. 1398 01:07:59,860 --> 01:08:00,902 Does someone know better? 1399 01:08:04,190 --> 01:08:05,630 For the purpose of today's class, 1400 01:08:05,630 --> 01:08:07,366 that's a good enough understanding. 1401 01:08:10,940 --> 01:08:14,060 The point is this is well understood. 1402 01:08:14,060 --> 01:08:16,093 One could derive features from this. 1403 01:08:16,093 --> 01:08:16,776 AUDIENCE: By us. 1404 01:08:16,776 --> 01:08:17,609 DAVID SONTAG: By us. 1405 01:08:20,180 --> 01:08:23,399 So what would a traditional approach be to this problem? 1406 01:08:23,399 --> 01:08:24,020 So this is-- 1407 01:08:24,020 --> 01:08:27,050 I'm pulling this figure from a paper from 2002. 1408 01:08:27,050 --> 01:08:30,200 What it'll do is it'll take in that signal. 1409 01:08:30,200 --> 01:08:32,960 It'll do some filtering of it. 1410 01:08:32,960 --> 01:08:35,750 Then it'll run a peak detection logic, which 1411 01:08:35,750 --> 01:08:38,840 will find these peaks, and then it'll 1412 01:08:38,840 --> 01:08:43,939 measure intervals between these peaks and within a beat. 1413 01:08:43,939 --> 01:08:48,069 And it'll take those computations 1414 01:08:48,069 --> 01:08:49,760 and make some decision based on them. 1415 01:08:49,760 --> 01:08:51,590 So that's a traditional algorithm, 1416 01:08:51,590 --> 01:08:54,310 and they work pretty reasonably. 1417 01:08:54,310 --> 01:08:56,560 And so what do I mean by signal processing? 1418 01:08:56,560 --> 01:08:58,790 Well, this is an example of that. 1419 01:08:58,790 --> 01:09:01,880 I encourage any of you to go home today and try to code up 1420 01:09:01,880 --> 01:09:03,140 a peak finding algorithm. 1421 01:09:03,140 --> 01:09:06,819 It's not that hard, at least not to get an OK one. 1422 01:09:06,819 --> 01:09:11,149 You might imagine keeping a running tab 1423 01:09:11,149 --> 01:09:13,811 of what's the highest signal you've seen so far. 1424 01:09:13,811 --> 01:09:16,019 Then you look to see what is the first time it drops, 1425 01:09:16,019 --> 01:09:18,394 and the second time-- and the next time it goes up larger 1426 01:09:18,394 --> 01:09:22,064 than, let's say, the previous-- 1427 01:09:22,064 --> 01:09:22,939 suppose that one of-- 1428 01:09:22,939 --> 01:09:26,689 you want to look for when the drop is-- the maximum value-- 1429 01:09:26,689 --> 01:09:28,790 recent maximum value divided by 2. 1430 01:09:28,790 --> 01:09:31,279 And then you-- then you reset. 1431 01:09:31,279 --> 01:09:33,800 And you can imagine in this way very quickly coding up 1432 01:09:33,800 --> 01:09:37,755 a peak finding algorithm. 1433 01:09:37,755 --> 01:09:39,380 And so this is just, again, to give you 1434 01:09:39,380 --> 01:09:43,130 some intuition behind what a traditional approach would be.
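Here is a rough Python sketch of the kind of peak-finding-plus-interval pipeline being described. The half-of-the-recent-maximum rule, the refractory period, and the synthetic signal are all illustrative choices, not a validated R-peak detector.

```python
import numpy as np

def find_r_peaks(signal, fs, refractory_s=0.25):
    """Very simple R-peak detector: track a running maximum, and commit a peak
    once the signal has dropped below half of that recent maximum, then reset.
    fs is the sampling rate in Hz; refractory_s suppresses double detections."""
    peaks = []
    running_max, running_argmax = -np.inf, 0
    last_peak = -np.inf
    for i, v in enumerate(signal):
        if v > running_max:
            running_max, running_argmax = v, i
        elif running_max > 0 and v < running_max / 2:
            if running_argmax - last_peak > refractory_s * fs:
                peaks.append(running_argmax)
                last_peak = running_argmax
            running_max, running_argmax = v, i   # reset and start looking again
    return np.array(peaks)

def rr_features(peaks, fs):
    """RR intervals in seconds, plus a crude irregularity feature
    (coefficient of variation of the RR intervals)."""
    rr = np.diff(peaks) / fs
    return rr, rr.std() / rr.mean()

# Toy usage on a spiky, roughly periodic stand-in for an EKG trace.
fs = 300
t = np.arange(0, 10, 1 / fs)
ekg = np.sin(2 * np.pi * 1.2 * t) ** 15
peaks = find_r_peaks(ekg, fs)
rr, irregularity = rr_features(peaks, fs)
print(len(peaks), rr.mean(), irregularity)
```

The irregularity feature at the end is in the spirit of the RR-interval-based detectors discussed next: a simple threshold on it already separates very regular rhythms from very irregular ones.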
1435 01:09:43,130 --> 01:09:46,790 And then you can very quickly see that that-- 1436 01:09:46,790 --> 01:09:49,729 once you start to look at some intervals between peaks, 1437 01:09:49,729 --> 01:09:52,880 that alone is often good enough for predicting 1438 01:09:52,880 --> 01:09:55,050 whether a patient has atrial fibrillation. 1439 01:09:55,050 --> 01:09:58,940 So this is a figure taken from paper in 2001 1440 01:09:58,940 --> 01:10:01,310 showing a single patient's time series. 1441 01:10:01,310 --> 01:10:04,940 So the x-axis is for that single patient, 1442 01:10:04,940 --> 01:10:07,250 their heart beats across time. 1443 01:10:07,250 --> 01:10:09,830 The y-axis is just showing the RR interval 1444 01:10:09,830 --> 01:10:14,300 between the previous beat and the current beat. 1445 01:10:14,300 --> 01:10:18,080 And down here in the bottom is the ground truth 1446 01:10:18,080 --> 01:10:20,990 of whether the patient is assessed to have-- 1447 01:10:20,990 --> 01:10:27,650 to be in-- to have a normal rhythm or atrial fibrillation, 1448 01:10:27,650 --> 01:10:30,630 which is noted as this higher value here. 1449 01:10:30,630 --> 01:10:33,830 So these are AF rhythms. 1450 01:10:33,830 --> 01:10:34,710 This is normal. 1451 01:10:34,710 --> 01:10:36,800 This is AF again. 1452 01:10:36,800 --> 01:10:40,670 And what you can see is that the RR interval actually 1453 01:10:40,670 --> 01:10:41,640 gets you pretty far. 1454 01:10:41,640 --> 01:10:44,210 You notice how it's pretty high up here. 1455 01:10:44,210 --> 01:10:46,130 Suddenly it drops. 1456 01:10:46,130 --> 01:10:47,930 The RR interval drops for a while, 1457 01:10:47,930 --> 01:10:50,450 and that's when the patient has AF. 1458 01:10:50,450 --> 01:10:51,860 Then it goes up again. 1459 01:10:51,860 --> 01:10:54,780 Then it drops again, and so on. 1460 01:10:54,780 --> 01:10:56,780 And so it's not deterministic, the relationship, 1461 01:10:56,780 --> 01:10:59,143 but there's definitely a lot of signal just from that. 1462 01:10:59,143 --> 01:11:00,560 So you might say, OK, well, what's 1463 01:11:00,560 --> 01:11:02,480 the next thing we could do to try to clean up the signal 1464 01:11:02,480 --> 01:11:03,230 a little bit more? 1465 01:11:03,230 --> 01:11:11,210 So flash backwards from 2001 to 1970 here at MIT, studied by-- 1466 01:11:11,210 --> 01:11:13,760 actually, no, this is not MIT. 1467 01:11:13,760 --> 01:11:16,070 This is somewhere else, sorry. 1468 01:11:16,070 --> 01:11:21,398 But still 1970-- where they used a Markov model very 1469 01:11:21,398 --> 01:11:23,690 similar to the Markov models we were just talking about 1470 01:11:23,690 --> 01:11:30,410 in the previous example to model what a sequence of normal RR 1471 01:11:30,410 --> 01:11:34,310 intervals looks like versus what a sequence of abnormal, 1472 01:11:34,310 --> 01:11:37,370 for example, AF RR intervals looks like. 1473 01:11:37,370 --> 01:11:39,590 And in that way, one can recognize 1474 01:11:39,590 --> 01:11:42,980 that, for any one observation of an RR interval 1475 01:11:42,980 --> 01:11:45,540 might not by itself be perfectly predictive, 1476 01:11:45,540 --> 01:11:47,480 but if you look at sort of a sequence of them 1477 01:11:47,480 --> 01:11:50,480 for a patient with atrial fibrillation, 1478 01:11:50,480 --> 01:11:53,420 there is some common pattern to it. 
1479 01:11:53,420 --> 01:11:56,090 And you can-- one can detect it by just looking at likelihood 1480 01:11:56,090 --> 01:11:59,450 of that sequence under each of these two different models, 1481 01:11:59,450 --> 01:12:01,230 normal and abnormal. 1482 01:12:01,230 --> 01:12:04,070 And that did pretty well-- even better than the previous 1483 01:12:04,070 --> 01:12:05,310 approaches for-- 1484 01:12:05,310 --> 01:12:08,370 for predicting atrial fibrillation. 1485 01:12:08,370 --> 01:12:11,790 This is the paper I wanted to say from MIT. 1486 01:12:11,790 --> 01:12:15,880 Now 1991, this is also from Roger Mark's group. 1487 01:12:15,880 --> 01:12:19,480 Now this is a neural network based approach, where it says, 1488 01:12:19,480 --> 01:12:22,108 OK, we're going to take a bunch of these things. 1489 01:12:22,108 --> 01:12:24,150 We're going to derive a bunch of these intervals, 1490 01:12:24,150 --> 01:12:25,890 and then we're going to throw that through a black box 1491 01:12:25,890 --> 01:12:27,432 supervised machine learning algorithm 1492 01:12:27,432 --> 01:12:30,240 to predict whether a patient has AF or not. 1493 01:12:30,240 --> 01:12:32,220 So these are very-- 1494 01:12:32,220 --> 01:12:34,890 first of all, there are some simple approaches here 1495 01:12:34,890 --> 01:12:36,540 that work reasonably well. 1496 01:12:36,540 --> 01:12:42,280 Using neural networks in this domain is not a new thing, 1497 01:12:42,280 --> 01:12:44,140 but where are we as a field? 1498 01:12:44,140 --> 01:12:46,920 So as I mentioned, there was this competition last year, 1499 01:12:46,920 --> 01:12:48,887 and what I'm showing you here-- the citation 1500 01:12:48,887 --> 01:12:50,470 is from one of the winning approaches. 1501 01:12:50,470 --> 01:12:52,845 And this winning approach really brings the two paradigms 1502 01:12:52,845 --> 01:12:53,910 together. 1503 01:12:53,910 --> 01:12:57,600 It extracts a large number of expert derived features-- 1504 01:12:57,600 --> 01:12:59,342 so shown here. 1505 01:12:59,342 --> 01:13:01,050 And these are exactly the types of things 1506 01:13:01,050 --> 01:13:06,390 you might think, like proportion, median RR 1507 01:13:06,390 --> 01:13:11,417 interval of regular rhythms, max RR irregularity measure. 1508 01:13:11,417 --> 01:13:13,500 And there's just a whole range of different things 1509 01:13:13,500 --> 01:13:16,160 that you can imagine manually deriving from the data. 1510 01:13:16,160 --> 01:13:17,910 And you throw all of these features 1511 01:13:17,910 --> 01:13:21,840 into a machine learning algorithm, 1512 01:13:21,840 --> 01:13:25,040 maybe a random forest, maybe a neural network, doesn't matter. 1513 01:13:25,040 --> 01:13:27,180 And what you get out is a slightly better algorithm 1514 01:13:27,180 --> 01:13:28,555 than what if you had just come up 1515 01:13:28,555 --> 01:13:30,510 with a simple rule on your own. 1516 01:13:30,510 --> 01:13:33,470 That was the winning algorithm then. 1517 01:13:33,470 --> 01:13:36,970 And in the summary paper, they conjectured that, well, maybe 1518 01:13:36,970 --> 01:13:39,357 it's the case that they were-- 1519 01:13:39,357 --> 01:13:41,440 they'd expected that convolutional neural networks 1520 01:13:41,440 --> 01:13:42,443 would win. 1521 01:13:42,443 --> 01:13:44,860 And they were surprised that none of the winning solutions 1522 01:13:44,860 --> 01:13:47,070 involved convolution neural networks. 
1523 01:13:47,070 --> 01:13:50,297 And as for why, they conjectured that maybe 1524 01:13:50,297 --> 01:13:52,630 the 8,000 patients that they had [INAUDIBLE] 1525 01:13:52,630 --> 01:13:56,590 just weren't enough to give the more complex models an advantage. 1526 01:13:56,590 --> 01:14:00,370 So flip forward now to this year and the article 1527 01:14:00,370 --> 01:14:05,840 that you read in your readings, in Nature Medicine, 1528 01:14:05,840 --> 01:14:07,420 where the Stanford group now showed 1529 01:14:07,420 --> 01:14:10,540 how a convolutional neural network approach, which 1530 01:14:10,540 --> 01:14:13,960 is, in many ways, extremely naive-- all it does 1531 01:14:13,960 --> 01:14:17,870 is take the sequence data in. 1532 01:14:17,870 --> 01:14:20,710 It makes no attempt at trying to understand the underlying 1533 01:14:20,710 --> 01:14:23,800 physiology, and just predicts from that-- 1534 01:14:23,800 --> 01:14:25,647 can do really, really well. 1535 01:14:25,647 --> 01:14:27,230 And so there are a couple of differences 1536 01:14:27,230 --> 01:14:29,590 that I want to emphasize relative to the previous work. 1537 01:14:29,590 --> 01:14:31,360 First, the sensor is different. 1538 01:14:31,360 --> 01:14:35,580 Whereas the previous work used the AliveCor sensor, 1539 01:14:35,580 --> 01:14:37,420 in this paper from Stanford, they're 1540 01:14:37,420 --> 01:14:40,870 using a different sensor called the Zio patch, which 1541 01:14:40,870 --> 01:14:44,110 is attached to the human body and conceivably 1542 01:14:44,110 --> 01:14:45,580 much less noisy. 1543 01:14:45,580 --> 01:14:47,560 So that's one big difference. 1544 01:14:47,560 --> 01:14:49,810 The second big difference is that there's dramatically 1545 01:14:49,810 --> 01:14:50,770 more data. 1546 01:14:50,770 --> 01:14:52,510 Instead of 8,000 patients to train from, 1547 01:14:52,510 --> 01:14:54,790 now they have over 90,000 records 1548 01:14:54,790 --> 01:14:58,060 from 50,000 different patients to train from. 1549 01:14:58,060 --> 01:14:59,740 The third major difference is that now, 1550 01:14:59,740 --> 01:15:02,740 rather than just trying to classify into four categories-- 1551 01:15:02,740 --> 01:15:06,723 normal, abnormal, other, or noisy-- 1552 01:15:06,723 --> 01:15:08,140 now we're going to try to classify 1553 01:15:08,140 --> 01:15:09,880 into 14 different categories. 1554 01:15:09,880 --> 01:15:12,850 We're, in essence, breaking apart that other class 1555 01:15:12,850 --> 01:15:15,610 into much finer-grained detail of different types 1556 01:15:15,610 --> 01:15:17,780 of abnormal rhythms. 1557 01:15:17,780 --> 01:15:20,110 And so here are some of those other abnormal rhythms, 1558 01:15:20,110 --> 01:15:28,140 things like complete heart block, 1559 01:15:28,140 --> 01:15:31,650 and a bunch of other names I can't pronounce. 1560 01:15:31,650 --> 01:15:34,472 And from each one of these, they gathered a lot of data. 1561 01:15:34,472 --> 01:15:35,430 And they actually did that-- 1562 01:15:35,430 --> 01:15:36,870 it's not described in the paper, 1563 01:15:36,870 --> 01:15:38,160 but I've talked to the authors, and 1564 01:15:38,160 --> 01:15:40,690 they gathered this data in a very interesting way. 1565 01:15:40,690 --> 01:15:42,720 So they did their training iteratively.
1566 01:15:42,720 --> 01:15:44,460 They looked to see where their errors were, 1567 01:15:44,460 --> 01:15:46,752 and then they went and gathered more data from patients 1568 01:15:46,752 --> 01:15:48,180 with that subcategory. 1569 01:15:48,180 --> 01:15:51,930 So many of these other categories 1570 01:15:51,930 --> 01:15:54,267 might be underrepresented 1571 01:15:54,267 --> 01:15:56,100 in the general population, but they actually 1572 01:15:56,100 --> 01:15:57,810 gathered a lot of patients of that type 1573 01:15:57,810 --> 01:16:00,520 in their data set for training purposes. 1574 01:16:00,520 --> 01:16:02,700 And so I think those three things ended up 1575 01:16:02,700 --> 01:16:05,320 making a very big difference. 1576 01:16:05,320 --> 01:16:07,050 So what is their convolutional network? 1577 01:16:07,050 --> 01:16:10,180 Well, first of all, the input is a 1-D signal. 1578 01:16:10,180 --> 01:16:12,180 So it's a little bit different from the convnets 1579 01:16:12,180 --> 01:16:13,380 you typically see in computer vision, 1580 01:16:13,380 --> 01:16:15,088 and I'll show you an illustration of that 1581 01:16:15,088 --> 01:16:16,080 in the next slide. 1582 01:16:16,080 --> 01:16:17,430 It's a very deep model. 1583 01:16:17,430 --> 01:16:20,100 So it's 34 layers. 1584 01:16:20,100 --> 01:16:23,010 So the input comes in at the very top in this picture. 1585 01:16:23,010 --> 01:16:26,730 It's passed through a number of layers. 1586 01:16:26,730 --> 01:16:30,210 Each layer consists of a convolution followed 1587 01:16:30,210 --> 01:16:33,600 by rectified linear units, and there is 1588 01:16:33,600 --> 01:16:35,790 subsampling at every other layer so that you 1589 01:16:35,790 --> 01:16:38,010 go from a very wide signal-- 1590 01:16:38,010 --> 01:16:39,645 so a very long signal-- 1591 01:16:39,645 --> 01:16:40,770 I can't remember exactly how long, 1592 01:16:40,770 --> 01:16:43,830 maybe 1 second long-- summarized down 1593 01:16:43,830 --> 01:16:47,165 into a much smaller number of dimensions, 1594 01:16:47,165 --> 01:16:49,290 which then goes into a fully connected layer 1595 01:16:49,290 --> 01:16:52,770 at the bottom to make your predictions. 1596 01:16:52,770 --> 01:16:55,590 And then they also have these shortcut connections, 1597 01:16:55,590 --> 01:16:58,770 which allow you to pass information from earlier layers 1598 01:16:58,770 --> 01:17:00,630 down to the very end of the network, 1599 01:17:00,630 --> 01:17:02,255 or even into intermediate layers. 1600 01:17:02,255 --> 01:17:04,380 And for those of you who are familiar with residual 1601 01:17:04,380 --> 01:17:06,850 networks, it's the same idea. 1602 01:17:06,850 --> 01:17:08,340 So what is a 1D convolution? 1603 01:17:08,340 --> 01:17:10,270 Well, it looks a little bit like this. 1604 01:17:10,270 --> 01:17:12,960 So this is the signal. 1605 01:17:12,960 --> 01:17:15,570 I'm going to just approximate it by a bunch of 1's and 0's. 1606 01:17:15,570 --> 01:17:16,560 I'll say this is a 1. 1607 01:17:16,560 --> 01:17:17,360 This is a 0. 1608 01:17:17,360 --> 01:17:18,480 This is a 1, a 1, and so on. 1609 01:17:21,620 --> 01:17:25,280 A convolutional network has a filter associated with it. 1610 01:17:25,280 --> 01:17:28,070 In a 1D model, that filter is applied 1611 01:17:28,070 --> 01:17:29,630 by sliding it along the signal-- 1612 01:17:29,630 --> 01:17:32,240 you just take a dot product of the filter's values 1613 01:17:32,240 --> 01:17:35,150 with the values of the signal at each point in time.
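As a minimal sketch of that sliding dot product (strictly speaking a cross-correlation, which is what most deep learning libraries call convolution), here is a "valid" 1-D convolution of a single filter with a signal. The example signal below is made up; only its first three values and the filter echo the numbers worked through on the slide.

```python
import numpy as np

# Sketch of a valid-mode 1-D convolution: slide the filter along the signal and
# take a dot product at each position. No padding, stride 1, a single filter.

def conv1d(signal, filt):
    signal = np.asarray(signal, dtype=float)
    filt = np.asarray(filt, dtype=float)
    n_out = len(signal) - len(filt) + 1
    return np.array([signal[i:i + len(filt)] @ filt for i in range(n_out)])

print(conv1d([1, 0, 1, 1, 0, 0, 1], [2, 3, 1]))
# First output position: 1*2 + 0*3 + 1*1 = 3, matching the worked example below.
```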
1614 01:17:35,150 --> 01:17:38,130 So it looks a little bit like this, 1615 01:17:38,130 --> 01:17:39,450 and this is what you get out. 1616 01:17:39,450 --> 01:17:42,330 So this is the convolution of a single filter 1617 01:17:42,330 --> 01:17:44,760 with the whole signal. 1618 01:17:44,760 --> 01:17:47,140 And the computation I did there-- so for example, 1619 01:17:47,140 --> 01:17:49,860 this first number came from taking the dot product 1620 01:17:49,860 --> 01:17:51,360 of the first three numbers-- 1621 01:17:51,360 --> 01:17:53,370 1, 0, 1-- with the filter. 1622 01:17:53,370 --> 01:18:01,548 So it's 1 times 2, plus 0 times 3, plus 1 times 1, which is 3. 1623 01:18:01,548 --> 01:18:03,090 And each of the subsequent numbers 1624 01:18:03,090 --> 01:18:04,900 was computed in the same way. 1625 01:18:04,900 --> 01:18:09,060 And usually I'd have you figure out what this last one is, 1626 01:18:09,060 --> 01:18:12,440 but I'll leave that for you to do at home. 1627 01:18:12,440 --> 01:18:14,097 And that's what a 1D convolution is. 1628 01:18:14,097 --> 01:18:16,680 And so they do this for lots of different filters. 1629 01:18:16,680 --> 01:18:19,155 Each of those filters might be of varying lengths, 1630 01:18:19,155 --> 01:18:21,030 and each of those will detect different types 1631 01:18:21,030 --> 01:18:23,040 of signal patterns. 1632 01:18:23,040 --> 01:18:25,800 And in this way, after having many layers of these, 1633 01:18:25,800 --> 01:18:28,320 one can, in an automatic fashion, 1634 01:18:28,320 --> 01:18:31,080 extract many of the same types of signals used in that earlier 1635 01:18:31,080 --> 01:18:32,997 work, but also be much more flexible and detect 1636 01:18:32,997 --> 01:18:34,420 some new ones as well. 1637 01:18:34,420 --> 01:18:37,120 Hold your question, because I need to wrap up. 1638 01:18:37,120 --> 01:18:38,710 So in the paper that you read, they 1639 01:18:38,710 --> 01:18:41,902 talked about how they evaluated this. 1640 01:18:41,902 --> 01:18:44,110 And so I'm not going to go into much depth on it now. 1641 01:18:44,110 --> 01:18:46,330 I just want to point out two different metrics 1642 01:18:46,330 --> 01:18:47,320 that they used. 1643 01:18:47,320 --> 01:18:48,910 So the first metric they used was 1644 01:18:48,910 --> 01:18:52,690 what they called a sequential error metric. 1645 01:18:52,690 --> 01:18:55,990 What that looks at is, you have this very long sequence 1646 01:18:55,990 --> 01:19:00,670 for each patient, and they labeled the different one- 1647 01:19:00,670 --> 01:19:02,350 second intervals of that sequence 1648 01:19:02,350 --> 01:19:05,690 as abnormal, normal, and so on. 1649 01:19:05,690 --> 01:19:07,113 So you could ask, how good are we 1650 01:19:07,113 --> 01:19:08,780 at labeling each of the different points 1651 01:19:08,780 --> 01:19:09,600 along the sequence? 1652 01:19:09,600 --> 01:19:11,720 And that's the sequence metric. 1653 01:19:11,720 --> 01:19:14,510 The second metric is the set metric, 1654 01:19:14,510 --> 01:19:16,520 and that looks at, if the patient has 1655 01:19:16,520 --> 01:19:19,730 something that's abnormal anywhere, did you detect it? 1656 01:19:19,730 --> 01:19:22,040 So that's, in essence, taking an OR over 1657 01:19:22,040 --> 01:19:23,510 each of those one-second intervals, 1658 01:19:23,510 --> 01:19:25,310 and then looking across patients.
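Here is a small sketch of how those two views might be computed, assuming that for each record you have a list of per-interval predicted rhythm labels and a matching list of ground-truth labels. The function names and the simple micro-averaged F1 are illustrative choices, not the paper's exact definitions.

```python
def sequence_f1(true_seqs, pred_seqs, positive_class):
    """Micro-averaged F1 over every labeled interval of every record (the 'sequence' view)."""
    tp = fp = fn = 0
    for truth, pred in zip(true_seqs, pred_seqs):
        for t, p in zip(truth, pred):
            tp += (p == positive_class and t == positive_class)
            fp += (p == positive_class and t != positive_class)
            fn += (p != positive_class and t == positive_class)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def set_labels(true_seqs, pred_seqs, positive_class):
    """The 'set' view: for each record, did the rhythm appear anywhere (an OR over intervals)?"""
    return [(positive_class in truth, positive_class in pred)
            for truth, pred in zip(true_seqs, pred_seqs)]
```

The record-level (truth, prediction) pairs from the set view can then be scored with the same precision/recall machinery to get a record-level F1.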
1659 01:19:25,310 --> 01:19:27,410 And from a clinical diagnostic perspective, 1660 01:19:27,410 --> 01:19:29,510 the set metric might be most useful, but 1661 01:19:29,510 --> 01:19:31,340 when you want to introspect and understand 1662 01:19:31,340 --> 01:19:34,370 where that is happening, then the sequence metric is 1663 01:19:34,370 --> 01:19:35,600 important. 1664 01:19:35,600 --> 01:19:38,300 And the key take-home message from the paper is that, 1665 01:19:38,300 --> 01:19:41,240 if you compare the model's predictions-- this is, I think, 1666 01:19:41,240 --> 01:19:44,990 using an F1 metric-- 1667 01:19:44,990 --> 01:19:49,790 to what you would get from a panel of cardiologists, 1668 01:19:49,790 --> 01:19:53,510 these models are doing as well as, if not better than, those panels 1669 01:19:53,510 --> 01:19:54,500 of cardiologists. 1670 01:19:54,500 --> 01:19:56,930 So this is extremely exciting. 1671 01:19:56,930 --> 01:19:58,700 This is technology-- or variants of this 1672 01:19:58,700 --> 01:20:02,240 are technology that you're going to see deployed now. 1673 01:20:02,240 --> 01:20:04,760 So for those of you who have purchased these Apple Watches, 1674 01:20:04,760 --> 01:20:07,220 these Samsung watches, I don't know exactly what they're 1675 01:20:07,220 --> 01:20:08,637 using, but I wouldn't be surprised 1676 01:20:08,637 --> 01:20:10,580 if they're using techniques similar to this. 1677 01:20:10,580 --> 01:20:12,390 And you're going to see much more of that in the future. 1678 01:20:12,390 --> 01:20:14,030 So this is really the first example 1679 01:20:14,030 --> 01:20:15,447 in this course so far of something 1680 01:20:15,447 --> 01:20:18,280 that's actually been deployed. 1681 01:20:18,280 --> 01:20:20,660 And so in summary, we're very often 1682 01:20:20,660 --> 01:20:22,450 in the realm of not having enough data. 1683 01:20:22,450 --> 01:20:24,860 And in this lecture today, we gave two examples 1684 01:20:24,860 --> 01:20:26,030 of how you can deal with that. 1685 01:20:26,030 --> 01:20:31,340 First, you can try to use mechanistic and statistical 1686 01:20:31,340 --> 01:20:38,150 models to work in settings where 1687 01:20:38,150 --> 01:20:39,590 you don't have much data. 1688 01:20:39,590 --> 01:20:42,333 And at the other extreme, you do have a lot of data, 1689 01:20:42,333 --> 01:20:44,000 and you can set that mechanistic modeling aside, and just 1690 01:20:44,000 --> 01:20:45,292 use these black-box approaches. 1691 01:20:45,292 --> 01:20:46,930 That's all for today.