DAVID SONTAG: So today's lecture is going to be about causality.

Who's heard about causality before? Raise your hand.

What's the number one thing that you hear about when thinking about causality? Yeah?

AUDIENCE: Correlation does not imply causation.

DAVID SONTAG: Correlation does not imply causation. Anything else come to mind? That's what came to my mind. Anything else come to mind?

So up until now in the semester, we've been talking about purely predictive questions. And for purely predictive questions, one could argue that correlation is good enough. If we have some signals in our data that are predictive of some outcome of interest, we want to be able to take advantage of that. Whether it's upstream or downstream, the causal directionality is irrelevant for that purpose. Although even that isn't quite true, right, because Pete and I have been hinting throughout the semester that there are times when the data changes on you, for example, when you go from one institution to another or when you have non-stationarity. And in those situations, having a deeper understanding of the data might allow one to build in additional robustness to that type of data set shift. But there are other reasons as well why understanding something about your underlying data generating processes can be really important. It's because often, the questions that we want to answer when it comes to health care are not predictive questions, they're causal questions. And so what I'll do now is walk through a few examples of what I mean by this.

Let's start out with what we saw in Lecture 4 and in Problem Set 2, where we looked at the question of how we can do early detection of type 2 diabetes. You used the Truven MarketScan data set to build a risk stratification algorithm for detecting who is going to be newly diagnosed with diabetes one to three years from now. And if you think about how one might then try to deploy that algorithm, you might, for example, try to get patients into the clinic to get them diagnosed. But the next set of questions is usually about the "so what" question. What are you going to do based on that prediction? Once diagnosed, how will you intervene? And at the end of the day, the interesting goal is not how do you find them early, but how do you prevent them from developing diabetes, or how do you prevent the patient from developing complications of diabetes? And those are questions about causality.

Now, when we built a predictive model and we introspected the weights, we might have noticed some interesting things. For example, if you looked at the highest negative weights, which I'm not sure we did as part of the assignment but is something that I did as part of my research study, you see that gastric bypass surgery has the biggest negative weight. Does that mean that if you give an obese person gastric bypass surgery, that will prevent them from developing type 2 diabetes? That's an example of a causal question which is raised by this predictive model. But just by looking at the weight alone, as I'll show you this week, you won't be able to correctly infer that there is a causal relationship. And so part of what we will be doing is coming up with a mathematical language for thinking about how one answers, is there a causal relationship here?

Here's a second example. Right before spring break we had a series of lectures about diagnosis, particularly diagnosis from imaging data of a variety of kinds, whether it be radiology or pathology. And often, questions are of this sort. Here is a woman's breast. She has breast cancer. Maybe you have an associated pathology slide as well. And you want to know, what is the risk of this person dying in the next five years? So one can take a deep learning model and learn to predict what one observes.
So for each patient in your data set, you have the input and you have, let's say, survival time. And you might use that to predict something about how long it takes from diagnosis to death. And based on those predictions, you might take actions. For example, if you predict that a patient is not risky, then you might conclude that they don't need to get treatment. But that could be really, really dangerous, and I'll just give you one example of why.

These predictive models, if you're learning them in this way, the outcome, in this case let's say time to death, is going to be affected by what happened in between. So, for example, this patient might have been receiving treatment, and receiving that treatment in between diagnosis and death might have prolonged their life. And so for this patient in your data set, you might have observed that they lived a very long time. But if you ignore what happens in between and you simply learn to predict y from X, X being the input, then when a new patient comes along and you predict that this new patient is going to survive a long time, it would be completely the wrong conclusion to say that you don't need to treat that patient. Because, in fact, the only reason the patients like them in the training data lived a long time is because they were treated. And so when it comes to this field of machine learning and health care, we need to think really carefully about these types of questions, because an error in the way that we formalize our problem could kill people through mistakes like this.

Now, other questions are not about how we predict outcomes but about how we guide treatment decisions. So, for example, as data from pathology gets richer and richer, we might think that we can now use computers to better predict who is likely to benefit from a treatment than humans could do alone. But the challenge with using algorithms to do that is that people respond differently to treatment, and the data being used to guide treatment is biased by existing treatment guidelines.

So, similarly to the previous example, we could ask, what would happen if we trained to predict past treatment decisions? This would be the most naive way to try to use data to guide treatment decisions. So maybe you see David gets treatment A, John gets treatment B, Juana gets treatment A. And you might then ask, OK, a new patient comes in, what should this new patient be treated with? And if you've just learned a model to predict, from what you know about David, which treatment he is likely to get, then the best that you could hope to do is to do as well as existing clinical practice. So if we want to go beyond current clinical practice, for example, to recognize that there is heterogeneity in treatment response, then we have to somehow change the question that we're asking.

I'll give you one last example, which is perhaps a more traditional question of, does X cause Y? For example, does smoking cause lung cancer is a major question of societal importance. Now, you might be familiar with the traditional way of trying to answer questions of this nature, which would be to do a randomized controlled trial. Except this isn't exactly the type of setting where you could do a randomized controlled trial. How would you feel if you were a smoker and someone came up to you and said, you have to stop smoking because I need to see what happens? Or how would you feel if you were a non-smoker and someone came up to you and said, you have to start smoking? That would be both infeasible and completely unethical. And so if we want to try to answer questions like this from data, we need to start thinking about how we can design, using observational data, ways of answering them.
And the challenge is that there's going to be bias in the data because of who decides to smoke and who decides not to smoke. So, for example, the most naive way you might try to answer this question would be to look at the conditional likelihood of getting lung cancer among smokers and of getting lung cancer among non-smokers. But those numbers, as you'll see in the next few slides, can be very misleading, because there might be confounding factors, factors that would, for example, both cause people to be smokers and cause them to develop lung cancer, and which would create a difference between those two numbers. And we'll have a very concrete example of this in just a few minutes.

So to properly answer all of these questions, one needs to be thinking in terms of causal graphs. So rather than the traditional setup in machine learning, where you just have inputs and outputs, now we need to have triplets. Rather than having inputs and outputs, we need to be thinking of inputs, interventions, and outcomes or outputs. So we now need to have three quantities in mind. And we have to start thinking about, well, what is the causal relationship between these three?

So for those of you who have taken more graduate-level machine learning classes, you might be familiar with ideas such as Bayesian networks. And when I went to undergrad and grad school and I studied machine learning, for the longest time I thought causal inference had to do with learning causal graphs. So this is what I thought causal inference was about. You have data of the following nature: 1, 0, 0, 1, dot, dot, dot. So here, there are four random variables. I'm showing the realizations of those four binary variables one per row, and you have a data set like this. And I thought causal inference had to do with taking data like this and trying to figure out: is the underlying Bayesian network that created that data X1 goes to X2 goes to X3 goes to X4? Or, I'll say, this is X1, that's X2, X3, and X4. Or maybe the causal graph is X1 to X2 to X3 to X4. And trying to distinguish between these different causal graphs from observational data is one type of question that one can ask.

And the one thing you learn in traditional machine learning treatments of this is that sometimes you can't distinguish between these causal graphs from the data you have. For example, suppose you just had two random variables. Because any distribution can be represented as the probability of X1 times the probability of X2 given X1, according to the chain rule of probability, and similarly, any distribution can be represented the opposite way, as the probability of X2 times the probability of X1 given X2, which would look like this, the statement one would make is that if you just had data involving X1 and X2, you couldn't distinguish between these two causal graphs, X1 causes X2 or X2 causes X1. And usually another treatment would say, OK, but if you have a third variable and you have a V structure, or something like X1 goes to X2 and X1 goes to X3, this you could distinguish from, let's say, a chain structure. And then the final answer to what causal inference is, from this philosophy, would be something like, OK, if you're in a setting like this and you can't distinguish between X1 causes X2 and X2 causes X1, then you do some interventions, like you intervene on X1 and you look to see what happens to X2, and that will help you disentangle these directions of causality.

None of this is what we're going to be talking about today. Today, we're going to be talking about the simplest, simplest possible setting you could imagine, that graph shown up there. You have three sets of random variables: X, which is perhaps a vector, so it's high dimensional; a single random variable T; and a single random variable Y. And we know the causal graph here. We're going to suppose that we know the directionality, that we know that X might cause T and that X and T might cause Y. And the only thing we don't know is the strength of the edges. All right.
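As a compact summary (my own sketch in standard probability notation, not something from the slides), the two-variable identity just mentioned, and the factorization implied by the X, T, Y graph used for the rest of the lecture, can be written as:

```latex
\begin{align*}
% Two variables: both factorizations describe the same distribution,
% so X1 -> X2 and X2 -> X1 cannot be told apart from observational data alone.
p(x_1, x_2) &= p(x_1)\, p(x_2 \mid x_1) \;=\; p(x_2)\, p(x_1 \mid x_2) \\
% Today's graph: X -> T, and (X, T) -> Y.
p(x, t, y)  &= p(x)\, p(t \mid x)\, p(y \mid x, t)
\end{align*}
```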
And so now let's try to think through this in the context of the previous examples. Yeah, question?

AUDIENCE: Just to make sure, so T does not affect X in any way?

DAVID SONTAG: Correct, that's the assumption we're going to make here. So let's try to instantiate this. We'll start with this example. X might be what you know about the patient at diagnosis. T, I'm going to assume for the purposes of today's class, is a decision between two different treatment plans. And I'm going to simplify the state of the world. I'm going to say those treatment plans only depend on what you know about the patient at diagnosis. So at diagnosis, you decide, I'm going to be giving them this sequence of treatments at a three-month interval, or this other sequence of treatments at, maybe, a four-month interval. And you make that decision just based on diagnosis, and you don't change it based on anything you observe.

Then the causal graph of relevance is this. Based on what you know about the patient at diagnosis, which I'm going to say is a vector X, because maybe it's based on images or the whole electronic health record, there's a ton of data you have on the patient at diagnosis. Based on that, you make some decision about a treatment plan. I'm going to call that T. T could be binary, a choice between two treatments; it could be continuous, maybe you're deciding the dosage of the treatment; or it could maybe even be a vector. For today's lecture, I'm going to suppose that T is just binary, that it involves just two choices. But most of what I'll tell you about will generalize to the setting where T is non-binary as well.

But critically, I'm going to make the assumption for today's lecture that you're not observing new things in between. So, for example, in this whole week's lectures, the following scenario will not happen. Based on diagnosis, you make a decision about a treatment plan. The treatment plan starts, you get new observations. Based on those new observations, you realize that treatment plan isn't working and change to another treatment plan, and so on. That scenario goes by a different name, dynamic treatment regimes or off-policy reinforcement learning, and we'll learn about that next week. So for today's and Thursday's lectures, we're going to suppose that, based on what you know about the patient at this time, you make a decision, you execute the decision, and you look at some outcome. So X causes T, not the other way around. And that's pretty clear because of our prior knowledge about this problem. It's not that the treatment affects what their diagnosis was. And then there's the outcome Y, and there, again, we suppose the outcome, what happens to the patient, maybe survival time, for example, is a function of what treatment they're getting and aspects of that patient. So this is the causal graph. We know it. But we don't know, does that treatment do anything to this patient? For whom does this treatment help the most? And those are the types of questions we're going to try to answer today.

Is the setting clear? OK.

Now, these questions are not new questions. They've been studied for decades in fields such as political science, economics, statistics, and biostatistics. And the reason why they're studied in those other fields is that often you don't have the ability to intervene, and one has to try to answer these questions from observational data. For example, you might ask, what will happen to the US economy if the Federal Reserve raises US interest rates by 1%? When's the last time you heard of the Federal Reserve doing a randomized controlled trial?
And even if they had done a randomized controlled trial, for example, flipped a coin to decide which way the interest rates would go, an experiment done today wouldn't be comparable to one done two years from now, because the state of the world has changed in those years.

Let's talk about political science. I have close colleagues at NYU who look at Twitter, and they want to ask questions like, how can we influence elections, or how are elections influenced? So you might look at some unnamed actors, possibly people supported by the Russian government, who are posting to Twitter or other social media. And you might ask the question, well, did that actually influence the outcome of the previous presidential election? Again, that scenario is one of, well, we have this data, something happened in the world, and we'd like to understand what the effect of that action was, but we can't exactly go back and replay to do something else.

So these are fundamental questions that appear all across the sciences, and of course they're extremely relevant in health care, and yet we don't teach them in our introduction to machine learning classes. We don't teach them in our undergraduate computer science education. And I view this as a major hole in our education, which is why we're spending two weeks on it in this course, which is still not enough.

But what has changed between these fields, and what is relevant in health care? Well, the traditional way in which these questions were approached in statistics was one where you took a huge amount of domain knowledge to, first of all, make sure you're setting up the problem correctly, and that's always going to be important, but then to think through all of the factors that could influence the treatment decisions, called the confounding factors. And the traditional approach is that one would write down 10 or 20 different things and make sure that you do some analysis, including the analyses I'll show you in today's and Thursday's lectures, using those 10 or 20 variables. But where this field is going is one of now having high dimensional data. So I talked about how you might have imaging data for X, or you might have the patient's whole electronic health record as data. And the traditional approaches that the statistics community used to work on no longer work in this high dimensional setting. And so, in fact, it's actually a really interesting area for research, one that my lab and many other labs are starting to work on, where we could ask, how can we bring machine learning algorithms that are designed to work with high dimensional data to answer these types of causal inference questions? And in today's lecture, you'll see one example of a reduction from causal inference to machine learning, where we'll be able to use machine learning to answer one of those causal inference questions.

So the first thing we need is some language in order to formalize these notions. I will work within what's known as the Rubin-Neyman Causal Model, where we talk about what are called potential outcomes. What would have happened under this world or that world? We'll call that Y0, and often it will be denoted as Y underscore 0, sometimes it'll be denoted as Y parentheses 0, and sometimes it'll be denoted as Y given X comma do T equals 0. And all three of these notations are equivalent. So Y0 corresponds to what would have happened to this individual if you gave them treatment zero. And Y1 is the potential outcome of what would have happened to this individual had you given them treatment one. So you could think about Y1 as what happens when you give the blue pill and Y0 as what happens when you give the red pill. Now, once you can talk about these states of the world, then one can start to ask questions of what's better, the red pill or the blue pill?
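To keep those three equivalent notations straight, here they are written out (a sketch; the do-notation form assumes the intervention is on the treatment variable T, as in the rest of the lecture):

```latex
% Potential outcome under treatment 0 (the "red pill"), three equivalent notations:
Y_0 \;\equiv\; Y(0) \;\equiv\; Y \mid X, \operatorname{do}(T = 0)
% Potential outcome under treatment 1 (the "blue pill"):
Y_1 \;\equiv\; Y(1) \;\equiv\; Y \mid X, \operatorname{do}(T = 1)
```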
And one can formalize that notion mathematically in terms of what's called the conditional average treatment effect, and this also goes by the name of individual treatment effect. So it's going to take as input Xi, which I'm going to denote as the data that you had at baseline for the individual. It's the covariates, the features for the individual. And one wants to know, well, for this individual, with what we know about them, what's the difference between giving them treatment one or giving them treatment zero? Mathematically, that corresponds to a difference in expectations. It's the difference between the expectation of Y1 and the expectation of Y0. Now, the reason why I'm calling this an expectation is that I'm not going to assume that Y1 and Y0 are deterministic, because maybe there's some bad luck component. Like, maybe a medication usually works for this type of person, but with a flip of a coin, sometimes it doesn't work. And so that's the randomness that I'm referring to when I talk about the probability of Y1 given Xi. And so the CATE looks at the difference in those two expectations. And then one can talk about the average treatment effect, which averages that difference. So the average treatment effect is the expectation of the CATE over the distribution of people, P of X. Now, we're going to go through this in four different ways in the next 10 minutes, and then you're going to go over it five more ways doing your homework assignment, and you'll go over it two more ways on Friday in recitation. So if you don't get it just yet, stay with me; you'll get it by the end of this week.

Now, in the data that you observe for an individual, all you see is what happened under one of the interventions. So, for example, if the i'th individual in your data set received treatment Ti equals 1, then what you observe, Yi, is the potential outcome Y1. On the other hand, if the individual in your data set received treatment Ti equals 0, then what you observed for that individual is the potential outcome Y0. So that's the observed, factual outcome. But one could also talk about the counterfactual of what would have happened to this person had the opposite treatment been given to them. Notice that I just swapped each Ti for 1 minus Ti, and so on. Now, the key challenge in the field is that in your data set, you only observe the factual outcomes. And when you want to reason about the counterfactual, that's where you have to impute this unobserved counterfactual outcome. And that is known as the fundamental problem of causal inference: we only observe one of the two outcomes for any individual in the data set.

So let's look at a very simple example. Here, individuals are characterized by just one feature, their age. And these two curves that I'm showing you are the potential outcomes of what would happen to this individual's blood pressure if you gave them treatment zero, which is the blue curve, versus treatment one, which is the red curve. All right. So let's dig in a little bit deeper. For the blue curve, we see that for people who received the control, what I'm calling treatment zero, their blood pressure was pretty low for individuals whose age is low and for individuals whose age is high. But for middle-aged individuals, their blood pressure on receiving treatment zero is in the higher range. On the other hand, for individuals who receive treatment one, it's the red curve. So young people have much higher, let's say, blood pressure under treatment one, and, similarly, much older people.

So then one could ask, well, what about the difference between these two potential outcomes? That is to say, the CATE, the Conditional Average Treatment Effect, is simply looking at the distance between the blue curve and the red curve for that individual. So for someone with a specific age, let's say a young person or a very old person, there's a very big difference between giving treatment zero or giving treatment one. Whereas for a middle-aged person, there's very little difference. So, for example, if treatment one were significantly cheaper than treatment zero, then you might say, we'll give treatment one. Even though it's not quite as good as treatment zero, it's so much cheaper and the difference between them is so small that we'll give the other one. But in order to make that type of policy decision, one, of course, has to understand the conditional average treatment effect for that individual, and that's something that we're going to want to predict using data.

Now, we don't always get the luxury of having personalized treatment recommendations. Sometimes we have to give a policy. Like, for example-- I took this example out of my slides, but I'll give it to you anyway. The federal government might come out with a guideline saying that all men over the age of 50-- I'm making up that number-- need to get annual prostate cancer screening. That's an example of a very broad policy decision. You might ask, well, what is the effect of that policy, now applied over the full population, on, let's say, decreasing deaths due to prostate cancer? And that would be an example of asking about the average treatment effect. So if you were to average the red line and average the blue line, you get those two dotted lines I show there. And if you look at the difference between them, that is the average treatment effect between giving the red intervention or giving the blue intervention. And if the average treatment effect is very positive, you might say that, on average, this intervention is a good intervention. If it's very negative, you might say the opposite.
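Written out in the potential-outcomes notation from above, the quantities just defined are as follows (a sketch consistent with the lecture's definitions; the "obs" and "cf" superscripts are my own shorthand for the factual and counterfactual outcomes of individual i):

```latex
\begin{align*}
% Conditional average treatment effect for an individual with covariates x_i
\mathrm{CATE}(x_i) &= \mathbb{E}\left[ Y_1 \mid x_i \right] - \mathbb{E}\left[ Y_0 \mid x_i \right] \\
% Average treatment effect: the CATE averaged over the population p(x)
\mathrm{ATE} &= \mathbb{E}_{x \sim p(x)}\left[ \mathrm{CATE}(x) \right] \;=\; \mathbb{E}\left[ Y_1 - Y_0 \right] \\
% Factual (observed) outcome and unobserved counterfactual for individual i
Y_i^{\mathrm{obs}} &= T_i\, Y_{1,i} + (1 - T_i)\, Y_{0,i}, \qquad
Y_i^{\mathrm{cf}} \;=\; (1 - T_i)\, Y_{1,i} + T_i\, Y_{0,i}
\end{align*}
```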
Now, the challenge about doing causal inference from observational data is that, of course, we don't observe those red and those blue curves, rather what we observe are data points that might be distributed all over the place. Like, for example, in this example, the blue treatment happens to be given in the data more to young people, and the red treatment happens to be given in the data more to older people. And that can happen for a variety of reasons. It can happen due to access to medication. It can happen for socioeconomic reasons. It could happen because existing treatment guidelines say that old people should receive treatment one and young people should receive treatment zero. These are all reasons why in your data who receives what treatment could be biased in some way. And that's exactly what this edge from X to T is modeling.

But for each of those people, you might want to know, well, what would have happened if they had gotten the other treatment? And that's asking about the counterfactual. So these dotted circles are the counterfactuals for each of those observations. And by the way, you'll notice that those dots are not on the curves, and the reason they're not on the curves is because I'm trying to point out that there could be some stochasticity in the outcome. So the dotted lines are the expected potential outcomes and the circles are the realizations of them.

All right. Everyone take out a calculator or your computer or your phone, and I'll take out mine. This is not an opportunity to go on Facebook, just to be clear. All you want is a calculator. My phone doesn't-- oh, OK, it has a calculator. Good. All right. So we're going to do a little exercise.

Here's a data set on the left-hand side. Each row is an individual. We're observing the individual's age, gender, whether they exercise regularly, which I'll say is a one or a zero, and what treatment they got, which is A or B.
On the far right-hand side are their observed glucose (sugar) levels, let's say, at the end of the year.

Now, what we'd like to have looks like this. We'd like to know what would have happened to this person's sugar levels had they received medication A or had they received medication B. But if you look at the previous slide, we observed for each individual that they got either A or B. And so we're only going to know one of these columns for each individual. So in the first row, for example, this individual received treatment A, and so you'll see that I've taken the observed sugar level for that individual, and since they received treatment A, that observed level represents the potential outcome Ya, or Y0. And that's why I have a 6, which is bolded, under Y0. And we don't know what would have happened to that individual had they received treatment B. So in this case, some magical creature came to me and told me their sugar level would have been 5.5, but we don't actually know that. It wasn't in the data.

Let's look at the next line just to make sure we get what I'm saying. So the second individual actually received treatment B. Their observed sugar level is 6.5.

OK. Let's do a little survey. That 6.5 number, should it be in this column? Raise your hand. Or should it be in this column? Raise your hand. All right. About half of you got that right. Indeed, it goes in the second column. And again, what we would like to know is the counterfactual. What would their sugar level have been had they received medication A? We don't actually observe that in our data, but suppose that someone told me it was 7; then you would see that value filled in there. That's the unobserved counterfactual.

All right. First of all, is the setup clear? All right. Now here's when you use your calculators.
So we're going to now demonstrate the difference between a naive estimator of your average treatment effect and the true average treatment effect. So what I want you to do right now is to compute, first, what is the average sugar level of the individuals who got medication B. So for that, we're only going to be using the red ones. So this is conditioning on receiving medication B. And so this is equivalent to going back to this one and saying, we're only going to take the rows where individuals receive medication B, and we're going to average their observed sugar levels. And everyone should do that. What's the first number?

6.5 plus-- I'm getting 7.875. This is for the average sugar, given that they received medication B. Is that what other people are getting?

AUDIENCE: Yeah.

DAVID SONTAG: OK. What about for the second number? Average sugar, given A? I want you to compute it. And I'm going to ask everyone to say it out loud in literally one minute. And if you get it wrong, of course you're going to be embarrassed. I'm going to try myself.

OK. On the count of three, I want everyone to read out what that number is. One, two, three.

ALL: 7.125.

DAVID SONTAG: All right. Good. We can all do arithmetic. All right. Good. So, again, we're just looking at the red numbers here, just the red numbers. So we just computed that difference, which is point what?

AUDIENCE: 0.75.

DAVID SONTAG: 0.75? Yeah, that looks about right. Good. All right. So that's a positive number. Now let's do something different. Now let's compute the actual average treatment effect, which is we're now going to average every number in this column, and we're going to average every number in this column.
So this is the average sugar level under the potential outcome of the individual having received treatment B, and this is the average sugar level under the potential outcome of the individual having received treatment A. All right. Who's doing it?

AUDIENCE: 0.75.

DAVID SONTAG: 0.75 is what?

AUDIENCE: The difference.

DAVID SONTAG: How do you know?

AUDIENCE: [INAUDIBLE]

DAVID SONTAG: Wow, you're fast. OK. Let's see if you're right. I actually don't know. OK. The first one is 0.75. Good, we got that right. I intentionally didn't post the slides to today's lecture. And the second one is minus 0.75. All right.

So now let's put ourselves in the shoes of a policymaker. The policymaker has to decide, is it a good idea to-- or let's say it's a health insurance company. A health insurance company is trying to decide, should I reimburse for treatment B or not? Or should I simply say, no, I'm never going to reimburse for that treatment because it doesn't work well? So if they had used the naive estimator, that would have been the first computation, then it would look like medication B is-- we want lower numbers here, so it would look like medication B is worse than medication A. And if you properly estimate what the actual average treatment effect is, you get the absolute opposite conclusion. You conclude that medication B is much better than medication A. It's just a simple example to really illustrate the difference between conditioning and actually computing that counterfactual.

OK. So hopefully now you're starting to get it. And again, you're going to have many more opportunities to work through these things in your homework assignment and so on.

So by now you should be starting to wonder, how the hell could I do anything in this state of the world? Because you don't actually observe those black numbers. These are all unobserved.
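To make the arithmetic in that exercise concrete, here is a minimal Python sketch. The potential-outcome values below are hypothetical placeholders, not the table from the slides; they are chosen only so that, as in the lecture, the naive conditional-mean difference comes out to +0.75 while the true average treatment effect is -0.75. It also plays the role of the "magical creature": both columns are filled in, even though in real data only the factual one is ever observed.

```python
import numpy as np

# Hypothetical cohort of 8 individuals (NOT the slide's table).
# y_a[i] and y_b[i] are the two potential sugar levels for person i.
y_a = np.array([6.0, 7.0, 7.5, 8.0, 7.25, 8.75, 9.0, 9.5])      # outcome under medication A
y_b = np.array([5.25, 6.25, 6.75, 7.25, 6.5, 8.0, 8.25, 8.75])  # outcome under medication B

# Treatment assignment is confounded: the sicker half (higher sugar) tends to get B.
t = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = received A, 1 = received B

# What we actually observe in the data: the factual outcome only.
y_obs = np.where(t == 1, y_b, y_a)

# Naive estimator: difference of conditional means between the two observed groups.
naive = y_obs[t == 1].mean() - y_obs[t == 0].mean()

# True ATE: average over BOTH columns, which requires the unobserved counterfactuals.
ate = y_b.mean() - y_a.mean()

print(f"naive difference E[Y | T=B] - E[Y | T=A] = {naive:+.3f}")  # +0.750 (B looks worse)
print(f"true ATE         E[Y_B] - E[Y_A]        = {ate:+.3f}")     # -0.750 (B is actually better)
```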
763 00:39:02,540 --> 00:39:05,600 And clearly there is bias in what 764 00:39:05,600 --> 00:39:07,100 the values should be because of what 765 00:39:07,100 --> 00:39:08,790 I've been saying all along. 766 00:39:08,790 --> 00:39:11,163 So what can we do? 767 00:39:11,163 --> 00:39:12,830 Well, the first thing we have to realize 768 00:39:12,830 --> 00:39:15,247 is that typically, this is an impossible problem to solve. 769 00:39:15,247 --> 00:39:18,920 So your instincts aren't wrong, and we're 770 00:39:18,920 --> 00:39:20,740 going to have to make a ton of assumptions 771 00:39:20,740 --> 00:39:23,950 in order to do anything here. 772 00:39:23,950 --> 00:39:26,430 So the first assumption is called SUTVA. 773 00:39:26,430 --> 00:39:27,930 I'm not even going to talk about it. 774 00:39:27,930 --> 00:39:29,725 You can read about that in your readings. 775 00:39:29,725 --> 00:39:31,350 I'll tell you about the two assumptions 776 00:39:31,350 --> 00:39:34,890 that are a little bit easier to describe. 777 00:39:34,890 --> 00:39:37,920 The first critical assumption is that there are no unobserved 778 00:39:37,920 --> 00:39:39,840 confounding factors. 779 00:39:39,840 --> 00:39:41,370 Mathematically what that's saying 780 00:39:41,370 --> 00:39:44,610 is that your potential outcomes, Y0 and Y1, 781 00:39:44,610 --> 00:39:47,340 are conditionally independent of the treatment decision given 782 00:39:47,340 --> 00:39:52,780 what you observe on the individual, X. 783 00:39:52,780 --> 00:39:55,900 Now, this could be a bit hard to-- 784 00:39:55,900 --> 00:39:57,300 and that's called ignorability. 785 00:39:57,300 --> 00:39:59,008 And this can be a bit hard to understand, 786 00:39:59,008 --> 00:40:01,950 so let me draw a picture. 787 00:40:01,950 --> 00:40:04,200 So X is your covariates, T is your treatment decision. 788 00:40:04,200 --> 00:40:06,600 And now I've drawn for you a slightly different graph. 789 00:40:06,600 --> 00:40:10,860 Over here I said X goes to T, X and T go to Y. 790 00:40:10,860 --> 00:40:14,610 But now I don't have Y. Instead, I have Y0 and Y1, 791 00:40:14,610 --> 00:40:16,662 and I don't have any edge from T to them. 792 00:40:16,662 --> 00:40:18,120 And that's because now I'm actually 793 00:40:18,120 --> 00:40:20,850 using the potential outcomes notation. 794 00:40:20,850 --> 00:40:22,225 Y0 is a potential outcome of what 795 00:40:22,225 --> 00:40:24,558 would have happened to this individual had they received 796 00:40:24,558 --> 00:40:26,040 treatment 0, and Y1 is what would 797 00:40:26,040 --> 00:40:28,950 have happened to this individual if they received treatment one. 798 00:40:28,950 --> 00:40:31,395 And because you already know what treatment the individual 799 00:40:31,395 --> 00:40:32,853 has received, it doesn't make sense 800 00:40:32,853 --> 00:40:35,560 to talk about an edge from T to those values. 801 00:40:35,560 --> 00:40:37,150 That's why there's no edge there. 802 00:40:37,150 --> 00:40:39,150 So then you might wonder, how could you possibly 803 00:40:39,150 --> 00:40:41,192 have a violation of this conditional independence 804 00:40:41,192 --> 00:40:42,180 assumption? 805 00:40:42,180 --> 00:40:43,680 Well, before I give you that answer, 806 00:40:43,680 --> 00:40:45,970 let me put some names to these things. 807 00:40:45,970 --> 00:40:48,870 So we might think about X as being the age, gender, weight, 808 00:40:48,870 --> 00:40:50,850 diet, and so on of the individual.
809 00:40:50,850 --> 00:40:54,300 T might be a medication, like an anti-hypertensive medication 810 00:40:54,300 --> 00:40:56,820 to try to lower a patient's blood pressure. 811 00:40:56,820 --> 00:40:58,770 And these would be the potential outcomes 812 00:40:58,770 --> 00:41:00,990 after those two medications. 813 00:41:00,990 --> 00:41:04,470 So an example of a violation of ignorability 814 00:41:04,470 --> 00:41:10,970 is if there is something else, some hidden variable h, which 815 00:41:10,970 --> 00:41:13,490 is not observed and which affects 816 00:41:13,490 --> 00:41:15,470 both the decision of what treatment 817 00:41:15,470 --> 00:41:17,750 the individual in your data set receives 818 00:41:17,750 --> 00:41:20,545 and the potential outcomes. 819 00:41:20,545 --> 00:41:22,170 Now it should be really clear that this 820 00:41:22,170 --> 00:41:24,378 would be a violation of that conditional independence 821 00:41:24,378 --> 00:41:25,100 assumption. 822 00:41:25,100 --> 00:41:29,010 In this graph, Y0 and Y1 are not conditionally 823 00:41:29,010 --> 00:41:32,760 independent of T given X. All right. 824 00:41:32,760 --> 00:41:34,800 So what are these hidden confounders? 825 00:41:34,800 --> 00:41:37,710 Well, they might be things, for example, which really 826 00:41:37,710 --> 00:41:40,020 affect treatment decisions. 827 00:41:40,020 --> 00:41:42,420 So maybe there's a treatment guideline 828 00:41:42,420 --> 00:41:44,400 saying that for diabetic patients, 829 00:41:44,400 --> 00:41:47,700 they should receive treatment zero, that that's 830 00:41:47,700 --> 00:41:50,350 the right thing to do. 831 00:41:50,350 --> 00:41:54,270 And so a violation of this would be 832 00:41:54,270 --> 00:41:56,700 if the fact that the patient's diabetic 833 00:41:56,700 --> 00:41:59,950 were not recorded in the electronic health record. 834 00:41:59,950 --> 00:42:01,660 So you don't know-- 835 00:42:01,660 --> 00:42:02,540 that's not up there. 836 00:42:02,540 --> 00:42:05,610 You don't know that, in fact, the reason 837 00:42:05,610 --> 00:42:08,700 the patient received treatment T was because of this h factor. 838 00:42:08,700 --> 00:42:10,450 And there's critically another assumption, 839 00:42:10,450 --> 00:42:12,540 which is that h actually affects the outcome, 840 00:42:12,540 --> 00:42:15,435 which is why you have these edges from h to the Y's. 841 00:42:15,435 --> 00:42:17,310 If h were something which might have affected 842 00:42:17,310 --> 00:42:21,620 treatment decision but not the actual potential outcomes-- 843 00:42:21,620 --> 00:42:23,530 and that can happen, of course. 844 00:42:23,530 --> 00:42:26,880 Things like gender can often affect treatment decisions, 845 00:42:26,880 --> 00:42:32,570 but maybe, for some diseases, it might not affect outcomes. 846 00:42:32,570 --> 00:42:36,270 In that situation it wouldn't be a confounding factor 847 00:42:36,270 --> 00:42:38,540 because it doesn't violate this assumption. 848 00:42:38,540 --> 00:42:40,290 And, in fact, one would be able to come up 849 00:42:40,290 --> 00:42:42,960 with consistent estimators of average treatment effect 850 00:42:42,960 --> 00:42:44,130 under that assumption. 851 00:42:44,130 --> 00:42:47,656 Where things go to hell is when you have both of those edges. 852 00:42:47,656 --> 00:42:49,950 All right. 853 00:42:49,950 --> 00:42:51,790 So there can't be any of these h's. 854 00:42:51,790 --> 00:42:53,540 You have to observe all things that affect 855 00:42:53,540 --> 00:42:55,055 both treatment and outcomes. 
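In symbols, the ignorability assumption just described is usually written as follows (a reconstruction in standard potential-outcomes notation, since the board itself is not reproduced in this transcript):

$$ (Y_0, Y_1) \;\perp\!\!\!\perp\; T \;\mid\; X . $$

The hidden-confounder picture violates it: once an unobserved h affects both T and the potential outcomes, conditioning on X alone no longer makes T independent of (Y_0, Y_1).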
856 00:42:57,848 --> 00:42:59,390 The second big assumption-- oh, yeah. 857 00:42:59,390 --> 00:42:59,930 Question? 858 00:42:59,930 --> 00:43:02,098 AUDIENCE: In practice, how good of a model is this? 859 00:43:02,098 --> 00:43:03,890 DAVID SONTAG: Of what I'm showing you here? 860 00:43:03,890 --> 00:43:04,610 AUDIENCE: Yeah. 861 00:43:04,610 --> 00:43:06,015 DAVID SONTAG: For hypertension? 862 00:43:06,015 --> 00:43:06,640 AUDIENCE: Sure. 863 00:43:06,640 --> 00:43:07,848 DAVID SONTAG: I have no idea. 864 00:43:10,248 --> 00:43:11,790 But I think what you're really trying 865 00:43:11,790 --> 00:43:14,248 to get at here in asking your question, how good of a model 866 00:43:14,248 --> 00:43:17,540 is this, is, well, oh, my god, how do I know 867 00:43:17,540 --> 00:43:19,200 if I've observed everything? 868 00:43:19,200 --> 00:43:20,100 Right? 869 00:43:20,100 --> 00:43:20,600 All right. 870 00:43:20,600 --> 00:43:22,017 And that's where you need to start 871 00:43:22,017 --> 00:43:24,100 talking to domain experts. 872 00:43:24,100 --> 00:43:27,700 So this is my starting place where 873 00:43:27,700 --> 00:43:31,450 I said, no, I'm not going to attempt 874 00:43:31,450 --> 00:43:32,965 to fit the causal graph. 875 00:43:35,470 --> 00:43:37,478 I'm going to assume I know the causal graph 876 00:43:37,478 --> 00:43:39,020 and just try to estimate the effects. 877 00:43:39,020 --> 00:43:41,228 That's where this starts to become really relevant. 878 00:43:41,228 --> 00:43:44,053 Because if you notice, this is another causal graph, not 879 00:43:44,053 --> 00:43:45,220 the one I drew on the board. 880 00:43:48,100 --> 00:43:50,110 And so that's something where, really, 881 00:43:50,110 --> 00:43:52,100 talking with domain experts would be relevant. 882 00:43:52,100 --> 00:43:57,120 So if you say, OK, I'm going to be studying hypertension 883 00:43:57,120 --> 00:44:00,870 and this is the data I've observed on patients, 884 00:44:00,870 --> 00:44:04,980 well, you can then go to a clinician, maybe a primary care 885 00:44:04,980 --> 00:44:08,220 doctor who often treats patients with hypertension, 886 00:44:08,220 --> 00:44:10,530 and you say, OK, what usually affects your treatment 887 00:44:10,530 --> 00:44:11,850 decisions? 888 00:44:11,850 --> 00:44:13,370 And you get a set of variables out, 889 00:44:13,370 --> 00:44:15,660 and then you check to make sure, am I 890 00:44:15,660 --> 00:44:17,610 observing all of those variables, at least 891 00:44:17,610 --> 00:44:20,988 the variables that would also affect outcomes? 892 00:44:20,988 --> 00:44:22,530 So, often, there's going to be a back 893 00:44:22,530 --> 00:44:24,900 and forth in that conversation to make sure that you've 894 00:44:24,900 --> 00:44:26,195 set up your problem correctly. 895 00:44:26,195 --> 00:44:27,570 And again, this is one area where 896 00:44:27,570 --> 00:44:29,400 you see a critical difference between the way 897 00:44:29,400 --> 00:44:31,067 that we do causal inference and the way 898 00:44:31,067 --> 00:44:32,400 that we do machine learning. 899 00:44:32,400 --> 00:44:36,580 Machine learning, if there's some unobserved variables, 900 00:44:36,580 --> 00:44:37,080 so what? 901 00:44:37,080 --> 00:44:38,880 I mean, maybe your predictive accuracy isn't quite as good 902 00:44:38,880 --> 00:44:40,740 as it could have been, but whatever. 903 00:44:40,740 --> 00:44:43,920 Here, your conclusions could be completely wrong 904 00:44:43,920 --> 00:44:48,810 if you don't get those confounding factors right.
905 00:44:48,810 --> 00:44:50,610 Now, in some of the optional readings 906 00:44:50,610 --> 00:44:52,710 for Thursday's lecture-- 907 00:44:52,710 --> 00:44:55,290 and we'll touch on it very briefly on Thursday, 908 00:44:55,290 --> 00:44:57,300 but there's not much time in this course-- 909 00:44:57,300 --> 00:45:00,690 I'll talk about ways and you'll read about ways 910 00:45:00,690 --> 00:45:03,780 to try to assess robustness to violations 911 00:45:03,780 --> 00:45:05,430 of these assumptions. 912 00:45:05,430 --> 00:45:08,075 And those go by the name of sensitivity analysis. 913 00:45:08,075 --> 00:45:10,200 So, for example, the type of question you might ask 914 00:45:10,200 --> 00:45:12,300 is, how would my conclusions have 915 00:45:12,300 --> 00:45:15,060 changed if there were a confounding factor which 916 00:45:15,060 --> 00:45:17,860 was blah strong? 917 00:45:17,860 --> 00:45:23,210 And that's something that one could try to answer from data, 918 00:45:23,210 --> 00:45:25,420 but it's really starting to get beyond the scope 919 00:45:25,420 --> 00:45:26,510 of this course. 920 00:45:26,510 --> 00:45:28,052 So I'll give you some readings on it, 921 00:45:28,052 --> 00:45:32,000 but I won't be able to talk about it in the lecture. 922 00:45:32,000 --> 00:45:34,660 Now, the second major assumption that one needs 923 00:45:34,660 --> 00:45:36,943 is what's known as common support. 924 00:45:36,943 --> 00:45:38,610 And by the way, pay close attention here 925 00:45:38,610 --> 00:45:43,680 because at the end of today's lecture-- and if I forget, 926 00:45:43,680 --> 00:45:45,030 someone must remind me-- 927 00:45:45,030 --> 00:45:49,650 I'm going to ask you where did these two assumptions come up 928 00:45:49,650 --> 00:45:52,870 in the proof that I'm about to give you. 929 00:45:52,870 --> 00:45:55,370 The first one I'm going to give you will be a dead giveaway. 930 00:45:55,370 --> 00:45:57,840 So I'm going to answer to you where ignorability comes up, 931 00:45:57,840 --> 00:45:59,423 but it's up to you to figure out where 932 00:45:59,423 --> 00:46:01,560 does common support show up. 933 00:46:01,560 --> 00:46:02,670 So what is common support? 934 00:46:02,670 --> 00:46:07,560 Well, what common support says is that there always 935 00:46:07,560 --> 00:46:11,440 must be some stochasticity in the treatment decisions. 936 00:46:11,440 --> 00:46:17,270 For example, if in your data patients only 937 00:46:17,270 --> 00:46:21,780 receive treatment A and no patient receives treatment B, 938 00:46:21,780 --> 00:46:24,420 then you would never be able to figure out the counterfactual, 939 00:46:24,420 --> 00:46:29,008 what would have happened if patients receive treatment B. 940 00:46:29,008 --> 00:46:31,050 But what happens if it's not quite that universal 941 00:46:31,050 --> 00:46:34,260 but maybe there is classes of people? 942 00:46:34,260 --> 00:46:37,350 Some individual is X, let's say, people with blue hair. 943 00:46:37,350 --> 00:46:42,450 People with blue hair always receive treatment zero 944 00:46:42,450 --> 00:46:45,200 and they never see treatment one. 
945 00:46:45,200 --> 00:46:49,340 Well, for those people, if for some reason 946 00:46:49,340 --> 00:46:50,990 something about them having blue hair 947 00:46:50,990 --> 00:46:53,600 was also going to affect how they would respond 948 00:46:53,600 --> 00:46:55,250 to the treatment, then you wouldn't 949 00:46:55,250 --> 00:46:57,470 be able to answer anything about the counterfactual 950 00:46:57,470 --> 00:46:59,660 for those individuals. 951 00:46:59,660 --> 00:47:03,560 This goes by the name of what's called a propensity score. 952 00:47:03,560 --> 00:47:07,310 It's the probability of receiving some treatment 953 00:47:07,310 --> 00:47:09,230 for each individual. 954 00:47:09,230 --> 00:47:14,150 And we're going to assume that this propensity score is always 955 00:47:14,150 --> 00:47:17,150 bounded away from 0 and 1. 956 00:47:17,150 --> 00:47:20,000 So it's between epsilon and 1 minus epsilon 957 00:47:20,000 --> 00:47:23,020 for some small epsilon. 958 00:47:23,020 --> 00:47:25,330 And violations of that assumption 959 00:47:25,330 --> 00:47:28,000 are going to completely invalidate all conclusions 960 00:47:28,000 --> 00:47:30,610 that we could draw from the data. 961 00:47:30,610 --> 00:47:31,520 All right. 962 00:47:31,520 --> 00:47:35,190 Now, in actual clinical practice, you might wonder, 963 00:47:35,190 --> 00:47:37,010 can this ever hold? 964 00:47:37,010 --> 00:47:40,880 Because there are clinical guidelines. 965 00:47:40,880 --> 00:47:43,867 Well, a couple of places where you'll see this are as follows. 966 00:47:43,867 --> 00:47:46,450 First, often, there are settings where we haven't the faintest 967 00:47:46,450 --> 00:47:49,720 idea how to treat patients, like second line diabetes 968 00:47:49,720 --> 00:47:51,010 treatments. 969 00:47:51,010 --> 00:47:54,370 You know that the first thing we start with is metformin. 970 00:47:54,370 --> 00:47:57,310 But if metformin doesn't help control the patient's glucose 971 00:47:57,310 --> 00:48:00,490 values, there are several second line diabetic treatments. 972 00:48:00,490 --> 00:48:03,100 And right now, we don't really know which one to try. 973 00:48:03,100 --> 00:48:06,340 So a clinician might start with treatments from one class. 974 00:48:06,340 --> 00:48:08,570 And if that's not working, you try a different class, 975 00:48:08,570 --> 00:48:09,040 and so on. 976 00:48:09,040 --> 00:48:10,582 And it's a bit random which class you 977 00:48:10,582 --> 00:48:13,550 start with for any one patient. 978 00:48:13,550 --> 00:48:16,600 In other settings, there might be good clinical guidelines, 979 00:48:16,600 --> 00:48:18,860 but there is randomness in other ways. 980 00:48:18,860 --> 00:48:25,500 For example, clinicians who are trained on the west coast 981 00:48:25,500 --> 00:48:28,350 might be trained that this is the right way to do things, 982 00:48:28,350 --> 00:48:30,420 and clinicians who are trained on the east coast 983 00:48:30,420 --> 00:48:33,630 might be trained that a different way is the right way to do things. 984 00:48:33,630 --> 00:48:37,860 And so even if any one clinician's treatment decisions 985 00:48:37,860 --> 00:48:40,260 are deterministic in some way, you'll 986 00:48:40,260 --> 00:48:43,530 see some stochasticity now across clinicians. 987 00:48:43,530 --> 00:48:45,790 It's a bit subtle how to use that in your analysis, 988 00:48:45,790 --> 00:48:49,160 but trust me, it can be done.
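Written out, the common support (overlap) assumption says that the propensity score is bounded away from 0 and 1 (again a reconstruction in standard notation, not the slide itself):

$$ \epsilon \;<\; p(T = 1 \mid X = x) \;<\; 1 - \epsilon \qquad \text{for some } \epsilon > 0 \text{ and every } x \text{ with } p(x) > 0 . $$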
989 00:48:49,160 --> 00:48:51,680 So if you want to do causal inference 990 00:48:51,680 --> 00:48:53,290 from observational data, you're going 991 00:48:53,290 --> 00:48:56,570 to have to first start to formalize things mathematically 992 00:48:56,570 --> 00:49:01,190 in terms of what is your X, what is your T, what is your Y. You 993 00:49:01,190 --> 00:49:04,830 have to think through, do these choices 994 00:49:04,830 --> 00:49:09,310 satisfy these assumptions of ignorability and overlap? 995 00:49:09,310 --> 00:49:11,310 Some of these things you can check in your data. 996 00:49:11,310 --> 00:49:13,770 Ignorability you can't explicitly check in your data. 997 00:49:13,770 --> 00:49:19,580 But overlap, this thing, you can test in your data. 998 00:49:19,580 --> 00:49:20,550 By the way, how? 999 00:49:20,550 --> 00:49:21,050 Any idea? 1000 00:49:24,828 --> 00:49:26,370 Someone else who hasn't spoken today. 1001 00:49:31,320 --> 00:49:33,690 So just think back to the previous example. 1002 00:49:33,690 --> 00:49:41,220 You have this table of these X's and treatment A or B and then 1003 00:49:41,220 --> 00:49:42,750 sugar values. 1004 00:49:42,750 --> 00:49:44,303 How would you test this? 1005 00:49:44,303 --> 00:49:46,220 AUDIENCE: You could use a frequentist approach 1006 00:49:46,220 --> 00:49:48,550 and just count how many things show up. 1007 00:49:48,550 --> 00:49:51,880 And if there is zero, then you could say that it's violated. 1008 00:49:51,880 --> 00:49:52,770 DAVID SONTAG: Good. 1009 00:49:52,770 --> 00:49:54,705 So you have this table. 1010 00:49:54,705 --> 00:49:58,140 I'll just go back to that table. 1011 00:49:58,140 --> 00:50:03,420 We have this table, and these are your X's. 1012 00:50:05,805 --> 00:50:07,680 Actually, we'll go back to the previous slide 1013 00:50:07,680 --> 00:50:08,972 where it's a bit easier to see. 1014 00:50:13,930 --> 00:50:17,020 Here, we're going to ignore the outcome, the sugar 1015 00:50:17,020 --> 00:50:19,150 levels because, remember, this only 1016 00:50:19,150 --> 00:50:22,030 has to do with probability of treatment 1017 00:50:22,030 --> 00:50:23,770 given your covariates. 1018 00:50:23,770 --> 00:50:25,588 The Y doesn't show up here at all. 1019 00:50:25,588 --> 00:50:27,130 So this thing on the right-hand side, 1020 00:50:27,130 --> 00:50:29,977 the observed sugar levels, is irrelevant for this question. 1021 00:50:29,977 --> 00:50:31,810 All we care about is what goes on over here. 1022 00:50:31,810 --> 00:50:32,740 So we look at this. 1023 00:50:32,740 --> 00:50:35,100 These are your X's, and this is your treatment. 1024 00:50:35,100 --> 00:50:37,840 And you can look to see, OK, here you 1025 00:50:37,840 --> 00:50:42,010 have one 75-year-old male who does exercise 1026 00:50:42,010 --> 00:50:44,680 frequently and received treatment A. Is there any one 1027 00:50:44,680 --> 00:50:48,370 else in the data set who is 75 years old and male, 1028 00:50:48,370 --> 00:50:51,190 does exercise regularly but received treatment B? 1029 00:50:51,190 --> 00:50:52,450 Yes or no? 1030 00:50:52,450 --> 00:50:53,580 No. 1031 00:50:53,580 --> 00:50:54,080 Good. 1032 00:50:54,080 --> 00:50:54,580 OK. 1033 00:50:54,580 --> 00:50:59,360 So overlap is not satisfied here, at least not empirically. 1034 00:50:59,360 --> 00:51:03,190 Now, you might argue that I'm being a bit too coarse here. 1035 00:51:03,190 --> 00:51:05,740 Well, what happens if the individual is 74 1036 00:51:05,740 --> 00:51:06,850 and received treatment B?
1037 00:51:06,850 --> 00:51:08,200 Maybe that's close enough. 1038 00:51:08,200 --> 00:51:09,820 So there starts to become subtleties 1039 00:51:09,820 --> 00:51:12,700 in assessing these things when you have finite data. 1040 00:51:12,700 --> 00:51:14,710 But it is something at the fundamental level 1041 00:51:14,710 --> 00:51:17,290 that you could start to assess using data. 1042 00:51:17,290 --> 00:51:19,870 As opposed to ignorability, which you cannot test using 1043 00:51:19,870 --> 00:51:21,290 data. 1044 00:51:21,290 --> 00:51:21,790 All right. 1045 00:51:21,790 --> 00:51:29,990 So you have to think about, are these assumptions satisfied? 1046 00:51:29,990 --> 00:51:34,160 And only once you start to think through those questions can 1047 00:51:34,160 --> 00:51:37,340 you start to do your analysis. 1048 00:51:37,340 --> 00:51:41,460 And so that now brings me to the next part of this lecture, 1049 00:51:41,460 --> 00:51:45,260 which is how do we actually-- let's just now believe David, 1050 00:51:45,260 --> 00:51:46,760 believe that these assumptions hold. 1051 00:51:46,760 --> 00:51:49,720 How do we do that causal inference? 1052 00:51:49,720 --> 00:51:50,220 Yeah? 1053 00:51:50,220 --> 00:51:51,802 AUDIENCE: I just had a question on [INAUDIBLE].. 1054 00:51:51,802 --> 00:51:54,687 If you know that some patients, for instance, healthy patients, 1055 00:51:54,687 --> 00:51:56,270 are not going to get any treatment, 1056 00:51:56,270 --> 00:51:58,890 should we just remove them, basically? 1057 00:51:58,890 --> 00:52:00,500 DAVID SONTAG: So the question is, 1058 00:52:00,500 --> 00:52:04,710 what happens if you have a violation of overlap? 1059 00:52:04,710 --> 00:52:08,240 For example, you know that healthy individuals never 1060 00:52:08,240 --> 00:52:09,770 receive any treatment. 1061 00:52:09,770 --> 00:52:11,520 Should you remove them from your data set? 1062 00:52:11,520 --> 00:52:14,020 Well, first of all, that has to do with how do you formalize 1063 00:52:14,020 --> 00:52:16,160 the question because not receiving a treatment 1064 00:52:16,160 --> 00:52:18,250 is a treatment. 1065 00:52:18,250 --> 00:52:21,880 So that might be your control arm, just to be clear. 1066 00:52:21,880 --> 00:52:24,160 Now, if you're asking about the difference between two 1067 00:52:24,160 --> 00:52:26,200 treatments-- two different classes of treatment 1068 00:52:26,200 --> 00:52:34,000 for a condition, then often one defines the relevant inclusion 1069 00:52:34,000 --> 00:52:40,990 criteria in order to have these conditions hold. 1070 00:52:40,990 --> 00:52:44,830 For example, we could try to redefine the set of individuals 1071 00:52:44,830 --> 00:52:47,140 that we're asking about so that overlap does hold. 1072 00:52:47,140 --> 00:52:48,640 But then in that situation, you have 1073 00:52:48,640 --> 00:52:51,520 to just make sure that your policy is also modified. 1074 00:52:51,520 --> 00:52:54,640 You say, OK, I conclude that the average treatment effect is 1075 00:52:54,640 --> 00:52:57,740 blah for this type of people. 1076 00:52:57,740 --> 00:52:59,830 OK? 1077 00:52:59,830 --> 00:53:01,730 OK. 1078 00:53:01,730 --> 00:53:05,560 So how could we possibly compute the average treatment effect 1079 00:53:05,560 --> 00:53:07,835 from data? 1080 00:53:07,835 --> 00:53:09,960 Remember, average treatment effect, mathematically, 1081 00:53:09,960 --> 00:53:13,635 is the expectation of the difference between the potential outcomes, Y1 minus Y0.
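Here, as a minimal sketch, is the empirical overlap check described a moment ago: group the rows by their covariate values and flag any stratum in which only one treatment ever appears. The records below are hypothetical stand-ins, not the table from the slides.

```python
from collections import defaultdict

# Hypothetical records: (age, gender, exercises_regularly, treatment)
records = [
    (75, "M", True,  "A"),
    (75, "M", True,  "A"),
    (74, "F", False, "B"),
    (74, "F", False, "A"),
    (60, "M", False, "B"),
]

# Record which treatments appear within each covariate stratum.
treatments_seen = defaultdict(set)
for age, gender, exercises, treatment in records:
    treatments_seen[(age, gender, exercises)].add(treatment)

# Strata where only one treatment appears are empirical violations of overlap.
for stratum, seen in treatments_seen.items():
    if len(seen) < 2:
        print("no overlap for covariates", stratum, "- only saw treatment", seen)
```

With continuous covariates you would have to coarsen the strata, or estimate the propensity score and look for values near 0 or 1, which is exactly the 74-versus-75 subtlety raised above.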
1082 00:53:16,860 --> 00:53:20,250 The key tool which we'll use in order to estimate that 1083 00:53:20,250 --> 00:53:22,203 is what's known as the adjustment formula. 1084 00:53:22,203 --> 00:53:24,370 This goes by many names in the statistics community, 1085 00:53:24,370 --> 00:53:26,960 such as the G-formula as well. 1086 00:53:26,960 --> 00:53:30,460 Here, I'll give you a derivation of it. 1087 00:53:30,460 --> 00:53:34,790 We're first going to recognize that this expectation is 1088 00:53:34,790 --> 00:53:36,830 actually two expectations in one. 1089 00:53:36,830 --> 00:53:39,770 It's the expectation over individuals X 1090 00:53:39,770 --> 00:53:43,575 and it's the expectation over potential outcomes Y given X. 1091 00:53:43,575 --> 00:53:45,200 So I'm first just going to write it out 1092 00:53:45,200 --> 00:53:47,330 in terms of those two expectations, 1093 00:53:47,330 --> 00:53:50,870 and I'll write the expectations related to X on the outside. 1094 00:53:50,870 --> 00:53:54,760 That goes by the name of the law of total expectation. 1095 00:53:54,760 --> 00:53:58,750 This is trivial at this stage. 1096 00:53:58,750 --> 00:54:02,230 And by the way, I'm just writing out expectation of Y1. 1097 00:54:02,230 --> 00:54:04,900 In a few minutes, I'll show you expectation of Y0, 1098 00:54:04,900 --> 00:54:07,840 but it's going to be exactly analogous. 1099 00:54:07,840 --> 00:54:11,980 Now, the next step is where we use ignorability. 1100 00:54:11,980 --> 00:54:15,540 I told you I was going to give that one away. 1101 00:54:15,540 --> 00:54:19,000 So remember, we said that we're assuming 1102 00:54:19,000 --> 00:54:23,740 that Y1 is conditionally independent of the treatment 1103 00:54:23,740 --> 00:54:34,210 T given X. What that means is probability of Y1 1104 00:54:34,210 --> 00:54:39,910 given X is equal to probability of Y1 1105 00:54:39,910 --> 00:54:43,750 given X comma T equals whatever-- in this case 1106 00:54:43,750 --> 00:54:46,750 I'll just say T equals 1. 1107 00:54:46,750 --> 00:54:52,220 This is implied by Y1 being conditionally independent of T 1108 00:54:52,220 --> 00:54:54,050 given X. 1109 00:54:54,050 --> 00:54:59,640 So I can just stick in comma T equals 1 here, 1110 00:54:59,640 --> 00:55:05,090 and that's explicitly because of ignorability holding. 1111 00:55:05,090 --> 00:55:08,570 But now we're in a really good place because notice that-- 1112 00:55:08,570 --> 00:55:10,760 and here I've just done some short notation. 1113 00:55:10,760 --> 00:55:14,391 I'm just going to hide this expectation. 1114 00:55:17,550 --> 00:55:19,840 And by the way, you could do the same for Y0-- 1115 00:55:19,840 --> 00:55:21,640 Y1, Y0. 1116 00:55:21,640 --> 00:55:26,330 And now notice that we can replace 1117 00:55:26,330 --> 00:55:30,140 this average treatment effect with now this expectation 1118 00:55:30,140 --> 00:55:32,440 with respect to all individuals X 1119 00:55:32,440 --> 00:55:35,720 of the expectation of Y1 given X comma T equals 1, and so on. 1120 00:55:39,500 --> 00:55:43,600 And these are mostly quantities that we can now 1121 00:55:43,600 --> 00:55:45,560 observe from our data. 1122 00:55:45,560 --> 00:55:50,980 So, for example, we can look at the individuals who 1123 00:55:50,980 --> 00:55:54,550 received treatment one, and for those individuals 1124 00:55:54,550 --> 00:55:56,690 we have realizations of Y1.
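In symbols, the step just described for Y1 is (a reconstruction of the slide's formula from the derivation stated above):

$$ \mathbb{E}[Y_1] \;=\; \mathbb{E}_{x \sim p(x)}\big[\, \mathbb{E}[Y_1 \mid X = x] \,\big] \;=\; \mathbb{E}_{x \sim p(x)}\big[\, \mathbb{E}[Y_1 \mid X = x, T = 1] \,\big], $$

where the first equality is the law of total expectation and the second uses ignorability. The same argument with T = 0 gives the analogous expression for Y0, so the average treatment effect becomes

$$ \mathbb{E}[Y_1 - Y_0] \;=\; \mathbb{E}_{x \sim p(x)}\big[\, \mathbb{E}[Y \mid X = x, T = 1] \;-\; \mathbb{E}[Y \mid X = x, T = 0] \,\big]. $$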
1125 00:55:56,690 --> 00:55:58,940 We can look at individuals who receive treatment zero, 1126 00:55:58,940 --> 00:56:02,497 and for those individuals we have realizations of Y0. 1127 00:56:02,497 --> 00:56:04,330 And we could just average those realizations 1128 00:56:04,330 --> 00:56:07,810 to get estimates of the corresponding expectations. 1129 00:56:07,810 --> 00:56:10,150 So these we can easily estimate from our data. 1130 00:56:13,020 --> 00:56:14,660 And so we've made progress. 1131 00:56:14,660 --> 00:56:18,665 We can now estimate some part of this from our data. 1132 00:56:18,665 --> 00:56:20,040 But notice, there are some things 1133 00:56:20,040 --> 00:56:22,123 that we can't yet directly estimate from our data. 1134 00:56:22,123 --> 00:56:27,450 In particular, we can't estimate expectation of Y0 1135 00:56:27,450 --> 00:56:31,500 given X comma T equals 1 because we have no idea what 1136 00:56:31,500 --> 00:56:34,620 would have happened to this individual who actually 1137 00:56:34,620 --> 00:56:37,170 got treatment one if they had gotten treatment zero. 1138 00:56:37,170 --> 00:56:39,060 So these we don't know. 1139 00:56:42,210 --> 00:56:45,540 So these we don't know. 1140 00:56:45,540 --> 00:56:47,620 Now, what is the trick I'm playing on you? 1141 00:56:47,620 --> 00:56:50,030 How does it help that we can do this? 1142 00:56:50,030 --> 00:56:52,300 Well, the key point is that these quantities 1143 00:56:52,300 --> 00:56:56,790 that we can estimate from data show up in that term. 1144 00:56:56,790 --> 00:56:59,910 In particular, if you look at the individuals X 1145 00:56:59,910 --> 00:57:04,230 that you've sampled from the full set of individuals P of X, 1146 00:57:04,230 --> 00:57:07,650 for that individual X for which, in fact, 1147 00:57:07,650 --> 00:57:11,310 we observed T equals 1, then we can estimate expectation of Y1 1148 00:57:11,310 --> 00:57:16,430 given X comma T equals 1, and similarly for Y0. 1149 00:57:16,430 --> 00:57:19,080 But what we need to be able to do is to extrapolate. 1150 00:57:19,080 --> 00:57:22,995 Because empirically, we only have samples from P of X 1151 00:57:22,995 --> 00:57:24,620 given T equals 1, P of X given T equals 1152 00:57:24,620 --> 00:57:27,830 0 for those two potential outcomes correspondingly. 1153 00:57:27,830 --> 00:57:31,670 But we are going to also get samples of X such 1154 00:57:31,670 --> 00:57:33,620 that for those individuals in your data set, 1155 00:57:33,620 --> 00:57:36,650 you might have only observed T equals 0. 1156 00:57:36,650 --> 00:57:41,180 And to compute this formula, you have to answer, for that X, 1157 00:57:41,180 --> 00:57:44,360 what would it have been if they got treatment equals one? 1158 00:57:44,360 --> 00:57:46,283 So there are going to be a set of individuals 1159 00:57:46,283 --> 00:57:47,950 that we have to extrapolate for in order 1160 00:57:47,950 --> 00:57:50,275 to use this adjustment formula for estimation. 1161 00:57:52,780 --> 00:57:53,280 Yep? 1162 00:57:53,280 --> 00:57:55,405 AUDIENCE: I thought because common support is true, 1163 00:57:55,405 --> 00:57:58,010 we have some patients that received each treatment 1164 00:57:58,010 --> 00:58:00,150 for a given type of X. 1165 00:58:00,150 --> 00:58:02,110 DAVID SONTAG: Yes. 1166 00:58:02,110 --> 00:58:06,850 But now-- so, yes, that's true. 1167 00:58:09,460 --> 00:58:13,920 But that's a statement about infinite data. 1168 00:58:13,920 --> 00:58:17,010 And in reality, one only has finite data.
1169 00:58:17,010 --> 00:58:22,590 And so although common support has to hold to some extent, 1170 00:58:22,590 --> 00:58:25,500 you can't just build on that to say that you always 1171 00:58:25,500 --> 00:58:29,240 observe the counterfactual for every individual, 1172 00:58:29,240 --> 00:58:30,990 such as the pictures I showed you earlier. 1173 00:58:33,740 --> 00:58:36,340 So I'm going to leave this slide up for just one more second 1174 00:58:36,340 --> 00:58:38,260 to let it sink in and see what it's saying. 1175 00:58:41,660 --> 00:58:44,480 We started out from the goal of computing the average treatment 1176 00:58:44,480 --> 00:58:48,330 effect, expected value of Y1 minus Y0. 1177 00:58:48,330 --> 00:58:50,970 Using the adjustment formula, we've 1178 00:58:50,970 --> 00:58:54,750 gotten to now an equivalent representation, which 1179 00:58:54,750 --> 00:58:58,080 is now an expectation with respect to all individuals 1180 00:58:58,080 --> 00:59:03,310 sampling from P of X of expected value of Y1 1181 00:59:03,310 --> 00:59:05,800 given X comma T equals 1, minus the expected value of Y0 1182 00:59:05,800 --> 00:59:08,060 given X comma T equals 0. 1183 00:59:08,060 --> 00:59:10,223 For some of the individuals, you can observe this, 1184 00:59:10,223 --> 00:59:12,140 and for some of them, you have to extrapolate. 1185 00:59:14,670 --> 00:59:18,547 So from here, there are many ways that one can go. 1186 00:59:18,547 --> 00:59:20,130 Hold your question for a little while. 1187 00:59:23,180 --> 00:59:25,940 So types of causal inference methods 1188 00:59:25,940 --> 00:59:27,500 that you will have heard of include 1189 00:59:27,500 --> 00:59:29,090 things like covariate adjustment, 1190 00:59:29,090 --> 00:59:32,120 propensity score re-weighting, doubly robust estimators, 1191 00:59:32,120 --> 00:59:34,830 matching, and so on. 1192 00:59:34,830 --> 00:59:37,520 And those are the tools of the causal inference trade. 1193 00:59:37,520 --> 00:59:39,320 And in this course, we're only going 1194 00:59:39,320 --> 00:59:40,520 to talk about the first two. 1195 00:59:40,520 --> 00:59:41,750 And in today's lecture, we're only 1196 00:59:41,750 --> 00:59:44,083 going to talk about the first one, covariate adjustment. 1197 00:59:44,083 --> 00:59:47,610 And on Thursday, we'll talk about the second one. 1198 00:59:47,610 --> 00:59:50,690 So covariate adjustment is a very natural way 1199 00:59:50,690 --> 00:59:54,505 to try to do that extrapolation. 1200 00:59:54,505 --> 00:59:56,880 It also goes by the name, by the way, of response surface 1201 00:59:56,880 --> 00:59:57,500 modeling. 1202 00:59:57,500 --> 00:59:59,042 What we're going to do is we're going 1203 00:59:59,042 --> 01:00:04,010 to learn a function f, which takes as an input X and T, 1204 01:00:04,010 --> 01:00:06,500 and its goal is to predict Y. So intuitively, you 1205 01:00:06,500 --> 01:00:10,790 should think about f as this conditional probability 1206 01:00:10,790 --> 01:00:12,620 distribution. 1207 01:00:12,620 --> 01:00:19,140 It's predicting Y given X and T. So 1208 01:00:19,140 --> 01:00:23,340 T is going to be an input to the machine learning 1209 01:00:23,340 --> 01:00:25,830 algorithm, which is going to predict what would be 1210 01:00:25,830 --> 01:00:30,850 the potential outcome Y for this individual described by features 1211 01:00:30,850 --> 01:00:42,720 X1 through Xd under intervention T. 1212 01:00:42,720 --> 01:00:44,710 So this is just from the previous slide.
1213 01:00:44,710 --> 01:00:46,290 And what we're going to do now are-- 1214 01:00:46,290 --> 01:00:50,640 this is now where we get the reduction to machine learning-- 1215 01:00:50,640 --> 01:00:53,820 is we're going to use empirical risk minimization, or maybe 1216 01:00:53,820 --> 01:00:57,480 some regularized empirical risk minimization, to fit a function 1217 01:00:57,480 --> 01:01:02,070 f which approximates the expected value of Y sub t given 1218 01:01:02,070 --> 01:01:03,910 capital T equals little t 1219 01:01:03,910 --> 01:01:07,830 and X. And then once you have that function, 1220 01:01:07,830 --> 01:01:10,260 we're going to be able to use that to estimate 1221 01:01:10,260 --> 01:01:15,420 the average treatment effect by just implementing now 1222 01:01:15,420 --> 01:01:16,863 this formula here. 1223 01:01:16,863 --> 01:01:19,196 So we're going to first take an expectation with respect 1224 01:01:19,196 --> 01:01:20,790 to the individuals in the data set. 1225 01:01:20,790 --> 01:01:22,440 So we're going to approximate that 1226 01:01:22,440 --> 01:01:25,890 with an empirical expectation where we sum over the little n 1227 01:01:25,890 --> 01:01:28,370 individuals in your data set. 1228 01:01:28,370 --> 01:01:29,870 Then what we're going to do is we're 1229 01:01:29,870 --> 01:01:36,620 going to estimate the first term, which is f of Xi comma 1 1230 01:01:36,620 --> 01:01:39,590 because that is approximating the expected value of Y1 1231 01:01:39,590 --> 01:01:41,330 given T comma X-- 1232 01:01:41,330 --> 01:01:43,970 T equals 1 comma X. And we're going 1233 01:01:43,970 --> 01:01:47,240 to approximate the second term, which is just plugging 1234 01:01:47,240 --> 01:01:49,750 now 0 for T instead of 1. 1235 01:01:49,750 --> 01:01:51,950 And we're going to take the difference between them, 1236 01:01:51,950 --> 01:01:54,630 and that will be our estimator of the average treatment 1237 01:01:54,630 --> 01:01:55,130 effect. 1238 01:02:00,357 --> 01:02:02,065 Here's a natural place to ask a question. 1239 01:02:07,210 --> 01:02:12,578 One thing you might wonder is, in your data set, 1240 01:02:12,578 --> 01:02:14,870 you actually did observe something for that individual, 1241 01:02:14,870 --> 01:02:15,660 right. 1242 01:02:15,660 --> 01:02:20,550 Notice how your raw data doesn't show up in this at all. 1243 01:02:20,550 --> 01:02:23,250 Because I've done machine learning, 1244 01:02:23,250 --> 01:02:27,030 and then I've thrown away the observed Y's, 1245 01:02:27,030 --> 01:02:30,330 and I used this estimator. 1246 01:02:30,330 --> 01:02:33,120 So what you could have done-- an alternative formula, which, 1247 01:02:33,120 --> 01:02:35,530 by the way, is also a consistent estimator, 1248 01:02:35,530 --> 01:02:38,280 would have been to use the observed 1249 01:02:38,280 --> 01:02:41,760 Y for whatever the factual is and the imputed Y 1250 01:02:41,760 --> 01:02:44,642 for the counterfactual using f. 1251 01:02:44,642 --> 01:02:46,350 That would have also 1252 01:02:46,350 --> 01:02:48,690 been a consistent estimator for the average treatment effect. 1253 01:02:48,690 --> 01:02:49,732 You could've done either. 1254 01:02:53,790 --> 01:02:54,290 OK. 1255 01:02:57,050 --> 01:02:59,360 Now, sometimes you're not interested in just 1256 01:02:59,360 --> 01:03:00,680 the average treatment effect, but you're actually 1257 01:03:00,680 --> 01:03:02,555 interested in understanding the heterogeneity 1258 01:03:02,555 --> 01:03:03,647 in the population.
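Here is a minimal sketch of that covariate-adjustment estimator on simulated data (hypothetical numbers, and a linear outcome model assumed purely for illustration). The function f takes both X and T as inputs, exactly as described above; the estimate averages f(x, 1) - f(x, 0) over the individuals in the data set, and the variant that keeps the observed Y for the factual arm is shown as well.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated, confounded data (hypothetical): sicker people (large x) get treated more often.
x = rng.normal(size=n)
t = (x + rng.normal(size=n) > 0).astype(float)           # treatment depends on x
y = 2.0 * x - 1.0 * t + rng.normal(scale=0.5, size=n)    # true ATE is -1.0

# Naive estimator: wrong sign here, because of confounding by x.
naive = y[t == 1].mean() - y[t == 0].mean()

# Covariate adjustment: fit f(x, t) by least squares, with t as just another input.
design = np.column_stack([np.ones(n), x, t])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

def f(xs, ts):
    return coef[0] + coef[1] * xs + coef[2] * ts

# ATE estimate: average f(x, 1) - f(x, 0) over everyone in the data set.
ate_hat = np.mean(f(x, 1.0) - f(x, 0.0))

# Alternative (also consistent): keep the observed Y for the factual arm and
# impute only the counterfactual arm with f.
y1_hat = np.where(t == 1, y, f(x, 1.0))
y0_hat = np.where(t == 0, y, f(x, 0.0))
ate_alt = np.mean(y1_hat - y0_hat)

print(f"naive difference              : {naive:+.2f}")
print(f"covariate-adjusted ATE        : {ate_hat:+.2f}")   # close to the true -1.0
print(f"impute-counterfactual variant : {ate_alt:+.2f}")
```

The per-individual quantity f(x_i, 1) - f(x_i, 0), before averaging, is the conditional average treatment effect that comes up next.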
1259 01:03:03,647 --> 01:03:05,480 Well, this also now gives you an opportunity 1260 01:03:05,480 --> 01:03:08,460 to try to explore that heterogeneity. 1261 01:03:08,460 --> 01:03:10,520 So for each individual Xi, you can 1262 01:03:10,520 --> 01:03:12,530 look at just the difference between what 1263 01:03:12,530 --> 01:03:16,580 f predicts given treatment one and what f 1264 01:03:16,580 --> 01:03:17,930 predicts given treatment zero. 1265 01:03:17,930 --> 01:03:19,070 And the difference between those is 1266 01:03:19,070 --> 01:03:21,195 your estimate of your conditional average treatment 1267 01:03:21,195 --> 01:03:21,695 effect. 1268 01:03:21,695 --> 01:03:23,445 So, for example, if you want to figure out 1269 01:03:23,445 --> 01:03:25,460 for this individual, what is the optimal policy, 1270 01:03:25,460 --> 01:03:27,667 you might look to see is CATE positive or negative, 1271 01:03:27,667 --> 01:03:29,750 or is it greater than some threshold, for example? 1272 01:03:32,148 --> 01:03:33,440 So let's look at some pictures. 1273 01:03:36,300 --> 01:03:39,030 Now what we're using is we're using that function f in order 1274 01:03:39,030 --> 01:03:41,190 to impute those counterfactuals. 1275 01:03:41,190 --> 01:03:43,920 And now we have those observed, and we can actually 1276 01:03:43,920 --> 01:03:45,540 compute the CATE. 1277 01:03:45,540 --> 01:03:48,120 And averaging over those, you can estimate now 1278 01:03:48,120 --> 01:03:51,060 the average treatment effect. 1279 01:03:51,060 --> 01:03:51,560 Yep? 1280 01:03:51,560 --> 01:03:53,180 AUDIENCE: How is f non-biased? 1281 01:03:54,968 --> 01:03:55,760 DAVID SONTAG: Good. 1282 01:03:55,760 --> 01:03:57,008 So where can this go wrong? 1283 01:03:57,008 --> 01:03:58,550 So what do you mean by biased, first? 1284 01:03:58,550 --> 01:03:59,168 I'll ask that. 1285 01:03:59,168 --> 01:04:00,710 AUDIENCE: For instance, as we've seen 1286 01:04:00,710 --> 01:04:04,820 in the paper like pneumonia and people who have asthma, 1287 01:04:04,820 --> 01:04:07,830 [INAUDIBLE] 1288 01:04:08,717 --> 01:04:11,300 DAVID SONTAG: Oh, thank you so much for bringing that back up. 1289 01:04:11,300 --> 01:04:15,350 So you're referring to one of the readings 1290 01:04:15,350 --> 01:04:17,000 for the course from several weeks 1291 01:04:17,000 --> 01:04:20,180 ago, where we talked about using just a pure machine learning 1292 01:04:20,180 --> 01:04:24,903 algorithm to try to predict outcomes in a hospital setting. 1293 01:04:24,903 --> 01:04:26,570 In particular, what happens for patients 1294 01:04:26,570 --> 01:04:29,780 who have pneumonia in the emergency department? 1295 01:04:29,780 --> 01:04:32,600 And if you all remember, there was this asthma example, 1296 01:04:32,600 --> 01:04:36,320 where patients with asthma were predicted 1297 01:04:36,320 --> 01:04:41,090 to have better outcomes than patients without asthma. 1298 01:04:43,700 --> 01:04:45,188 And you're calling that bias. 1299 01:04:45,188 --> 01:04:46,980 But you remember, when I taught about this, 1300 01:04:46,980 --> 01:04:48,610 I called it biased due to a particular thing. 1301 01:04:48,610 --> 01:04:49,735 What's the language I used? 1302 01:04:52,990 --> 01:04:58,978 I said bias due to intervention, maybe, is what I-- 1303 01:04:58,978 --> 01:05:00,520 I can't remember exactly what I said. 1304 01:05:00,520 --> 01:05:02,170 [LAUGHTER] 1305 01:05:02,170 --> 01:05:03,400 I don't know. 1306 01:05:03,400 --> 01:05:06,240 Make it up.
1307 01:05:06,240 --> 01:05:08,960 Now a textbook will be written with bias by intervention. 1308 01:05:08,960 --> 01:05:09,460 OK. 1309 01:05:09,460 --> 01:05:12,160 So the problem there is that they 1310 01:05:12,160 --> 01:05:16,193 didn't formalize the prediction problem correctly. 1311 01:05:16,193 --> 01:05:17,860 The question that they should have asked 1312 01:05:17,860 --> 01:05:20,920 is, for asthma patients-- 1313 01:05:23,440 --> 01:05:31,150 what you really want to ask is a question of X and then T and Y, 1314 01:05:31,150 --> 01:05:39,360 where T are the interventions that are done for asthmatics. 1315 01:05:45,090 --> 01:05:48,450 So the failure of that paper is that it ignored 1316 01:05:48,450 --> 01:05:51,360 the causal inference question which was hidden in the data, 1317 01:05:51,360 --> 01:05:54,120 and it just went to predict Y given X marginalizing 1318 01:05:54,120 --> 01:05:55,320 over T altogether. 1319 01:05:55,320 --> 01:05:59,070 So T was never in the predictive model. 1320 01:05:59,070 --> 01:06:01,870 And said differently, they never asked counterfactual questions 1321 01:06:01,870 --> 01:06:04,200 of what would have happened had you done a different T. 1322 01:06:04,200 --> 01:06:06,640 And then they still used it to try to guide some treatment 1323 01:06:06,640 --> 01:06:07,140 decisions. 1324 01:06:07,140 --> 01:06:09,840 Like, for example, should you send this person home, 1325 01:06:09,840 --> 01:06:12,173 or should you keep them for careful monitoring or so on? 1326 01:06:12,173 --> 01:06:14,415 So this is exactly the same example 1327 01:06:14,415 --> 01:06:16,260 as I gave in the beginning of the lecture, 1328 01:06:16,260 --> 01:06:19,020 where I said if you just use a risk stratification 1329 01:06:19,020 --> 01:06:23,430 model to make some decisions, you run the risk that you're 1330 01:06:23,430 --> 01:06:27,720 making the wrong decisions because those predictions were 1331 01:06:27,720 --> 01:06:30,360 biased by decisions in your data. 1332 01:06:30,360 --> 01:06:32,580 So that doesn't happen here because we're explicitly 1333 01:06:32,580 --> 01:06:35,320 accounting for T in all of our analysis. 1334 01:06:35,320 --> 01:06:35,980 Yep? 1335 01:06:35,980 --> 01:06:38,330 AUDIENCE: In the data sets that we've used, like MIMIC, 1336 01:06:38,330 --> 01:06:39,922 how much treatment information exists? 1337 01:06:39,922 --> 01:06:41,880 DAVID SONTAG: So how much treatment information 1338 01:06:41,880 --> 01:06:42,380 is in MIMIC? 1339 01:06:42,380 --> 01:06:44,880 A ton. 1340 01:06:44,880 --> 01:06:48,240 In fact, one of the readings for next week 1341 01:06:48,240 --> 01:06:52,350 is going to be about trying to understand how one could manage 1342 01:06:52,350 --> 01:06:58,920 sepsis, which is a condition caused by infection, which 1343 01:06:58,920 --> 01:07:02,670 is managed by, for example, giving broad spectrum 1344 01:07:02,670 --> 01:07:05,850 antibiotics, giving fluids, giving 1345 01:07:05,850 --> 01:07:07,602 pressors and ventilators. 1346 01:07:07,602 --> 01:07:09,060 And all of those are interventions, 1347 01:07:09,060 --> 01:07:11,227 and all those interventions are recorded in the data 1348 01:07:11,227 --> 01:07:13,590 so that one could then ask counterfactual questions 1349 01:07:13,590 --> 01:07:14,880 from the data, like what would have happened 1350 01:07:14,880 --> 01:07:16,170 to this patient had they received 1351 01:07:16,170 --> 01:07:17,545 a different set of interventions?
1352 01:07:17,545 --> 01:07:20,010 Would we have prolonged their life, for example? 1353 01:07:20,010 --> 01:07:24,383 And so in an intensive care unit setting, most of the questions 1354 01:07:24,383 --> 01:07:26,550 that we want to ask about, not all, but many of them 1355 01:07:26,550 --> 01:07:29,010 are about dynamic treatments because it's not just 1356 01:07:29,010 --> 01:07:30,698 a single treatment but really about 1357 01:07:30,698 --> 01:07:32,490 a whole sequence of treatments responding 1358 01:07:32,490 --> 01:07:34,053 to the current patient condition. 1359 01:07:34,053 --> 01:07:36,720 And so that's where we'll really start to get into that material 1360 01:07:36,720 --> 01:07:40,310 next week, not in today's lecture. 1361 01:07:40,310 --> 01:07:41,300 Yep? 1362 01:07:41,300 --> 01:07:44,022 AUDIENCE: How do you make sure that your f function really 1363 01:07:44,022 --> 01:07:46,388 learned the relationship between T and the outcome? 1364 01:07:46,388 --> 01:07:48,180 DAVID SONTAG: That's a phenomenal question. 1365 01:07:48,180 --> 01:07:50,810 Where were you this whole course? 1366 01:07:50,810 --> 01:07:51,810 Thank you for asking it. 1367 01:07:51,810 --> 01:07:53,640 So I'll repeat it. 1368 01:07:53,640 --> 01:07:56,100 How do you know that your function f actually 1369 01:07:56,100 --> 01:07:59,850 learned something about the relationship between the input 1370 01:07:59,850 --> 01:08:04,730 X and the treatment T and the outcome? 1371 01:08:04,730 --> 01:08:07,070 And that really gets to the question of, 1372 01:08:07,070 --> 01:08:09,410 is my reduction actually valid? 1373 01:08:09,410 --> 01:08:19,979 So I've taken this problem and I've 1374 01:08:19,979 --> 01:08:23,350 reduced it to this machine learning problem, where 1375 01:08:23,350 --> 01:08:27,340 I take my data, and literally I just 1376 01:08:27,340 --> 01:08:29,770 learn a function f to try to predict well 1377 01:08:29,770 --> 01:08:32,550 the observations in the data. 1378 01:08:32,550 --> 01:08:34,705 And how do we know that that function f actually 1379 01:08:34,705 --> 01:08:36,330 does a good job at estimating something 1380 01:08:36,330 --> 01:08:38,682 like average treatment effect? 1381 01:08:38,682 --> 01:08:41,250 In fact, it might not. 1382 01:08:41,250 --> 01:08:44,250 And this is where things start to get 1383 01:08:44,250 --> 01:08:47,460 really tricky, particularly with high dimensional data. 1384 01:08:47,460 --> 01:08:51,520 Because it could happen, for example, that your treatment 1385 01:08:51,520 --> 01:08:55,470 decision is only one of a huge number of factors that affect 1386 01:08:55,470 --> 01:08:59,130 the outcome Y. And it could be that a much more 1387 01:08:59,130 --> 01:09:02,130 important factor is hidden in X. And because you don't have 1388 01:09:02,130 --> 01:09:05,640 much data, and because you have to regularize your learning 1389 01:09:05,640 --> 01:09:08,100 algorithm, let's say, with L1 or L2 regularization or maybe 1390 01:09:08,100 --> 01:09:10,590 early stopping if you're using a deep neural network, 1391 01:09:10,590 --> 01:09:15,790 your algorithm might never learn the actual dependence on T. 1392 01:09:15,790 --> 01:09:19,859 It might learn just to throw away T and just 1393 01:09:19,859 --> 01:09:23,649 use X to predict Y. And if that's the case, 1394 01:09:23,649 --> 01:09:26,160 you will never be able to infer these average treatment 1395 01:09:26,160 --> 01:09:27,750 effects accurately. 1396 01:09:27,750 --> 01:09:29,545 You'll have huge errors.
1397 01:09:29,545 --> 01:09:31,170 And that gets back to one of the slides 1398 01:09:31,170 --> 01:09:33,990 that I skipped, where I started out from this picture. 1399 01:09:33,990 --> 01:09:36,990 This is the machine learning picture saying, OK, a reduction 1400 01:09:36,990 --> 01:09:38,729 to machine learning is-- 1401 01:09:38,729 --> 01:09:40,229 now you add an additional feature, 1402 01:09:40,229 --> 01:09:41,760 which is your treatment decision, 1403 01:09:41,760 --> 01:09:44,795 and you learn that black box function f. 1404 01:09:44,795 --> 01:09:46,920 But this is where machine learning causal inference 1405 01:09:46,920 --> 01:09:50,100 starts to differ because we don't actually 1406 01:09:50,100 --> 01:09:55,703 care about the quality of predicting Y. 1407 01:09:55,703 --> 01:09:57,495 We can measure your root mean squared error 1408 01:09:57,495 --> 01:10:00,630 in predicting Y given your X's and T's, and that error 1409 01:10:00,630 --> 01:10:02,550 might be low. 1410 01:10:02,550 --> 01:10:05,400 But you can run into these failure modes 1411 01:10:05,400 --> 01:10:08,130 where it just completely ignores T, for example. 1412 01:10:08,130 --> 01:10:10,263 So T is special here. 1413 01:10:10,263 --> 01:10:12,180 So really, the picture we want to have in mind 1414 01:10:12,180 --> 01:10:15,760 is that T is some parameter of interest. 1415 01:10:15,760 --> 01:10:19,500 We want to learn a model f such that if we twiddle T, 1416 01:10:19,500 --> 01:10:22,500 we can see how there is a differential effect on Y based 1417 01:10:22,500 --> 01:10:24,297 on twiddling T. That's what we truly 1418 01:10:24,297 --> 01:10:26,130 care about when we're using machine learning 1419 01:10:26,130 --> 01:10:28,290 for causal inference. 1420 01:10:28,290 --> 01:10:30,150 And so that's really the gap, that's 1421 01:10:30,150 --> 01:10:32,930 the gap in our understanding today. 1422 01:10:32,930 --> 01:10:34,680 And it's really an active area of research 1423 01:10:34,680 --> 01:10:37,320 to figure out how do you change the whole machine learning 1424 01:10:37,320 --> 01:10:40,938 paradigm to recognize that when you're using machine learning 1425 01:10:40,938 --> 01:10:42,480 for causal inference, you're actually 1426 01:10:42,480 --> 01:10:44,995 interested in something a little bit different. 1427 01:10:44,995 --> 01:10:47,370 And by the way, that's a major area of my lab's research, 1428 01:10:47,370 --> 01:10:49,037 and we just published a series of papers 1429 01:10:49,037 --> 01:10:50,503 trying to answer that question. 1430 01:10:50,503 --> 01:10:52,170 Beyond the scope of this course, but I'm 1431 01:10:52,170 --> 01:10:56,370 happy to send you those papers if anyone's interested. 1432 01:10:56,370 --> 01:11:00,880 So that type of question is extremely important. 1433 01:11:00,880 --> 01:11:04,740 It doesn't show up quite as much when your X's aren't very high 1434 01:11:04,740 --> 01:11:07,560 dimensional and where things like regularization 1435 01:11:07,560 --> 01:11:09,090 don't become important. 1436 01:11:09,090 --> 01:11:11,310 But once your X becomes high dimensional 1437 01:11:11,310 --> 01:11:14,160 and once you want to start to consider more and more complex 1438 01:11:14,160 --> 01:11:16,050 f's during your fitting, like you 1439 01:11:16,050 --> 01:11:18,510 want to use deep neural networks, for example, 1440 01:11:18,510 --> 01:11:22,650 these differences in goals become extremely important. 1441 01:11:35,790 --> 01:11:37,930 So there are other ways in which things can fail. 
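As a small, hedged illustration of that failure mode (this is not code from the lecture, and the per-arm variant at the end is one simple workaround rather than the course's prescribed fix): with many strong covariates and a heavily penalized pooled model, the fitted coefficient on T can be shrunk all the way to zero, so the implied treatment effect is zero even though the true effect is not.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)
n, d = 300, 20

# Hypothetical data: 20 covariates with large effects, a treatment with a small effect.
# Treatment is randomized here, so any error comes from regularization, not confounding.
X = rng.normal(size=(n, d))
t = rng.integers(0, 2, size=n).astype(float)
beta = rng.normal(scale=3.0, size=d)           # strong covariate effects
y = X @ beta + 0.3 * t + rng.normal(size=n)    # true ATE is +0.3

# Single pooled model with T as one more (heavily penalized) feature.
pooled = Lasso(alpha=0.5).fit(np.column_stack([X, t]), y)
print("pooled model's coefficient on T:", round(pooled.coef_[-1], 3))  # often shrunk to 0.0

# Separate (unpenalized) outcome model per arm: T can no longer be regularized away.
m1 = LinearRegression().fit(X[t == 1], y[t == 1])
m0 = LinearRegression().fit(X[t == 0], y[t == 0])
cate = m1.predict(X) - m0.predict(X)
print("per-arm estimate of the ATE:", round(cate.mean(), 3))           # roughly +0.3
```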
1442 01:11:37,930 --> 01:11:43,205 So I want to give you here an example where-- 1443 01:11:43,205 --> 01:11:44,580 shoot, I'm answering my question. 1444 01:11:46,930 --> 01:11:47,430 OK. 1445 01:11:50,840 --> 01:11:52,520 No one saw that slide. 1446 01:11:52,520 --> 01:11:54,650 Question-- where did the overlap assumptions 1447 01:11:54,650 --> 01:11:59,930 show up in our approach for estimating average treatment 1448 01:11:59,930 --> 01:12:02,705 effect using covariate adjustment? 1449 01:12:17,580 --> 01:12:19,115 Let me go back to the formula. 1450 01:12:24,630 --> 01:12:27,743 Someone who hasn't spoken today, hopefully. 1451 01:12:27,743 --> 01:12:28,910 You can be wrong, it's fine. 1452 01:12:31,430 --> 01:12:32,415 Yeah, in the back? 1453 01:12:32,415 --> 01:12:34,290 AUDIENCE: Is it the version with the same age 1454 01:12:34,290 --> 01:12:37,520 in receiving treatment B and treatment B? 1455 01:12:37,520 --> 01:12:43,917 DAVID SONTAG: So maybe you have an individual with some age-- 1456 01:12:43,917 --> 01:12:45,500 we're going to want to be able to look 1457 01:12:45,500 --> 01:12:48,358 at the difference between what f predicts for that individual 1458 01:12:48,358 --> 01:12:50,150 if they got treatment A versus treatment B, 1459 01:12:50,150 --> 01:12:52,100 or one versus zero. 1460 01:12:52,100 --> 01:12:57,500 And let me try to lead this a little bit. 1461 01:12:57,500 --> 01:12:59,000 And it might happen in your data set 1462 01:12:59,000 --> 01:13:04,310 that for individuals like them, you only ever 1463 01:13:04,310 --> 01:13:07,100 observe treatment one and there's no one even remotely 1464 01:13:07,100 --> 01:13:09,800 like them who you observe treatment zero. 1465 01:13:09,800 --> 01:13:13,690 So what's this function going to output then 1466 01:13:13,690 --> 01:13:17,290 when you input zero for that second argument? 1467 01:13:17,290 --> 01:13:19,850 Everyone say out loud. 1468 01:13:19,850 --> 01:13:22,200 Garbage? 1469 01:13:22,200 --> 01:13:22,700 Right? 1470 01:13:22,700 --> 01:13:27,710 If in your data set you never observed anyone even remotely 1471 01:13:27,710 --> 01:13:31,730 similar to Xi who received treatment zero, 1472 01:13:31,730 --> 01:13:34,428 then this function is basically undefined for that individual. 1473 01:13:34,428 --> 01:13:36,470 I mean, yeah, your function will output something 1474 01:13:36,470 --> 01:13:41,610 because you fit it, but it's not going to be the right answer. 1475 01:13:41,610 --> 01:13:45,530 And so that's where this assumption starts to show up. 1476 01:13:45,530 --> 01:13:49,910 When one talks about the sample complexity of learning 1477 01:13:49,910 --> 01:13:53,030 these functions f to do covariate adjustment, 1478 01:13:53,030 --> 01:13:55,730 and when one talks about the consistency 1479 01:13:55,730 --> 01:13:57,140 of these arguments-- for example, 1480 01:13:57,140 --> 01:13:58,640 you'd like to be able to make claims 1481 01:13:58,640 --> 01:14:01,220 that as the amount of data grows to, let's 1482 01:14:01,220 --> 01:14:04,430 say, infinity, that this is the right answer-- gives you 1483 01:14:04,430 --> 01:14:05,480 the right estimate. 1484 01:14:05,480 --> 01:14:07,490 So that's the type of proof which 1485 01:14:07,490 --> 01:14:10,560 is often given in the causal inference literature. 
1486 01:14:10,560 --> 01:14:13,920 Well, if you have overlap, then as the amount of data 1487 01:14:13,920 --> 01:14:18,138 goes to infinity, you will observe someone, 1488 01:14:18,138 --> 01:14:19,930 like the person who received treatment one, 1489 01:14:19,930 --> 01:14:22,120 you'll observe someone who also received treatment zero. 1490 01:14:22,120 --> 01:14:23,920 It might have taken you a huge amount of data to get there 1491 01:14:23,920 --> 01:14:26,110 because treatment zero might have been much less 1492 01:14:26,110 --> 01:14:27,520 likely than treatment one. 1493 01:14:27,520 --> 01:14:30,970 But because the probability of treatment zero is not zero, 1494 01:14:30,970 --> 01:14:32,810 eventually you'll see someone like that. 1495 01:14:32,810 --> 01:14:34,477 And so eventually you'll get enough data 1496 01:14:34,477 --> 01:14:37,120 in order to learn a function which can extrapolate correctly 1497 01:14:37,120 --> 01:14:39,700 for that individual. 1498 01:14:39,700 --> 01:14:43,930 And so that's where overlap comes in 1499 01:14:43,930 --> 01:14:46,450 in giving that type of consistency argument. 1500 01:14:46,450 --> 01:14:51,100 Of course, in reality, you never have infinite data. 1501 01:14:51,100 --> 01:14:54,280 And so these questions about trade-offs 1502 01:14:54,280 --> 01:14:56,050 between the amount of data you have 1503 01:14:56,050 --> 01:14:59,470 and the fact that you never truly have 1504 01:14:59,470 --> 01:15:02,350 empirical overlap with a small amount of data, 1505 01:15:02,350 --> 01:15:05,380 and answering when can you extrapolate correctly 1506 01:15:05,380 --> 01:15:07,750 despite that is the critical question 1507 01:15:07,750 --> 01:15:09,800 that one needs to answer, but is, by the way, 1508 01:15:09,800 --> 01:15:11,662 not studied very well in the literature 1509 01:15:11,662 --> 01:15:13,870 because people don't usually think in terms of sample 1510 01:15:13,870 --> 01:15:16,148 complexity in that field. 1511 01:15:16,148 --> 01:15:18,190 That's where computer scientists can start really 1512 01:15:18,190 --> 01:15:20,560 to contribute to this literature and bringing things 1513 01:15:20,560 --> 01:15:22,060 that we often think about in machine 1514 01:15:22,060 --> 01:15:26,120 learning to this new topic. 1515 01:15:26,120 --> 01:15:30,110 So I've got a couple of minutes left. 1516 01:15:30,110 --> 01:15:31,860 Are there any other questions, or should I 1517 01:15:31,860 --> 01:15:33,840 introduce some new material in one minute? 1518 01:15:33,840 --> 01:15:35,550 Yeah? 1519 01:15:35,550 --> 01:15:38,160 AUDIENCE: So you said that the average treatment effect 1520 01:15:38,160 --> 01:15:40,020 estimator here is consistent. 1521 01:15:40,020 --> 01:15:43,070 But does it matter if we choose the wrong-- 1522 01:15:43,070 --> 01:15:46,830 do we have to choose some functional form of the features 1523 01:15:46,830 --> 01:15:47,413 to the effect? 1524 01:15:47,413 --> 01:15:48,622 DAVID SONTAG: Great question. 1525 01:15:48,622 --> 01:15:51,710 AUDIENCE: Is it consistent even if we choose a completely wrong 1526 01:15:51,710 --> 01:15:52,582 function or formula? 1527 01:15:52,582 --> 01:15:53,290 DAVID SONTAG: No. 1528 01:15:53,290 --> 01:15:53,910 AUDIENCE: That's a different thing? 1529 01:15:53,910 --> 01:15:54,520 DAVID SONTAG: No, no. 1530 01:15:54,520 --> 01:15:56,103 You're asking all the right questions. 1531 01:15:56,103 --> 01:15:58,050 Good job today, everyone. 1532 01:15:58,050 --> 01:16:00,090 So, no. 
1533 01:16:00,090 --> 01:16:03,750 If you walk through that argument I made, 1534 01:16:03,750 --> 01:16:04,830 I assumed two things. 1535 01:16:04,830 --> 01:16:06,660 First, that you observe enough data such 1536 01:16:06,660 --> 01:16:12,330 that you can have any chance of extrapolating correctly. 1537 01:16:12,330 --> 01:16:13,943 And second, implicit in that statement 1538 01:16:13,943 --> 01:16:15,360 is that you're choosing a function 1539 01:16:15,360 --> 01:16:17,130 family which is powerful enough that it 1540 01:16:17,130 --> 01:16:19,060 can extrapolate correctly. 1541 01:16:19,060 --> 01:16:21,255 So if your true function is-- 1542 01:16:24,010 --> 01:16:28,970 if you think back to this figure I showed you here, 1543 01:16:28,970 --> 01:16:30,830 if the true potential outcome functions are 1544 01:16:30,830 --> 01:16:34,250 these quadratic functions and you're fitting them 1545 01:16:34,250 --> 01:16:36,340 with a linear function, then no matter 1546 01:16:36,340 --> 01:16:37,840 how much data you have, you're always 1547 01:16:37,840 --> 01:16:42,230 going to get wrong estimates, because this argument really 1548 01:16:42,230 --> 01:16:45,080 requires that you consider more and more complex 1549 01:16:45,080 --> 01:16:48,710 non-linearities as your amount of data grows. 1550 01:16:48,710 --> 01:16:51,050 So now here's a visual depiction of what can go wrong 1551 01:16:51,050 --> 01:16:53,490 if you don't have overlap. 1552 01:16:53,490 --> 01:16:55,070 So now I've taken out-- 1553 01:16:55,070 --> 01:16:57,530 previously, I had one or two red points over here and one 1554 01:16:57,530 --> 01:17:00,080 or two blue points over here, but I've taken those out. 1555 01:17:00,080 --> 01:17:02,460 So in your data all you have are these blue points 1556 01:17:02,460 --> 01:17:03,335 and those red points. 1557 01:17:06,500 --> 01:17:09,350 So all you have are those points, and now one 1558 01:17:09,350 --> 01:17:12,140 can learn as good functions as one can to try to, 1559 01:17:12,140 --> 01:17:14,840 let's say, minimize the mean squared error of predicting 1560 01:17:14,840 --> 01:17:17,840 these blue points and minimize the mean squared error 1561 01:17:17,840 --> 01:17:19,400 of predicting those red points. 1562 01:17:19,400 --> 01:17:21,530 And what you might get out is something-- maybe 1563 01:17:21,530 --> 01:17:23,130 you'll decide on a linear function. 1564 01:17:23,130 --> 01:17:26,090 That's as good as you could do if all you 1565 01:17:26,090 --> 01:17:28,940 have are those red points. 1566 01:17:28,940 --> 01:17:30,910 And so even if you were willing to consider 1567 01:17:30,910 --> 01:17:34,010 more and more complex hypothesis classes, 1568 01:17:34,010 --> 01:17:36,950 here, if you tried to consider a more complex hypothesis 1569 01:17:36,950 --> 01:17:39,620 class than this line, you'd probably just be overfitting 1570 01:17:39,620 --> 01:17:41,360 to the data you have. 1571 01:17:41,360 --> 01:17:44,750 And so you decide on that line, and 1572 01:17:44,750 --> 01:17:47,480 because you had no data over here, 1573 01:17:47,480 --> 01:17:51,680 you don't even know that it's not a good fit over there. 1574 01:17:51,680 --> 01:17:53,360 And then you end up getting 1575 01:17:53,360 --> 01:17:54,485 completely wrong estimates.
1576 01:17:54,485 --> 01:17:58,050 For example, if you asked about the CATE for a young person, 1577 01:17:58,050 --> 01:18:01,610 it would have the wrong sign over here, because the two 1578 01:18:01,610 --> 01:18:03,200 lines have flipped. 1579 01:18:03,200 --> 01:18:07,760 So that's an example of how one can start to get errors. 1580 01:18:07,760 --> 01:18:10,790 And when we begin Thursday's lecture, 1581 01:18:10,790 --> 01:18:13,610 we're going to pick up right where we left off today, 1582 01:18:13,610 --> 01:18:17,370 and I'll talk about this issue in a little bit more detail. 1583 01:18:17,370 --> 01:18:21,290 I'll talk about how, if one were to learn a linear function, 1584 01:18:21,290 --> 01:18:23,090 one could actually interpret 1585 01:18:23,090 --> 01:18:25,130 the coefficients of that linear function 1586 01:18:25,130 --> 01:18:27,500 in a causal way, 1587 01:18:27,500 --> 01:18:29,435 under the very strong assumption 1588 01:18:29,435 --> 01:18:31,310 that the two true 1589 01:18:31,310 --> 01:18:32,700 potential outcomes are linear. 1590 01:18:32,700 --> 01:18:35,180 So that's what we'll return to on Thursday.
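To make that last failure mode concrete, here is a small simulated sketch in the spirit of the figure just described. It is not the lecture's example: the quadratic potential-outcome functions, the age ranges, and the specific numbers are made-up assumptions, chosen so that controls are only observed among younger people and treated individuals only among older ones (no overlap), and a misspecified linear fit to each arm then extrapolates so badly that the estimated CATE for a young person comes out with the wrong sign.

```python
# Illustrative simulation (not from the lecture): quadratic potential outcomes,
# no overlap in age between the two treatment arms, linear outcome models.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def y1_true(age):                      # assumed potential outcome under treatment 1
    return 0.05 * (age - 55) ** 2 + 5

def y0_true(age):                      # assumed potential outcome under treatment 0
    return 0.005 * (age - 20) ** 2 + 10

# No overlap: controls are only young (ages 20-50), treated are only old (50-80).
age0 = rng.uniform(20, 50, 200)
age1 = rng.uniform(50, 80, 200)
y0_obs = y0_true(age0) + rng.normal(0, 2, 200)
y1_obs = y1_true(age1) + rng.normal(0, 2, 200)

# Fit one linear model per arm -- misspecified, since the truth is quadratic.
f0 = LinearRegression().fit(age0.reshape(-1, 1), y0_obs)
f1 = LinearRegression().fit(age1.reshape(-1, 1), y1_obs)

# Query the CATE for a young person, far from any treated data point:
# f1 is being asked to extrapolate into a region it never saw.
age_query = np.array([[25.0]])
cate_hat = f1.predict(age_query)[0] - f0.predict(age_query)[0]
cate_true = y1_true(25.0) - y0_true(25.0)
print(f"true CATE at age 25:      {cate_true:+.1f}")   # about +40
print(f"estimated CATE at age 25: {cate_hat:+.1f}")    # about -36: wrong sign
```

With the numbers assumed here, the true effect for a 25-year-old is strongly positive, but the two extrapolated lines have crossed in the region with no data, so the estimate comes out strongly negative, which is the flip described above. And even with unlimited data in those two age ranges, the linear models could not recover the quadratic truth, which is the misspecification issue raised in the last question.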