DAVID SONTAG: A three-part lecture today, and I'm still continuing on the theme of reinforcement learning. Part one, I'm going to be speaking, and I'll be following up on last week's discussion about causal inference and Tuesday's discussion on reinforcement learning. I'll be going into one more subtlety that arises there, one where we can develop some nice mathematical methods to help. And then I'm going to turn over the show to Barbra, who I'll formally introduce when the time comes. She's going to both talk about some of her work on developing and evaluating dynamic treatment regimes, and then she will lead a discussion on the sepsis paper, which was required reading for today's class. So those are the three parts of today's lecture.

So I want you to return, put yourself back in the mindset of Tuesday's lecture, where we talked about reinforcement learning. Remember that the goal of reinforcement learning was to optimize some reward. Specifically, our goal is to find some policy, which I'll denote pi star, which is the arg max over all possible policies pi of V of pi, where, just to remind you, V of pi is the value of the policy pi. Formally, it's defined as the expectation of the sum of the rewards across time. The reason why I'm writing this as an expectation with respect to pi is because there's stochasticity both in the environment, and possibly pi is going to be a stochastic policy. And this is summing over the time steps, because this is not just a single time step problem; we're going to be considering interventions across time, with a reward at each point in time. And that reward function could either give a reward at each point in time, or you might imagine that it is 0 for all time steps except for the last time step.

So the first question I want us to think about is, well, what are the implications of this as a learning paradigm?
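In notation, the objective on the board reads as follows (a reconstruction of the spoken description, writing R_t for the reward at time t and T for the horizon):

\[
\pi^* = \arg\max_{\pi} V(\pi),
\qquad
V(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} R_t\right].
\]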
If we look at what's going on over here, hidden in my story is also an expectation over x, the patient, for example, or the initial state. And so this intuitively is saying, let's try to find a policy that has high expected reward, averaged [INAUDIBLE] over all patients. And I just want you to think about whether that is indeed the right goal. Can anyone think of a setting where that might not be desirable?

Yeah.

AUDIENCE: What if the reward is the patient living or dying? You don't want it to have high ratings like saving two patients and [INAUDIBLE] and expect the same [INAUDIBLE].

DAVID SONTAG: So what happens if this reward is something mission critical, like a patient dying? You really want to try to avoid that from happening as much as possible. Of course, there are other criteria that we might be interested in as well. Both in Frederick's lecture on Tuesday and in the readings, we talked about how there might be other aspects, about making sure that a patient is not just alive but also healthy, which might play into your reward functions. And there might be rewards associated with those. And if you were to just, for example, put a positive or negative infinity on a patient dying, that's a nonstarter, right, because if you did that, unfortunately in this world, we're not always going to be able to keep patients alive. And so you're going to get an infeasible optimization problem. So minus infinity is not an option. We're going to have to put some finite number on it in this type of approach.

But then you're going to start trading off between patients. In some cases, you might have a very high reward for-- well, there are two different solutions that you might imagine: one solution where the reward is somewhat balanced across patients, and another situation where you have really small values of reward for some patients and a few patients with very large rewards. And both of them could have the same average, obviously. But both are not necessarily equally useful.
We might want to say that we prefer to avoid that worst-case situation. So one could imagine other ways of formulating this optimization problem: maybe you want to control the worst-case reward instead of the average-case reward, or maybe you want to say something about different quartiles. I just wanted to point that out, because really that's the starting place for a lot of the work that we're doing here.

So now I want us to think through, OK, returning back to this goal: we've done our policy iteration, or we've done our Q-learning, and we get a policy out. And we might now want to know, what is the value of that policy? What is our estimate of that quantity? Well, to get that, one could just try to read it off from the results of Q-learning by computing that what I'm calling V pi hat, the estimate, is just equal to a maximum over actions a of your Q function evaluated at whatever your initial state is and the optimal choice of action a. So all I'm saying here is that the last step of the algorithm might be to ask, well, what is the expected reward of this policy? And if you remember, the Q-learning algorithm is, in essence, a dynamic programming algorithm, working its way from the large values of time back up to the present. And it is indeed actually computing this expected value that you're interested in. So you could just read it off from the Q values at the very end.

But I want to point out that here there's an implicit policy built in. So I'm going to compare this in just a second to what happens under the causal inference scenario-- so just a single time step, in the potential outcomes framework that we're used to. Notice that the value of this policy, the reason why it's a function of pi, is because the value is a function of every subsequent action that you're taking as well.
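In notation, the read-off from the Q values mentioned above is the following (a reconstruction, writing s_0 for the initial state and \hat{Q} for the learned Q function):

\[
\hat{V}^{\pi} = \max_{a} \hat{Q}(s_0, a).
\]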
And so now let's compare that for a second to what happens in the potential outcomes framework. So there, our starting place-- so now I'm going to turn our attention for just one moment from reinforcement learning back to causal inference. In reinforcement learning, we talked about policies: how do we find policies that do well in terms of some expected reward? But when we were talking about causal inference, we only used words like average treatment effect or conditional average treatment effect. For example, to estimate the conditional average treatment effect, what we said is, if we use a covariate adjustment approach, we first learn some function f of x comma t, which is intended to be an approximation of the expected value of your outcome-- the potential outcome y of t-- given x. There. So that's the notation. So the goal of covariate adjustment was to estimate this quantity.

And we could use that then to try to construct a policy. For example, you could think about the policy pi of x, which simply looks to see-- we'll say it's 1 if CATE, or your estimate of CATE for x, is positive, and 0 otherwise. Just to remind you, the way that we got the estimate of CATE for an individual x was just by looking at f of x comma 1 minus f of x comma 0.

So if we have a policy-- so now we're going to start thinking about policies in the context of causal inference, just like we were doing in reinforcement learning. And I want us to think through, what would the analogous value of the policy be? How good is that policy? It could be another policy, but right now I'm just going to focus on this policy that I show up here. Well, one approach to evaluating how good that policy is, is exactly analogous to what we did in reinforcement learning.
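Collecting the board notation so far in one place (a reconstruction of the spoken definitions):

\[
f(x, t) \approx \mathbb{E}[Y(t) \mid x],
\qquad
\widehat{\mathrm{CATE}}(x) = f(x, 1) - f(x, 0),
\qquad
\pi(x) = \mathbf{1}\{\widehat{\mathrm{CATE}}(x) > 0\}.
\]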
In essence, what we're going to say is we evaluate the quality of the policy by summing over your empirical data of pi of xi. So this term is going to be 1 if the policy says to give treatment 1 to individual xi; in that case, we say that the value is f of x comma 1. Or, if the policy would give treatment 0, the value of the policy on that individual is 1 minus pi of x times f of x comma 0. So I'm going to call this an empirical estimate of what you should think about as the reward of a policy pi. And it's exactly analogous to the estimate of V of pi that you would get in a reinforcement learning context. But now we're talking about policies explicitly.

So let's try to dig down a little bit deeper and think about what this is actually saying. Imagine the story where you just have a single covariate x. We'll think about x as being, let's say, the patient's age. And unfortunately there's just one color here, but I'll do my best with that. Imagine that the potential outcome y0 as a function of the patient's age x looks like this. Now imagine that the other potential outcome, y1, looks like that. So I'll call this the y1 potential outcome.

Suppose now that the policy that we're defining is this: we're going to give treatment 1 if the conditional average treatment effect is positive and 0 otherwise. I want everyone to draw what the value of that policy is on a piece of paper-- I'm sorry-- I want everyone to write on a piece of paper what the value of the policy would be for each individual. So it's going to be a function of x. And now, what I'm looking for is y of pi of x. So I'm looking for you to draw that plot. And feel free to talk to your neighbor. In fact, I encourage you to talk to your neighbor.

[SIDE CONVERSATION]

Just to try to connect this a little bit better to what I have up here, I'm going to assume that f-- this curve is f of x comma 1, and this one is f of x comma 0.
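For reference while you draw, the estimator just described is, in notation (a reconstruction of the board):

\[
\hat{R}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \left[ \pi(x_i)\, f(x_i, 1) + \big(1 - \pi(x_i)\big)\, f(x_i, 0) \right].
\]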
All right. Any guesses? What does this plot look like? Someone who hasn't spoken in the last week and a half, if possible. Yeah?

AUDIENCE: Does it take the max of the functions at all points? Like, it would be y0 up until they intersect and then y1 afterward?

DAVID SONTAG: So it would be something like this until the intersection point--

AUDIENCE: Yeah.

DAVID SONTAG: --and then like that afterwards. Yeah. That's exactly what I'm going for. And let's try to think through: why is that the value of the policy? Well, here the CATE, which is looking at the difference between these two lines, is negative. So for every x up to this crossing point, the policy that we've defined over there is going to perform action-- wait. Am I drawing this correctly? Maybe it's actually the opposite, right? This should be doing action 1.

Here. OK. So here the CATE is negative, and so by my definition, the action performed is action 0. And so the value of the policy is actually this one.

[INTERPOSING VOICES]

DAVID SONTAG: Oh. Wait. Oh, good. [INAUDIBLE] Because this is the graph I have in my notes. Oh, good. OK. I was getting worried. OK. So it's this action, all the way up until you get over here. And then over here, the CATE suddenly becomes positive, and so the action chosen is 1. And so the value of that policy is y1.

So one could write this a little bit differently. In the case of just two actions, one could write this equivalently as an average over the data points of the maximum of f of x comma 0 and f of x comma 1.
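A minimal numerical sketch of this covariate-adjustment evaluation (the outcome models f0 and f1 below are toy stand-ins for whatever regressions you actually fit; everything here is hypothetical):

import numpy as np

# Toy stand-ins for fitted outcome models f(x, t) ~= E[Y(t) | x];
# in practice these come from covariate adjustment (regression).
def f0(x):
    return 1.0 - 0.02 * x      # f(x, 0): declines with age
def f1(x):
    return 0.4 + 0.01 * x      # f(x, 1): crosses f0 at x = 20

x = np.random.uniform(0, 80, size=1000)   # single covariate, e.g. age

# CATE-based policy: treat iff the estimated effect is positive.
pi = (f1(x) - f0(x) > 0).astype(float)

# Model-based value estimate: average of pi*f(x,1) + (1-pi)*f(x,0).
r_hat = np.mean(pi * f1(x) + (1 - pi) * f0(x))

# For this particular pi, that equals the average of the pointwise max.
assert np.isclose(r_hat, np.mean(np.maximum(f0(x), f1(x))))
print(r_hat)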
And this simplification, turning this formula into this formula, is making the assumption that the pi we're evaluating is precisely this pi. So this simplification is only for that pi. For another policy-- one which is not looking at CATE, or, for example, one which might threshold CATE at some gamma-- it wouldn't quite be this. It would be something else.

But I've gone a step further here. What I've shown you right here is not the average value but sort of the individual values; I have shown you the max function. But what this is actually looking at is the expected reward, which is now averaging across all x. So to truly draw a connection between this plot we're drawing and the average reward of that policy, what we should be looking at is the average of these two functions, which, we'll say, is something like that. And that value is the expected reward.

Now, this all goes to show that the expected reward of this policy is not a quantity that we've considered in the previous lectures, at least not in the previous lectures on causal inference. This is not the same as the average treatment effect, for example.

So I've just given you one way to think through, number one, what is the policy that you might want to derive when you're doing causal inference? And number two, what is one way to estimate the value of that policy, which goes through the process of estimating potential outcomes via covariate adjustment? But just like when we talked in causal inference about two approaches-- or more than two, but we focused on two, covariate adjustment and inverse propensity score weighting-- you might wonder, is there another approach to this problem altogether? Is there an approach which wouldn't have had to go through estimating the potential outcomes? And that's what I'll spend the rest of this third of the lecture talking about.
And so, to help you page this back in, remember that we derived in last Thursday's lecture an estimator for the average treatment effect, which was 1 over n times the sum over data points that got treatment 1 of yi, the observed outcome for that data point, divided by the propensity score, which I'm just going to write as ei-- so ei is equal to the probability of observing t equals 1 given the data point xi-- minus a sum over data points i such that ti equals 0 of yi divided by 1 minus ei.

And by the way, there was a lot of confusion in class about why I have a 1 over n here and a 1 over n here-- right now I've just pulled it out front altogether-- and not 1 over the number of treated data points and 1 over the number of untreated data points. I expanded the derivation that I gave in class, and I posted new slides online after class. So if you're curious about that, go to those slides and look at the derivation.

So in a very analogous way now, I'm going to give you a new estimator for this same quantity that I had over here, the expected reward of a policy. Notice that this estimator here made sense for any policy. It didn't have to be the policy which looked at whether CATE is greater than 0 or not; this held for any policy. The simplification I gave was only in this particular setting. I'm going to give you now another estimator for the average value of a policy which doesn't go through estimating potential outcomes at all. Analogous to this, it's just going to make use of the propensity scores. And I'll call it R hat, and now I'm going to put a superscript IPW, for inverse propensity weighted. It's a function of pi, and it's given to you by the following formula: 1 over n, sum over the data points, of an indicator function for whether the treatment which was actually given to the i-th patient is equal to what the policy would have done for the i-th patient. And by the way, here I'm assuming that pi is a deterministic function.
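For reference, last Thursday's average treatment effect estimator, reconstructed in notation:

\[
\widehat{\mathrm{ATE}}^{\mathrm{IPW}}
= \frac{1}{n} \left[ \sum_{i:\, t_i = 1} \frac{y_i}{e_i} \;-\; \sum_{i:\, t_i = 0} \frac{y_i}{1 - e_i} \right],
\qquad
e_i = p(T = 1 \mid x_i).
\]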
So the policy says, for this patient, you should do this treatment. So we're going to look at just the data points for which the observed treatment is consistent with what the policy would have done for that patient; this indicator function is 0 otherwise. And we're going to divide by the probability of ti given xi.

So the way I'm writing this, by the way, is very general. This formula will hold for nonbinary treatments as well. And that's one of the really nice things about thinking about policies: whereas the average treatment effect sort of makes sense in a comparative sense, comparing one treatment to another, when we talk about how good a policy is, it's not a comparative statement at all. The policy does something for everyone. You could ask, well, what is the average value of the outcomes that you get for the actions we're taking for those individuals? So that's why I'm writing it in a slightly more general fashion already here. Times yi, obviously.

So this is now a new estimator. I'm not going to derive it for you in class, but the derivation is very similar to what we did last week when we derived the average treatment effect estimator. And the critical point is that we're dividing by that propensity score, just like we did over there.

So this, if all of the assumptions made sense and you had infinite data, should give you exactly the same estimate as this. But here, you're not estimating potential outcomes at all. So you never have to try to impute the counterfactuals. All it relies on knowing is that you have the propensity scores for each of the data points in your training set, or in a data set. So, for example, this opens the door to tons of new exciting directions. Imagine that you had a very large observational data set, and you learned a policy from it.
For example, you might have done covariate adjustment and then said, OK, based on covariate adjustment, this is my new policy. So you might have gotten it via that approach. Now you want to know, how good is that? Well, suppose that you then run a randomized controlled trial. And when you run a randomized controlled trial, you have 100 people, maybe 200 people-- so not that many. Not nearly enough people to have actually estimated your policy alone; you might have needed thousands or millions of individuals to estimate your policy. Now you're only going to have the couple hundred individuals that you could actually afford to include in a randomized controlled trial. For those people, because you're flipping a coin for which treatment they're going to get-- suppose we're in a binary setting with only two treatments-- this value is always 1/2. And what I'm giving you here is going to be an unbiased estimate of how good that policy is, which one can now compute using that randomized controlled trial.

Now, this also might lead you to think through the question of, well, rather than obtaining a policy through the lens of estimating CATE, maybe we could have skipped that altogether. For example, suppose that we had that randomized controlled trial data, and imagine that rather than 100 individuals, you had a really large randomized controlled trial with 10,000 individuals in it. This now opens the door to thinking about directly maximizing or minimizing pi with respect to this quantity-- depending on whether you want it to be large or small-- which completely bypasses the goal of estimating the conditional average treatment effect. And you'll notice how this looks exactly like a classification problem. This quantity here looks exactly like a 0-1 loss. And the only difference is that you're weighting each of the data points by this inverse propensity.
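A minimal sketch of this inverse propensity weighted value estimate, assuming a deterministic policy and a binary trial where the propensity is a known 1/2 (the propensity floor anticipates the clipping discussed next; all data and names here are toy):

import numpy as np

def ipw_policy_value(pi_x, t, y, propensity, clip=0.01):
    # R_hat^IPW(pi) = (1/n) * sum_i 1{t_i == pi(x_i)} * y_i / p(t_i | x_i)
    # pi_x: action the policy takes per person; t: treatment received;
    # y: observed outcome; propensity: p(t_i | x_i) of the received treatment.
    p = np.clip(propensity, clip, None)   # floor trades variance for bias
    match = (t == pi_x).astype(float)     # indicator: policy agrees with data
    return np.mean(match * y / p)

# Usage on a toy binary RCT: a fair coin flip means p(t | x) = 1/2 for all.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 80, n)
t = rng.integers(0, 2, n)                        # randomized treatment
y = rng.normal(0.5 + 0.3 * t * (x > 20), 0.1)    # toy outcomes
pi_x = (x > 20).astype(int)                      # some previously learned policy
print(ipw_policy_value(pi_x, t, y, np.full(n, 0.5)))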
So one can reduce the problem of actually finding an optimal policy to that of a weighted classification problem, in the case of a discrete set of treatments.

There are two big caveats to that line of thinking. The first major caveat is that you have to know these propensity scores. If you have data coming from a randomized controlled trial, you will know the propensity scores, or if you have, for example, some control over the data generation process. For example, if you are an ad company and you get to choose which ad to show to your customers, and then you look to see who clicks on what, you might know exactly what the policy was that was showing things. In that case, you might know the propensity scores exactly. In health care, other than in randomized controlled trials, we typically don't know this value. So we either have to have a large enough randomized controlled trial that we won't overfit by trying to directly minimize this, or we have to work within an observational data setting, where we have to estimate the propensity scores directly. You would then have a two-step procedure, where first you estimate the propensity scores, for example by doing logistic regression, and then you attempt to maximize or minimize this quantity in order to find the optimal policy.

And that has a lot of challenges, because this quantity shown at the very bottom here could be really small or really large in an observational data set, due to these issues of having very small overlap between your treatments. And this being very small implies that the variance of this estimator is very, very large. And so when one wants to use an approach like this-- similar to when one wants to use an average treatment effect estimator-- when you're estimating these propensities, often you might need to do things like clipping of the propensity scores in order to prevent the variance from being too large. That, however, typically leads to a biased estimate.

I wanted to give you a couple of references here. So one is Swaminathan and Joachims, J-O-A-C-H-I-M-S, ICML 2015.
In that paper, they tackle this question. They focus on the setting where the propensity scores are known, such as data coming from a randomized controlled trial. And they recognize that you might decide that you prefer something like a biased estimator, because of the fact that these propensity scores could be really small. So they use some generalization results from the machine learning theory community in order to try to control the variance of the estimator as a function of these propensity scores. And they then directly optimize the policy-- what they call counterfactual risk minimization-- in order to allow one to generalize as well as possible from the small amount of data you might have available.

A second reference that I want to give, just to point you into this literature if you're interested, is by Nathan Kallus and his student, I believe Angela Zhou, from NeurIPS 2018. And that was a paper which was one of the optional readings for last Thursday's class. In that paper, they also start from something like this, from this perspective. And they say, oh, now that we're working in this framework, one could think about what happens if you actually have unobserved confounding. So there, you might not actually know the true propensity scores, because there are unobserved confounders that you don't observe. And you can think about trying to bound how wrong your estimator can be as a function of how much you don't know this quantity. And they show that if you think about having some backup strategy-- if your goal is to find a new policy which performs as well as possible with respect to an old policy-- then this gives you a really elegant framework for a robust optimization, even taking into consideration the fact that there might be unobserved confounding. And that works also in this framework.

So I'm nearly done now.
I just want to finish with a thought: can we do the same thing for policies learned by reinforcement learning? So now that we've built up this language, let's return to the RL setting. And there, one can show that you can get a similar estimate for the value of a policy by summing over your observed sequences, and summing over the time steps of each sequence, of the reward observed at that time step, times a ratio of probabilities. That ratio runs from the first time step up to time little t: in the numerator, the probability that the policy would actually take the observed action at time t prime, given that you are in the observed state at time t prime, divided by-- and this is the analog of the propensity score, the probability under the data-generating process-- the probability of seeing that action given that you are in that state.

So if, as we discussed there, you had a deterministic policy, then this pi would just be a delta function. And so this estimator would only be looking at sequences where the precise sequence of actions taken is identical to the precise sequence of actions that the policy would have taken. And the difference here is that now, instead of having a single propensity score, one has a product of these propensity scores, corresponding to the propensity of observing that action given the corresponding state at each point along the sequence. And this is nice, because this gives you one way to do what's called off-policy evaluation.
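Reconstructed in notation, with mu denoting the data-generating (behavior) policy and (s_t, a_t, r_t) the observed states, actions, and rewards of each of n sequences, this importance sampling estimator reads (the exact indexing here is an assumption consistent with the spoken description):

\[
\hat{V}^{\mathrm{IS}}(\pi)
= \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T} r_t^{(i)}
\prod_{t'=0}^{t} \frac{\pi\big(a_{t'}^{(i)} \mid s_{t'}^{(i)}\big)}{\mu\big(a_{t'}^{(i)} \mid s_{t'}^{(i)}\big)}.
\]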
And this is an estimator which is completely analogous to the estimator that we got from Q-learning. So if all assumptions were correct and you had a lot of data, then those two should give you precisely the same answer. But here, as in the causal inference setting, we are not making the assumption that we can do covariate adjustment well. Or, said differently, we're not assuming that we can fit the Q function well. And this is now, just like there, based on the assumption that we have the ability to really accurately know what the propensity scores are. So it gives you an alternative approach to do evaluation. And you could think about looking at the robustness of your estimates from these two different estimators.

And this is the most naive of the estimators. There are many ways to try to make it better, such as by using doubly robust estimators. And if you want to learn more, I recommend reading the paper by Philip Thomas and Emma Brunskill in ICML 2016.

And with that, I want Barbra to come up and get set up, and we're going to transition to the next part of the lecture. Yes.

AUDIENCE: Why do we sum over t and take the product across all t?

DAVID SONTAG: One easy way to think about this is to suppose that you only had a reward at the last time step. If you only had a reward at the last time step, then you wouldn't have this sum over t, because the rewards in the earlier steps would be 0. You would just have that product going from 0 up to capital T, the last time step. The reason why you have the product up to little t at each time step is because one wants to be able to appropriately weigh the likelihood of seeing that reward at that point in time. One could rewrite this in other ways. I want to hold other questions, because this part of the lecture is going to be much more interesting than my part of the lecture.

And with that, I want to introduce Barbra. I first met Barbra when she invited me to give a talk in her class last year. She's an instructor at the Harvard School of Public Health. She recently finished her PhD in 2018, and her PhD looked at many questions related to the themes of the last couple of weeks. Since that time, in addition to continuing her research, she's been really leading the way in creating data science curriculum over at Harvard. So please take it away.

BARBRA DICKERMAN: Thank you so much for the introduction, David.
I'm very happy to be here to share some of my work on evaluating dynamic treatment strategies, which you've been talking about over the past few lectures. So, my goals for today: I'm just going to breeze over defining dynamic treatment strategies, as you're already familiar with them, but I would like to touch on when we need a special class of methods called g-methods. And then we'll talk about two different applications-- different analyses-- that have focused on evaluating dynamic treatment strategies. The first will be an application of the parametric g-formula, which is a powerful g-method, to cancer research. And so the goal here is to give you my causal inference perspective on how we think about this task of sequential decision making. And then, with whatever time remains, we'll be discussing a recent publication on the AI clinician, to talk through the reinforcement learning perspective. So I think it'll be a really interesting discussion, where we can share these perspectives and talk about the relative strengths and limitations as well. And please stop me if you have any questions.

So, you already know this: when it comes to treatment strategies, there are three main types. There are point interventions, happening at a single point in time. There are sustained interventions, happening over time; when it comes to clinical care, this is often what we're most interested in. Within that, there are static strategies, which are constant over time, and then there are dynamic strategies, which we're going to focus on. And these differ in that the intervention over time depends on evolving characteristics. So, for example: initiate treatment at baseline and continue it over follow-up until a contraindication occurs, at which point you may stop treatment and decide with your doctor whether you're going to switch to an alternate treatment. You would still be adhering to that strategy, even though you quit.
The comparison here being: do not initiate treatment over follow-up, likewise unless an indication occurs, at which point you may start treatment and still be adhering to the strategy. So we're focusing on these because they're the most clinically relevant.

Clinicians encounter these every day in practice. When they're making a recommendation to their patient about a prevention intervention, they're going to be taking into consideration the patient's evolving comorbidities. Or when they're deciding the next screening interval, they'll consider the previous result from the last screening test. Likewise for treatment: when deciding whether to keep the patient on treatment or not, is the patient having any changes in symptoms or lab values that may reflect toxicity?

So one thing to note is that while many of the strategies that you may see in clinical guidelines and in clinical practice are dynamic strategies, these may not be the optimal strategies. Maybe what we're recommending and doing is not optimal for patients. However, the optimal strategies will be dynamic in some way, in that they will be adapting to individuals' unique and evolving characteristics. So that's why we care about them.

So, what's the problem? One problem deals with something called treatment-confounder feedback, which you may have spoken about in this class. Conventional statistical methods cannot appropriately compare dynamic treatment strategies in the presence of treatment-confounder feedback. This is when time-varying confounders are affected by previous treatment. So if we ground this in a concrete example with this causal diagram: let's say we're interested in estimating the effect of some intervention A-- vasopressors, or it could be IV fluids-- on some outcome Y, which we'll call survival here. We know that vasopressors affect blood pressure, and blood pressure will affect subsequent decisions to treat with vasopressors.
We also know that hypotension-- so again, blood pressure, L1-- affects survival, based on our clinical knowledge. And then in this DAG we also have the node U, which represents disease severity. So these could be potentially unmeasured markers of disease severity that are affecting your blood pressure and also affecting your probability of survival.

So if we're interested in estimating the effect of a sustained treatment strategy, then we want to know something about the total effect of treatment at all time points. We can see that L1 here is a confounder for the effect of A1 on Y, so we have to do something to adjust for that. And if we were to apply a conventional statistical method, we would essentially be conditioning on a collider and inducing a selection bias-- opening a path from A0 to L1 to U to Y. What's the consequence of this? If we look in our data set, we may see an association between A and Y. But that association is not necessarily because there's an effect of A on Y; it might not be causal. It may be due to this selection bias that we created.

So this is the problem. And in these cases, we need a special type of method that can handle these settings. A class of methods that was designed specifically to handle this is g-methods. These are sometimes referred to as causal methods. They've been developed by Jamie Robins and colleagues and collaborators since 1986, and they include the parametric g-formula, g-estimation of structural nested models, and inverse probability weighting of marginal structural models.

So in my research, what I do is combine g-methods with large longitudinal databases to try to evaluate dynamic treatment strategies. I'm particularly interested in bringing these methods to cancer research, because they haven't been applied much there. So a lot of my research questions are focused on answering questions like: how and when can we intervene to best prevent, detect, and treat cancer?
775 00:42:23,860 --> 00:42:28,370 And so I'd like to share one example with you, which 776 00:42:28,370 --> 00:42:32,480 focused on evaluating the effect of adhering 777 00:42:32,480 --> 00:42:34,940 to guideline-based physical activity 778 00:42:34,940 --> 00:42:39,870 interventions on survival among men with prostate cancer. 779 00:42:39,870 --> 00:42:41,390 So the motivation for this study: 780 00:42:41,390 --> 00:42:43,910 a large clinical organization, ASCO, 781 00:42:43,910 --> 00:42:46,160 the American Society of Clinical Oncology, 782 00:42:46,160 --> 00:42:48,680 had actually called for randomized trials 783 00:42:48,680 --> 00:42:52,720 to generate these estimates for several cancers. 784 00:42:52,720 --> 00:42:54,200 The thing with prostate cancer is 785 00:42:54,200 --> 00:42:56,580 it's a very slowly progressing disease. 786 00:42:56,580 --> 00:42:59,840 So the feasibility of doing a trial to evaluate this 787 00:42:59,840 --> 00:43:01,040 is very limited. 788 00:43:01,040 --> 00:43:04,370 The trial would probably have to be 10 years long. 789 00:43:04,370 --> 00:43:08,390 So given the absence of this randomized evidence, 790 00:43:08,390 --> 00:43:09,920 we did the next best thing that we 791 00:43:09,920 --> 00:43:12,380 could do to generate this estimate, which 792 00:43:12,380 --> 00:43:15,230 was combine high-quality observational data 793 00:43:15,230 --> 00:43:20,090 with advanced epidemiologic methods, in this case the parametric g-formula. 794 00:43:20,090 --> 00:43:22,730 And so we leveraged data from the Health Professionals 795 00:43:22,730 --> 00:43:25,430 Follow-up Study, which is a well-characterized prospective 796 00:43:25,430 --> 00:43:26,240 cohort study. 797 00:43:29,670 --> 00:43:32,530 So in these cases, there's a three-step process 798 00:43:32,530 --> 00:43:37,090 that we take to extract the most meaningful and actionable 799 00:43:37,090 --> 00:43:39,980 insights from observational data. 800 00:43:39,980 --> 00:43:41,650 So the first thing that we do is we 801 00:43:41,650 --> 00:43:44,740 specify the protocol of the target trial 802 00:43:44,740 --> 00:43:49,420 that we would have liked to conduct had it been feasible. 803 00:43:49,420 --> 00:43:51,340 The second thing we do is we make sure 804 00:43:51,340 --> 00:43:54,670 that we measure enough covariates to approximately 805 00:43:54,670 --> 00:43:57,280 adjust for confounding and achieve 806 00:43:57,280 --> 00:43:59,805 conditional exchangeability. 807 00:43:59,805 --> 00:44:01,180 And then the third thing we do is 808 00:44:01,180 --> 00:44:04,510 we apply an appropriate method to compare the specified 809 00:44:04,510 --> 00:44:07,360 treatment strategies under this assumption 810 00:44:07,360 --> 00:44:10,670 of conditional exchangeability. 811 00:44:10,670 --> 00:44:13,730 And so in this case, eligible men for this study 812 00:44:13,730 --> 00:44:17,430 had been diagnosed with non-metastatic prostate cancer. 813 00:44:17,430 --> 00:44:19,310 And at baseline, they were free of 814 00:44:19,310 --> 00:44:21,650 cardiovascular and neurologic conditions that 815 00:44:21,650 --> 00:44:24,320 may limit physical ability. 816 00:44:24,320 --> 00:44:26,030 For the treatment strategies, men 817 00:44:26,030 --> 00:44:29,150 were to initiate one of six physical activity 818 00:44:29,150 --> 00:44:33,410 strategies at diagnosis and continue it over follow-up 819 00:44:33,410 --> 00:44:36,620 until the development of a condition limiting 820 00:44:36,620 --> 00:44:38,010 physical activity.
821 00:44:38,010 --> 00:44:40,900 So this is what made the strategies dynamic. 822 00:44:40,900 --> 00:44:43,010 The intervention over time depended 823 00:44:43,010 --> 00:44:45,620 on these evolving conditions. 824 00:44:45,620 --> 00:44:48,530 And so just to note, we pre-specified 825 00:44:48,530 --> 00:44:51,670 these strategies that we were evaluating 826 00:44:51,670 --> 00:44:54,040 as well as the conditions. 827 00:44:54,040 --> 00:44:56,380 Men were followed from diagnosis 828 00:44:56,380 --> 00:44:59,793 until death, 10 years after diagnosis, 829 00:44:59,793 --> 00:45:01,210 or the administrative end of follow-up, 830 00:45:01,210 --> 00:45:02,970 whichever happened first. 831 00:45:02,970 --> 00:45:05,140 Our outcome of interest was all-cause mortality 832 00:45:05,140 --> 00:45:07,000 within 10 years. 833 00:45:07,000 --> 00:45:10,000 And we were interested in estimating the per-protocol 834 00:45:10,000 --> 00:45:12,670 effect of not just initiating these strategies 835 00:45:12,670 --> 00:45:15,200 but adhering to them over follow-up. 836 00:45:15,200 --> 00:45:19,615 And again, we applied the parametric g-formula. 837 00:45:19,615 --> 00:45:21,740 So I think you've already heard about the g-formula 838 00:45:21,740 --> 00:45:24,720 in a previous lecture, possibly in a slightly different way. 839 00:45:24,720 --> 00:45:26,850 So I won't spend too much time on this. 840 00:45:26,850 --> 00:45:30,380 So the g-formula, essentially the way I think about it, 841 00:45:30,380 --> 00:45:33,200 is a generalization of standardization 842 00:45:33,200 --> 00:45:36,380 to time-varying exposures and confounders. 843 00:45:36,380 --> 00:45:38,360 So it's basically a weighted average 844 00:45:38,360 --> 00:45:41,120 of risks, where you can think of the weights as being 845 00:45:41,120 --> 00:45:43,910 the probability density functions of the time-varying 846 00:45:43,910 --> 00:45:47,390 confounders, which we estimate using parametric regression 847 00:45:47,390 --> 00:45:48,350 models. 848 00:45:48,350 --> 00:45:50,090 And we approximate the weighted average 849 00:45:50,090 --> 00:45:54,110 using Monte Carlo simulation. 850 00:45:54,110 --> 00:45:56,840 So practically, how do we do this? 851 00:45:56,840 --> 00:45:59,560 So the first thing we do is we fit parametric regression 852 00:45:59,560 --> 00:46:02,020 models for all of the variables that we're 853 00:46:02,020 --> 00:46:03,460 going to be studying. 854 00:46:03,460 --> 00:46:08,690 So for treatment, confounders, and death at each follow-up time. 855 00:46:08,690 --> 00:46:10,810 The next thing we do is Monte Carlo simulation, 856 00:46:10,810 --> 00:46:12,310 where essentially what we want to do 857 00:46:12,310 --> 00:46:15,880 is simulate the outcome distribution 858 00:46:15,880 --> 00:46:21,140 under each treatment strategy that we're interested in. 859 00:46:21,140 --> 00:46:25,100 And then we bootstrap the confidence intervals. 860 00:46:25,100 --> 00:46:27,495 So I'd like to show you in a schematic what 861 00:46:27,495 --> 00:46:28,870 this looks like, because it might 862 00:46:28,870 --> 00:46:31,040 be a little bit easier to see. 863 00:46:31,040 --> 00:46:32,490 So again, the idea is we're going 864 00:46:32,490 --> 00:46:36,730 to make copies of our data set, where in each copy 865 00:46:36,730 --> 00:46:39,490 everyone is adhering to the strategy 866 00:46:39,490 --> 00:46:42,070 that we're focusing on in that copy.
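(For reference, one standard way to write the quantity being described, in the usual g-formula notation where overbars denote history through time t, f is the conditional density of the confounders, and g is the strategy assigning treatment as a function of covariate history:)

\[
\Pr[Y^{g}=1] \;=\; \sum_{\bar{l}} \Pr\!\big[Y=1 \mid \bar{L}=\bar{l},\ \bar{A}=g(\bar{l})\big]\ \prod_{t=0}^{T} f\!\big(l_t \mid \bar{l}_{t-1},\ \bar{a}_{t-1}=g(\bar{l}_{t-1})\big)
\]

The sum over all confounder histories is intractable to enumerate directly, which is why it is approximated by the Monte Carlo copies described next.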
867 00:46:42,070 --> 00:46:45,650 So how do we construct each of these copies of the data set? 868 00:46:45,650 --> 00:46:48,350 We have to build them each from the ground up, 869 00:46:48,350 --> 00:46:50,290 starting with time 0. 870 00:46:50,290 --> 00:46:54,580 So the values of all of the time-varying covariates at time 0 871 00:46:54,580 --> 00:46:57,320 are sampled from their empirical distribution. 872 00:46:57,320 --> 00:47:01,780 So these are actually observed values of the covariates. 873 00:47:01,780 --> 00:47:05,590 How do we get the values at the next time point? 874 00:47:05,590 --> 00:47:07,900 We use the parametric regression models 875 00:47:07,900 --> 00:47:12,040 that I mentioned that we fit in step 1. 876 00:47:12,040 --> 00:47:16,900 Then what we do is we force the level of the intervention 877 00:47:16,900 --> 00:47:20,920 variable to be whatever was specified by that intervention 878 00:47:20,920 --> 00:47:23,320 strategy. 879 00:47:23,320 --> 00:47:26,260 And then we estimate the risk of the outcome 880 00:47:26,260 --> 00:47:29,890 at each time period given these variables, 881 00:47:29,890 --> 00:47:31,540 again using the parametric regression 882 00:47:31,540 --> 00:47:33,520 model for the outcome now. 883 00:47:33,520 --> 00:47:36,070 And so we repeat this over all time periods 884 00:47:36,070 --> 00:47:41,110 to estimate a cumulative risk under that strategy, which 885 00:47:41,110 --> 00:47:45,650 is taken as the average of the subject-specific risks. 886 00:47:45,650 --> 00:47:46,750 So this is what I'm doing. 887 00:47:46,750 --> 00:47:48,292 This is what's going on 888 00:47:48,292 --> 00:47:49,630 under the hood with this method. 889 00:47:49,630 --> 00:47:51,130 DAVID SONTAG: So maybe we should try 890 00:47:51,130 --> 00:47:53,890 to put that in the language of what we saw in the class. 891 00:47:53,890 --> 00:47:57,770 And let me know if I'm getting this wrong. 892 00:47:57,770 --> 00:48:02,410 So you first estimate the Markov decision process, 893 00:48:02,410 --> 00:48:07,160 which allows you to simulate from the underlying data 894 00:48:07,160 --> 00:48:08,020 distribution. 895 00:48:08,020 --> 00:48:11,350 So you know the probability of the next sequence 896 00:48:11,350 --> 00:48:15,820 of observations, given the previous observations 897 00:48:15,820 --> 00:48:18,550 and actions, and then with that, 898 00:48:18,550 --> 00:48:21,930 you could intervene and simulate forward. 899 00:48:21,930 --> 00:48:23,710 Because, if you remember, 900 00:48:23,710 --> 00:48:26,110 Frederick gave you three different buckets 901 00:48:26,110 --> 00:48:28,040 of approaches. 902 00:48:28,040 --> 00:48:29,540 Then he focused on the middle one. 903 00:48:29,540 --> 00:48:31,180 This is the left-most bucket. 904 00:48:31,180 --> 00:48:31,710 Right? 905 00:48:31,710 --> 00:48:32,952 AUDIENCE: Yes. 906 00:48:32,952 --> 00:48:34,660 DAVID SONTAG: So we didn't talk about it. 907 00:48:34,660 --> 00:48:36,810 AUDIENCE: No, [INAUDIBLE] model-based reinforcement learning. 908 00:48:36,810 --> 00:48:37,130 BARBRA DICKERMAN: Yeah. 909 00:48:37,130 --> 00:48:38,020 Yes. 910 00:48:38,020 --> 00:48:40,905 DAVID SONTAG: But it's very sensible. 911 00:48:40,905 --> 00:48:41,530 AUDIENCE: Yeah. 912 00:48:41,530 --> 00:48:43,970 But it seems very hard. 913 00:48:43,970 --> 00:48:45,220 BARBRA DICKERMAN: What's that? 914 00:48:45,220 --> 00:48:46,080 AUDIENCE: Sorry.
915 00:48:46,080 --> 00:48:49,012 Oh, it seems very hard to model this [INAUDIBLE]. 916 00:48:49,012 --> 00:48:49,970 BARBRA DICKERMAN: Yeah. 917 00:48:49,970 --> 00:48:51,150 So that is a challenge. 918 00:48:51,150 --> 00:48:53,370 That is the hardest part about this. 919 00:48:53,370 --> 00:48:55,730 And it's relying on a lot of assumptions, yeah. 920 00:48:59,530 --> 00:49:02,050 So these are the primary results that 921 00:49:02,050 --> 00:49:04,640 come out after we do all of this. 922 00:49:04,640 --> 00:49:07,720 So this is the estimated risk of all-cause mortality 923 00:49:07,720 --> 00:49:10,780 under several physical activity interventions. 924 00:49:10,780 --> 00:49:13,390 So I'm not going to focus too much on the results. 925 00:49:13,390 --> 00:49:17,120 I want to focus on two main takeaways from this slide. 926 00:49:17,120 --> 00:49:20,680 One thing to emphasize is we pre-specified 927 00:49:20,680 --> 00:49:23,450 the weekly duration of physical activity. 928 00:49:23,450 --> 00:49:26,200 Or you can think of this as the dose of the intervention. 929 00:49:26,200 --> 00:49:27,850 We pre-specified that. 930 00:49:27,850 --> 00:49:30,730 And this was based on current guidelines. 931 00:49:30,730 --> 00:49:32,830 So in the third row of each band, we 932 00:49:32,830 --> 00:49:36,610 did look at a dose or level beyond the guidelines 933 00:49:36,610 --> 00:49:40,060 to see if there might be additional survival benefits. 934 00:49:40,060 --> 00:49:41,930 But these were all pre-specified. 935 00:49:41,930 --> 00:49:45,430 We also pre-specified all of the time-varying covariates 936 00:49:45,430 --> 00:49:47,890 that made these strategies dynamic. 937 00:49:47,890 --> 00:49:49,780 So I mentioned that men were excused 938 00:49:49,780 --> 00:49:52,210 from following the recommended physical activity 939 00:49:52,210 --> 00:49:56,140 levels if they developed one of these listed conditions: 940 00:49:56,140 --> 00:49:59,470 metastasis, MI, stroke, et cetera. 941 00:49:59,470 --> 00:50:01,060 We pre-specified all of those. 942 00:50:01,060 --> 00:50:04,828 It's possible that a different dependence 943 00:50:04,828 --> 00:50:06,370 on a different time-varying covariate 944 00:50:06,370 --> 00:50:08,860 may have led to a better strategy. 945 00:50:08,860 --> 00:50:10,870 There was a lot that remained unexplored. 946 00:50:13,560 --> 00:50:16,830 So we did a lot of sensitivity analyses 947 00:50:16,830 --> 00:50:19,500 as part of this project. 948 00:50:19,500 --> 00:50:21,930 I'd like to focus, though, on the sensitivity analyses 949 00:50:21,930 --> 00:50:25,200 that we did for potential unmeasured confounding 950 00:50:25,200 --> 00:50:28,680 by chronic disease that may be severe enough 951 00:50:28,680 --> 00:50:33,280 to affect both physical activity and survival. 952 00:50:33,280 --> 00:50:36,870 And so the g-formula actually provides a natural way 953 00:50:36,870 --> 00:50:40,110 to at least partly address this, by estimating 954 00:50:40,110 --> 00:50:44,900 the risk under these physical activity interventions that 955 00:50:44,900 --> 00:50:47,750 are, at each time point t, only applied 956 00:50:47,750 --> 00:50:51,650 to men who are healthy enough to maintain a physical activity 957 00:50:51,650 --> 00:50:53,653 level at that time.
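(To make the simulation steps just described concrete, here is a minimal schematic of the Monte Carlo step of the parametric g-formula in Python. It is a sketch under strong simplifying assumptions (a single binary time-varying confounder, no censoring, and already-fitted models), and the names g_formula_risk, cov_model, and outcome_model are hypothetical, not the study's actual code.)

import numpy as np

def g_formula_risk(L0, cov_model, outcome_model, strategy, T, rng):
    # L0: baseline confounder values sampled from the empirical distribution
    # cov_model: fitted model for P(L_t = 1 | L_{t-1}, A_{t-1}), from step 1
    # outcome_model: fitted model for P(death at t | L_t, A_t), also from step 1
    #   (both assumed to expose a scikit-learn-style predict_proba)
    # strategy: function mapping the current covariate to the forced treatment
    # T: number of follow-up periods
    n = len(L0)
    L = np.asarray(L0, dtype=float)
    A = strategy(L)                  # force treatment to follow the strategy
    surv = np.ones(n)                # probability of still being event-free
    cum_risk = np.zeros(n)
    for t in range(T):
        # discrete-time risk of the outcome in this period, from the outcome model
        h = outcome_model.predict_proba(np.column_stack([L, A]))[:, 1]
        cum_risk += surv * h
        surv *= 1.0 - h
        # simulate the next covariate value from its fitted model
        p_next = cov_model.predict_proba(np.column_stack([L, A]))[:, 1]
        L = rng.binomial(1, p_next).astype(float)
        A = strategy(L)              # re-apply the dynamic strategy
    # cumulative risk under the strategy: average of subject-specific risks
    return float(cum_risk.mean())

# Example of a dynamic strategy: exercise (A = 1) unless a limiting
# condition (L = 1) has developed, in which case the man is excused.
exercise_unless_limited = lambda L: np.where(L == 1, 0.0, 1.0)

Bootstrapping the confidence intervals then amounts to repeating the model fitting and this simulation on resampled data sets.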
958 00:50:53,653 --> 00:50:55,070 And so again, in the main analysis, 959 00:50:55,070 --> 00:50:58,400 we excused men from following the recommended levels 960 00:50:58,400 --> 00:51:03,020 if they developed one of these serious conditions. 961 00:51:03,020 --> 00:51:05,180 So in sensitivity analyses, we then 962 00:51:05,180 --> 00:51:08,180 expanded this list of serious conditions 963 00:51:08,180 --> 00:51:12,590 to also include the conditions that are shown in blue text. 964 00:51:12,590 --> 00:51:14,490 And so this attenuated our estimates 965 00:51:14,490 --> 00:51:17,120 but didn't change our conclusions. 966 00:51:17,120 --> 00:51:21,620 One thing to point out is that the validity of this approach 967 00:51:21,620 --> 00:51:25,070 rests on the assumption that at each time t 968 00:51:25,070 --> 00:51:30,350 we had the data needed to identify which 969 00:51:30,350 --> 00:51:32,600 men were healthy enough at that time 970 00:51:32,600 --> 00:51:33,940 to do the physical activity. 971 00:51:33,940 --> 00:51:34,440 Yeah. 972 00:51:34,440 --> 00:51:36,023 AUDIENCE: Sorry, just to double-check, 973 00:51:36,023 --> 00:51:37,735 does excuse mean that you remove them? 974 00:51:37,735 --> 00:51:39,110 BARBRA DICKERMAN: Great question. 975 00:51:39,110 --> 00:51:42,980 So the strategy was pre-specified to say 976 00:51:42,980 --> 00:51:45,950 that if you develop one of these conditions, 977 00:51:45,950 --> 00:51:50,090 you may essentially do whatever level of physical activity 978 00:51:50,090 --> 00:51:51,440 you're able to do. 979 00:51:51,440 --> 00:51:53,690 So importantly-- I'm glad you brought this up-- 980 00:51:53,690 --> 00:51:56,420 we did not censor men at that time. 981 00:51:56,420 --> 00:51:59,000 They were still followed, because they were still 982 00:51:59,000 --> 00:52:02,330 adhering to the strategy as defined. 983 00:52:02,330 --> 00:52:05,060 Thanks for asking. 984 00:52:05,060 --> 00:52:09,290 And so given that we don't know whether the data contain, 985 00:52:09,290 --> 00:52:13,290 at each time t, the information necessary to know 986 00:52:13,290 --> 00:52:16,070 whether these men were healthy enough at that time, we 987 00:52:16,070 --> 00:52:18,800 conducted a few alternate analyses in which we 988 00:52:18,800 --> 00:52:22,880 lagged physical activity and covariate data by two years. 989 00:52:22,880 --> 00:52:25,580 And we also used a negative outcome control 990 00:52:25,580 --> 00:52:29,810 to explore potential unmeasured confounding by clinical disease 991 00:52:29,810 --> 00:52:31,940 or disease severity. 992 00:52:31,940 --> 00:52:33,440 So what's the rationale behind this? 993 00:52:33,440 --> 00:52:36,770 So in the DAGs below for the original analysis, 994 00:52:36,770 --> 00:52:41,120 we have physical activity A. We have survival Y. 995 00:52:41,120 --> 00:52:45,590 And this may be confounded by disease severity U. 996 00:52:45,590 --> 00:52:49,250 So when we see an association between A and Y in our data, 997 00:52:49,250 --> 00:52:51,070 we want to make sure that it's causal, 998 00:52:51,070 --> 00:52:53,000 that it's because of the blue arrow, 999 00:52:53,000 --> 00:52:55,280 and not because of this confounding bias, 1000 00:52:55,280 --> 00:52:56,640 the red arrow. 1001 00:52:56,640 --> 00:52:58,610 So how can we potentially provide 1002 00:52:58,610 --> 00:53:02,480 evidence for whether that red pathway is there?
1003 00:53:02,480 --> 00:53:05,000 We selected questionnaire nonresponse 1004 00:53:05,000 --> 00:53:08,750 as an alternate outcome, instead of survival, 1005 00:53:08,750 --> 00:53:13,940 that we assumed was not directly affected by physical activity, 1006 00:53:13,940 --> 00:53:16,820 but that we thought would be similarly confounded 1007 00:53:16,820 --> 00:53:19,230 by disease severity. 1008 00:53:19,230 --> 00:53:20,870 And so when we repeated the analysis 1009 00:53:20,870 --> 00:53:23,270 with the negative outcome control, we 1010 00:53:23,270 --> 00:53:26,000 found that physical activity had a nearly null effect 1011 00:53:26,000 --> 00:53:28,940 on questionnaire nonresponse, as we would expect, 1012 00:53:28,940 --> 00:53:34,353 which provides some support that in our original analysis, 1013 00:53:34,353 --> 00:53:36,020 the effect of physical activity on death 1014 00:53:36,020 --> 00:53:39,380 was not confounded through the pathways explored 1015 00:53:39,380 --> 00:53:41,868 by the negative control. 1016 00:53:41,868 --> 00:53:43,910 So one thing to highlight here is that the sensitivity 1017 00:53:43,910 --> 00:53:47,820 analyses were driven by our subject matter knowledge. 1018 00:53:47,820 --> 00:53:51,140 There was nothing in the data that drove this. 1019 00:53:53,700 --> 00:53:55,980 And so just to recap this portion: 1020 00:53:55,980 --> 00:53:59,160 g-methods are a useful tool, because they 1021 00:53:59,160 --> 00:54:01,710 let us validly estimate the effect 1022 00:54:01,710 --> 00:54:05,490 of pre-specified dynamic strategies 1023 00:54:05,490 --> 00:54:08,460 and estimate adjusted absolute risks, which are clinically 1024 00:54:08,460 --> 00:54:11,520 meaningful to us, and appropriately adjusted survival 1025 00:54:11,520 --> 00:54:14,370 curves, even in the presence of treatment-confounder 1026 00:54:14,370 --> 00:54:19,770 feedback, which occurs often in clinical questions. 1027 00:54:19,770 --> 00:54:23,100 And of course, this is under our typical identifiability 1028 00:54:23,100 --> 00:54:25,020 assumptions. 1029 00:54:25,020 --> 00:54:26,700 So this makes it a powerful approach 1030 00:54:26,700 --> 00:54:29,070 to estimate the effects of currently recommended 1031 00:54:29,070 --> 00:54:31,320 or proposed strategies, which we can 1032 00:54:31,320 --> 00:54:36,000 therefore specify and write out precisely, as we did here. 1033 00:54:36,000 --> 00:54:38,280 However, these pre-specified strategies 1034 00:54:38,280 --> 00:54:41,740 may not be the optimal strategies. 1035 00:54:41,740 --> 00:54:44,310 So again, when I was doing this analysis, 1036 00:54:44,310 --> 00:54:47,790 I was thinking there are so many different weekly durations 1037 00:54:47,790 --> 00:54:50,320 of physical activity that we're not looking at. 1038 00:54:50,320 --> 00:54:53,550 There are so many different time-varying covariates 1039 00:54:53,550 --> 00:54:56,430 that these strategies could have 1040 00:54:56,430 --> 00:54:58,080 depended on differently over time. 1041 00:54:58,080 --> 00:55:00,960 And maybe those would have led to better survival 1042 00:55:00,960 --> 00:55:05,960 outcomes among these men, but all of that was unexplored.
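(Finally, the negative outcome control check described above is, mechanically, just a rerun of the same pipeline with the outcome swapped. A sketch, reusing the hypothetical g_formula_risk from earlier; here nonresponse_model is a hypothetical outcome model fit to questionnaire nonresponse instead of death, and L0, cov_model, T, and rng are as before:)

# All names here are hypothetical and reuse the sketch above.
never_active = lambda L: np.zeros_like(L)

rd_negative_control = (
    g_formula_risk(L0, cov_model, nonresponse_model, exercise_unless_limited, T, rng)
    - g_formula_risk(L0, cov_model, nonresponse_model, never_active, T, rng)
)
# Physical activity is assumed not to affect nonresponse, so a risk
# difference near zero supports (but cannot prove) that the original
# estimate was not driven by unmeasured confounding on this pathway.
print(f"negative-control risk difference: {rd_negative_control:+.3f}")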