PROFESSOR: Hi, everyone. We're getting started now. So this week's lecture is really picking up where last week's left off. You may remember we spent last week talking about causal inference. And I told you how, for last week, we were going to focus on a single-time-step setting. Well, as we know, a lot of medicine has to do with multiple sequential decisions across time. And that'll be the focus of this whole week's worth of discussions. And as I thought about what I should really teach in this lecture, I realized that the person who knew the most about this topic in the general area of medicine was in fact a postdoctoral researcher in my lab.

FREDRIK D. JOHANSSON: Thanks. I'll take it.

AUDIENCE: Global [INAUDIBLE].

FREDRIK D. JOHANSSON: It's very fair.

PROFESSOR: And so I invited him to come today and give this as an invited lecture. This is Fredrik Johansson. He'll be a professor at Chalmers, in Sweden, starting in September.

FREDRIK D. JOHANSSON: Thank you so much, David. That's very generous. Yeah, so as David mentioned, last time we looked a lot at causal effects, and that's where we will start this discussion, too. So I'll just start with this reminder here-- we essentially introduced four quantities last time, or over the last two lectures. We had two potential outcomes, which represent the outcomes that we would see under each of the two treatment choices-- 1 and 0. We had a set of covariates, x, and a treatment, t. And we were interested in, essentially, what is the effect of this treatment, t, on the outcome, y, given the covariates, x. And the effect that we focused on that time was the conditional average treatment effect, which is exactly the difference between these potential outcomes, conditioned on the features.
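In symbols, using the notation from last week, the quantity just described is

$$\mathrm{CATE}(x) \;=\; \mathbb{E}\left[\,Y(1) - Y(0)\mid X = x\,\right],$$

the expected difference between the two potential outcomes among patients with covariates x.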
So the whole of last week was about trying to identify this quantity using various methods. And one question that didn't come up too much is: how do we use this quantity? We might be interested in it just in terms of its magnitude-- how large is the effect? But we might also be interested in designing a policy for how to treat our patients based on this quantity.

So today, we will focus on policies. What I mean by that, specifically, is something that takes as input what we know about a patient and produces a choice or an action as an output. Typically, we'll think of policies as depending on medical history-- perhaps which treatments the patient has received previously, or what state the patient is currently in. But we can also base a policy purely on the number we produced last time-- the conditional average treatment effect. One very natural policy is to say that pi of x is equal to the indicator function of whether this CATE is positive. So if the effect is positive, we treat the patient; if the effect is negative, we don't. And of course, "positive" here assumes that a higher outcome is a better one. But this is a very natural policy to consider. However, we can also think about much more complicated policies that are not based only on this number-- the quality of the outcome. We can think about policies that take into account legislation, or the cost of medication, or side effects. We're not going to do that today, but it's something you can keep in mind as we discuss these things.
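A minimal sketch of the CATE-threshold policy just described, assuming a fitted CATE estimator; the name `cate_model` and its scikit-learn-style predict method are placeholders, not anything prescribed in the lecture:

```python
import numpy as np

def treat_if_cate_positive(x, cate_model):
    """Threshold policy: pi(x) = 1 if the estimated CATE is positive, else 0.

    `cate_model` is assumed to be a fitted CATE estimator exposing a
    scikit-learn-style predict(); any estimator from last week's lectures
    could play this role. This is an illustration, not the lecture's code.
    """
    cate_hat = cate_model.predict(np.atleast_2d(x))  # estimated effect for x
    return int(cate_hat[0] > 0)                      # 1 = treat, 0 = don't treat
```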
So, as David mentioned, we should now move on from the one-step setting, where we have a single treatment acting at a single time and we only have to take the state of the patient into account once. We will move from that to the sequential setting, and my first example of such a setting is sepsis management.

Sepsis is a complication of an infection, and it can have disastrous consequences. It can lead to organ failure and ultimately death, and it's actually one of the leading causes of death in the ICU. So it's of course important that we can manage and treat this condition. When you start treating sepsis, the primary target-- the first thing you should think about fixing-- is the infection itself. If we don't treat the infection, things are going to keep being bad. But even if we figure out the right antibiotic to treat the infection that is the source of the septic shock or the septic inflammation, there are a lot of different conditions that we need to manage. The infection itself can lead to fever, breathing difficulties, low blood pressure, high heart rate-- all things that are symptoms rather than the cause itself, but that we still have to manage somehow so that the patient survives and is comfortable. So when I say sepsis management, I'm talking about managing such quantities over time-- over a patient's stay in the hospital.

So, last time-- again, just to really hammer this in-- we talked about potential outcomes and the choice of a single treatment. We can think about this in the sepsis setting as a patient coming in-- or, presumably, a patient already in the hospital-- presenting with breathing difficulties. That means their blood oxygen will be low, because they can't breathe on their own, and we might want to put them on mechanical ventilation so that we can ensure they get sufficient oxygen. We can view this as a single choice: should we put the patient on mechanical ventilation or not? But what we need to take into account here is what will happen after we make that choice. What will be the side effects of this choice going forward? Because we want to make sure that the patient is comfortable and in good health throughout their stay.
So today, we will move towards sequential decision making. In particular, what I alluded to just now is that decisions made in sequence may have the property that choices made early on rule out certain choices later, and we'll see an example of that very soon. We'll be interested in coming up with a policy for making decisions repeatedly that optimizes a given outcome-- something that we care about. It could be minimizing the risk of death. It could be a reward that says that the vitals of a patient are in the right range; we might want to optimize that. But essentially, think about it now as having the choice of administering a medication or an intervention at any time, t-- and wanting the best policy for doing so. OK, I'm going to skip that one.

OK, so I mentioned already one potential choice that we might want to make in the management of a septic patient, which is to put them on mechanical ventilation because they can't breathe on their own. A side effect of doing so is that they might suffer discomfort from being intubated-- the procedure is not painless, and it's not without discomfort. So something that you might have to do when putting them on mechanical ventilation is to sedate the patient. This is an action that is informed by the previous action, because if we hadn't put the patient on mechanical ventilation, maybe we wouldn't consider them for sedation.

When we sedate a patient, we run the risk of lowering their blood pressure, so we might need to manage that, too. If their blood pressure gets too low, maybe we need to administer vasopressors, which artificially raise the blood pressure, or fluids, or anything else that takes care of this issue. So just think of this as an example of choices cascading, in terms of their consequences, as we roll forward in time. Ultimately, we will reach the end of the patient's stay.
And hopefully, we will have managed the patient successfully, so that their response or their outcome is a good one. What I'm illustrating here is that, for any one patient in our hospitals or in the health care system, we will only ever observe one trajectory through these options. I will show this type of illustration many times, but I hope you can see the scope of the decision space here. Essentially, at any point, we can choose a different action, and the number of decisions that we make in an ICU setting, for example, is usually much larger than we could ever test in a randomized trial. Think of all of these different trajectories as different arms in a randomized controlled trial whose effects or outcomes you want to compare-- it's typically infeasible to run such a trial. So one of the big reasons that we are talking about reinforcement learning today, and about learning policies rather than causal effects in the setup we used last week, is that the space of possible action trajectories is so large.

Having said that, we now turn to trying to find, essentially, the policy that picks the orange path here-- the one that leads to a good outcome. And to reason about such a thing, we also need to reason about what a good outcome is. What is a good reward for our agent as it proceeds through time and makes choices? Some policies that we produce as machine learners might not be appropriate for a health care setting; we have to somehow restrict ourselves to something that's realistic. I won't focus very much on that today. It's something that will come up in the discussion tomorrow, hopefully. And the notion of evaluating something for use in the health care system will also be talked about tomorrow.

AUDIENCE: Thursday.

FREDRIK D. JOHANSSON: Sorry, Thursday. Next time. OK, so I'll start by just briefly mentioning some success stories.
And these are not from the health care setting, as you can guess from the pictures. How many of you have seen some of these pictures? OK, great-- almost everyone. Yeah, so these are from various video games-- well, games, anyhow. And they are good examples of when reinforcement learning works, essentially. That's why I use them on this slide: it's very hard to argue that the program that eventually beat Lee Sedol-- I think that's in this picture-- and, later, other Go champions, in the AlphaGo picture in the top left, is not doing a good job, because it clearly beat humans here. But one of the things I want you to keep in mind throughout this talk is: what is different between these kinds of scenarios and the health care setting? We'll come back to that later. I simply added another example here: there was recently one that's a little bit closer to my heart, which is AlphaStar. I play StarCraft. I like StarCraft, so it should be on the slide. Anyway, let's move on.

Broadly speaking, these systems can be summarized in the following picture. What goes into them? There's a lot more nuance when it comes to something like Go, but for the purpose of this class, we will summarize them with a single slide. Essentially, one of the three quantities that matter for reinforcement learning is the state of the environment-- the state of the game, the state of the patient-- the state of the thing that we want to optimize. In this case, I've chosen tic-tac-toe. We have a state, which represents the current positions of the circles and crosses. And given that state of the game, my job as a player is to choose one of the possible actions-- one of the free squares to put my cross in.
So I'm the blue player here, and I can consider these five choices for where to put my next cross. Each of those will lead me to a new state of the game. If I put my cross over here, that means I'm now in this box, and I have a new set of actions available to me for the next round, depending on what the red player does. So we have the state, we have the actions, and we have the next state-- essentially, we have a trajectory, or a transition between states. And the last quantity that we need is the notion of a reward. That's very important for reinforcement learning, because it is what drives the learning itself: we strive to optimize the reward, or the outcome, of something. So if we look at the action farthest to the right here, I have essentially left myself open to an attack by the red player, because I didn't put my cross there. Which means that, probably, if the red player is decent, he will put his circle here and I will incur a loss. So my reward will be negative, if we take positive to be good. And this is something that I can learn from going forward: essentially, what I want to avoid is ending up in the state shown in the bottom right here. That is the basic idea of reinforcement learning, for video games and for anything else.

So if we take this board-game analogy and move to the health care setting, we can think of the state of a patient as the game board, or the state of the game. We will always call this St in this talk. The treatments or interventions that we prescribe will be At; these are like the actions in the game, obviously. The outcomes of a patient-- it could be mortality, it could be keeping vitals in range-- play the role of the rewards in the game, like having lost or won. And then, at the end here: what could possibly go wrong? Well, as I alluded to before, health is not a game in the same sense that a video game is a game, but they share a lot of mathematical structure. That's why I make the analogy here.
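One way to make this mapping concrete: a patient's record, in this view, is just an ordered sequence of (state, action, reward) steps. A minimal sketch with made-up values; the field names are illustrative, not taken from any slide:

```python
from typing import Any, Dict, List, NamedTuple

class Step(NamedTuple):
    state: Dict[str, Any]   # S_t: what we know about the patient at time t
    action: int             # A_t: e.g., 1 = intervene, 0 = do nothing
    reward: float           # R_t: e.g., higher when vitals are in range

# One hypothetical patient trajectory, ordered in time.
trajectory: List[Step] = [
    Step(state={"spo2": 88, "on_vent": 0}, action=1, reward=-0.1),
    Step(state={"spo2": 95, "on_vent": 1}, action=0, reward=1.0),
]
```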
These quantities-- S, A, and R-- will form something called a decision process, and that's what we'll talk about next. This is the outline for today and Thursday. I won't get to all of it today, but these are the topics we're considering. So a decision process is essentially the model of the world that describes the data we have access to, or the world in which we're managing our agent.

Very often, if you've ever seen reinforcement learning taught, you have seen this picture in some form. Sometimes there's a mouse and some cheese and there are other things going on, but you know what I'm talking about. The same basic components are always there. There's the concept of an agent-- let's think "doctor" for now-- that takes actions repeatedly over time. So this t here indicates an index of time, and we see it increasing as we go around this loop-- we move forward in time. The agent takes an action and, at each time point, receives a reward for that action; that is Rt, as I said before. The environment is responsible for giving that reward. So for example, if I'm the doctor, I'm the agent; I make an action or an intervention on my patient, and the patient is the environment that responds to my intervention. The state here is, for example, the state of the patient, as I mentioned before. But it might also be a state broader than the patient-- like the settings of the machine that they're attached to, or the availability of certain drugs in the hospital, or something like that. So we can think a little bit more broadly than just the patient, too. I said "partially observed" here, in that I might not actually know everything about the patient that's relevant to me, and we will come back to that a little bit later.
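The loop in that picture can be sketched as code. This is a generic sketch: `env` and `policy` are placeholders for whatever environment and decision rule one has, and the reset/step interface is an assumption, not an API from the lecture:

```python
def run_episode(env, policy, max_steps=100):
    """Roll out one episode of the agent-environment loop.

    Assumes `env.reset()` returns an initial state and `env.step(action)`
    returns (next_state, reward, done); `policy` maps a state to an action.
    """
    state = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)                  # agent acts on the current state
        state, reward, done = env.step(action)  # environment responds with S_{t+1}, R_t
        total_reward += reward                  # accumulate the return
        if done:                                # e.g., end of the patient's stay
            break
    return total_reward
```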
So there are two closely related formalizations: one where you know everything relevant about the state, s, and one where you don't. For most of this talk, we will focus on the case where we know everything that is relevant about the environment.

OK, to make this all a bit more concrete, I'll return to the picture that I showed you before, but now put it in the context of the paper that you read. Was that the compulsory one? The mechanical ventilation one? OK, great. So in this case, they had an interesting reward structure. The thing that they were trying to optimize was a reward related to the vitals of the patient, but also to whether the patient was kept on mechanical ventilation or not. The idea of this paper is that you don't want to keep a patient on mechanical ventilation unnecessarily, because it has the side effects that we talked about before. So at any point in time, essentially, we can think about taking a patient on or off the ventilator-- and also dealing with the sedatives that are prescribed to them.

In this example, the state that they considered included the demographic information of the patient, which doesn't really change over time; their physiological measurements; ventilator settings; consciousness level; the dosages of the sedatives they use, which could also be seen as an action, I suppose-- and a number of other things. These are the values that we have to keep track of moving forward in time. The actions, concretely, included whether to intubate or extubate the patient, as well as administering and dosing the sedatives.

So this is, again, an example of a so-called decision process. Essentially, the process is the distribution of the quantities that I've been talking about over time. We have the states, the actions, and the rewards; they all evolve over time, and the law of how that happens is the decision process.
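For the ventilation example above, one plausible encoding of a single time step might look like the following; every field name and value here is illustrative, not taken from the paper:

```python
# Illustrative only: one time step in the mechanical-ventilation setting.
state_t = {
    "age": 67,                     # demographics (roughly static over time)
    "heart_rate": 92, "spo2": 94,  # physiological measurements
    "fio2": 0.40, "peep": 5,       # ventilator settings
    "rass": -2,                    # consciousness / sedation level
    "propofol_dose": 20.0,         # current sedative dosage
}
action_t = {
    "ventilation": "keep_on",      # e.g., intubate / keep_on / extubate
    "sedative_dose": 25.0,         # next sedative dosing decision
}
reward_t = 1.0                     # e.g., vitals in range and off the ventilator
```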
I mentioned before that we will be talking about policies today. Typically, there's a distinction between what is called a behavior policy and a target policy-- there are different words for this. Essentially, the thing that we observe is usually called the behavior policy. By that I mean: if we go to a hospital and watch what's happening there at the moment, that is the behavior policy, and I will denote it mu. That is what we have to learn from, essentially.

Now, decision processes as defined so far are incredibly general-- I haven't said anything about what this distribution is like. But the absolutely dominant restriction that people make when they study decision processes is to look at Markov decision processes. These have a specific conditional independence structure that I will illustrate on the next slide-- well, I'll just define it mathematically here. It says, essentially, that all of the quantities that we care about-- the states (I guess that should say "state" on the slide), the rewards, and the actions-- depend only on the most recent state and action. If we observe an action taken by a doctor in the hospital, for example, then to make the Markov assumption is to say that this doctor did not look at anything that happened earlier in time, or at any information other than what is in the state variable that we observe at that time. That is the assumption that we make. Yeah?

AUDIENCE: Is that an assumption you can make for health care? Because in the end, you don't have access to the real state, but only to what's measured about the state in health care.

FREDRIK D. JOHANSSON: It's a very good question. The nice thing, in terms of inferring causal quantities, is that we only need the things that were used to make the decision in the first place-- the doctor can only act on such information, too. Unless we don't record everything that the doctor knows-- which is also the case. So that is something that we have to worry about, for sure.
Another way to lose information that is relevant here is, as I mentioned, if we look too-- what's the opposite of far?

AUDIENCE: Near.

FREDRIK D. JOHANSSON: --too near back in time, essentially, so that we don't look at the entire history of the patient. But when I say St here, it doesn't have to be the instantaneous snapshot of a patient; we can also include history there. Again, we'll come back to that a little later.

OK, so the Markov assumption essentially looks like this-- or this is how I will illustrate it, anyway. We have a sequence of states here that evolve over time. I'm allowing myself to put some dots here, because I don't want to draw forever, but you can think of this pattern repeating: the previous state goes into the next state, the action goes into the next state, and the action and state go into the reward. This is the world that we will live in for this lecture. Something that is not allowed under the Markov assumption is an edge like this, which says that an action at an early time influences an action at a later time-- specifically, it can't do so without passing through a state. It can very well have an influence on At through this trajectory here, but not directly. That is the Markov assumption in this case.

So you can see that, if I were to draw the graph of all the different measurements that we see, there are a lot of arrows that I could have had in this picture that I don't have. It may seem that the Markov assumption is a very strong one, but one way to make it more plausible is to include more things in your state, including summaries of the history, et cetera, as I mentioned before.
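Written out, the Markov assumption just illustrated is (standard MDP notation):

$$p(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots, S_0, A_0) \;=\; p(S_{t+1} \mid S_t, A_t),$$

with the reward drawn as $R_t \sim p(\cdot \mid S_t, A_t)$ and the action drawn from the policy as $A_t \sim \pi(\cdot \mid S_t)$. Given the current state and action, nothing earlier in the trajectory matters.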
An even stronger restriction of decision processes is to assume that the states at different times are themselves independent. This goes by different names-- sometimes it's called the contextual bandit setting-- but the "bandit" part of that name is not so relevant here, so let's not go into it too much. Essentially, what we are saying is that the state at a later time point is not influenced directly by the state at the previous time point, nor by the action at the previous time point. So if you remember what you did last week, this looks basically like T repetitions of the very simple graph that we had for estimating potential outcomes. And that is indeed mathematically equivalent, if we assume that the S here represents the state of a patient and all patients are drawn from the same process-- so that S0, S1, et cetera, up to ST are all i.i.d. draws from the same distribution. Then we have, essentially, a model for T different patients, each with a single time step or a single action, instead of the time steps being dependent in some way. So you can see, by going backwards through my slides, that this is essentially what we had last week, and we just have to add more arrows to get to what we have this week-- which indicates that last week was a special case of this, just as David said before. It also hints at the reinforcement learning problem being more complicated than the potential outcomes problem, and we'll see more examples of that later.

But, as with the causal effect estimation that we did last week, we're interested in the influence of just a few variables. Last time we studied the effect of a single treatment choice; in this case, we will study the influence of the various actions that we take along the way. That will be the goal. And it could be either through an immediate effect on the immediate reward, or through the impact that an action has on the state trajectory itself.

I have told you about the world that we now live in-- we have these Ss and As and Rs-- but I haven't told you much about the goal that we're trying to achieve, or the problem that we're trying to solve.
Most RL-- reinforcement learning-- is aimed at optimizing the value of a policy, or finding a policy that has a good return, a good sum of rewards. There are many names for this, but essentially it is a policy that does well. The notion of "well" that we will be using in this lecture is that of a return. The return at a time step t, following the policy pi that I had before, is the sum of the future rewards that we would see if we were to act according to that policy. So essentially, I stop now and ask: if I keep doing the same as I've done through my whole life-- maybe that was a good policy, I don't know-- and keep going until the end of time, how well will I do? What is the sum of the rewards that I get? That's the return.

The value is the expectation of such returns. So if I'm not the only person, but there is a whole population of us, the expectation over that population is the value of the policy. If we take patients as a better analogy than my life, maybe, the expectation is over patients: if we act on every patient in our population the same way-- according to the same policy, that is-- what is the expected return over those patients?

So as an example, I drew a few trajectories again, because I like drawing. We can think about three different patients here. They start in different states, and they will have different action trajectories as a result. We're treating them with the same policy-- let's call it pi-- but because they're in different states, they will have different actions at the same times. So here we take a 0 action and we go down; here, we take a 0 action and we go down-- that's what that means here. The specifics of this are not so important. But what I want you to pay attention to is that, after each action, we get a reward, and at the end, we can sum those up-- that's our return. So each patient has one value for their own trajectory, and the value of the policy is then the average value over such trajectories.
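In symbols, with no discounting (matching the example that follows), the return from time t and the value of a policy $\pi$ are

$$G_t \;=\; \sum_{k=t}^{T} R_k, \qquad V^{\pi} \;=\; \mathbb{E}_{\pi}\left[\,G_0\,\right],$$

where the expectation is over initial states (patients) and over the transitions that result from following $\pi$. Many treatments also put a discount factor $\gamma^{k-t}$ inside the sum; the example below sets the discount aside.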
So that is what we're trying to optimize. We now have a notion of "good," and we want to find a pi such that V pi up there is as good as possible. That's the goal.

So I think it's time for a bit of an example here. I want you to play along in a second-- you're going to solve this problem. It's not a hard one, so I think you'll manage. I think you'll be fine. This is now yet another example of a world to be in: the robot in a room. I've stolen this slide from David, who stole it from Peter Bodik. Yeah, so credit to him.

The rules of this world are the following. If you tell the robot, which is traversing this set of tiles here, to go up, there's a chance it doesn't go up but goes somewhere else-- so we have stochastic transitions, essentially. If I say up, it goes up with 0.8 probability, and somewhere else with the remaining probability: 0.8 up, and then 0.2 this way, since this is the only other possible direction to go in if you start here. There's a chance you move in the wrong direction, is what I'm trying to illustrate. There's no chance of going in the opposite direction-- so if I say right here, it can't go that way.

The rewards in this game are plus 1 in the green box up there and minus 1 in the box here. These are also terminal states. I haven't told you what that is, but it's essentially a state in which the game ends: once you get to either plus 1 or minus 1, the game is over. For each step that the robot takes, it incurs 0.04 negative reward. That says, essentially, that if you keep going for a long time, your reward will be bad-- the value of the policy will be bad. So you want to be efficient. Basically, you want to get to the green box; that's one part of it.
But you also want to do it quickly. So what I want you to do now is essentially figure out the best policy: in which direction should the arrow point in each of these different boxes? Fill in each question mark with an arrow pointing in some direction. We know the transitions are stochastic, so you might need to take that into account. But essentially, figure out: what policy gives me the biggest expected reward? And I'll ask you in a few minutes if one of you is brave enough to put it on the board, or something like that.

AUDIENCE: Do we discount over time?

FREDRIK D. JOHANSSON: There's no discount.

AUDIENCE: Can we talk to our neighbor?

FREDRIK D. JOHANSSON: Yes. It's encouraged.

[INTERPOSING VOICES]

FREDRIK D. JOHANSSON: So I had a question: what is the action space? Essentially, the action space is always up, down, left, or right, depending on whether there's a wall or not. So you can't go right here, for example.

AUDIENCE: You can't go left either.

FREDRIK D. JOHANSSON: You can't go left, exactly. Good point. So each box at the end, when you're done, should contain an arrow pointing in some direction. All right, I think we'll see if anybody has solved this problem now. Who thinks they have solved it? Great. Would you like to share your solution?

AUDIENCE: Yeah, so I think it's going to go up first.

FREDRIK D. JOHANSSON: I'm going to try to replicate this. Ooh, sorry about that. OK, you're saying up here?

AUDIENCE: Yeah. The basic idea is you want to reduce the chance that you're ever adjacent to the red box. So just do everything you can to stay far from it. Yeah, so attempt to go up, and then once you eventually get there, you just have to go right.

FREDRIK D. JOHANSSON: OK. And then?

AUDIENCE: [INAUDIBLE].

FREDRIK D. JOHANSSON: OK. So what about these ones?
These are also part of the policy, by the way.

AUDIENCE: I hadn't thought about this.

FREDRIK D. JOHANSSON: OK.

AUDIENCE: But those, you [INAUDIBLE], right?

FREDRIK D. JOHANSSON: No.

AUDIENCE: Minus 0.04.

FREDRIK D. JOHANSSON: So "discount" usually means something else-- we'll get to that later. But that is a reward for just taking any step: if you move into a space that is not terminal, you incur that negative reward.

AUDIENCE: So if you keep bouncing around for a really long time, you incur a large negative reward.

FREDRIK D. JOHANSSON: If we had this, there's some chance I'd never get out of here, and very little chance of that working out. But it's a very bad policy, because you keep moving back and forth.

All right, we had an arm somewhere. What should I do here?

AUDIENCE: You could take a vote.

FREDRIK D. JOHANSSON: OK. Who thinks right? Really? Who thinks left? OK, interesting. I don't actually remember. Let's see. Go ahead.

AUDIENCE: I was just saying, that's an easy one.

FREDRIK D. JOHANSSON: Yeah, so this is the part that we already determined. If we had deterministic transitions, this would be great, because we wouldn't have to think about the other ones. This is what Peter put on the slide, so I'm going to have to disagree with the vote there, actually. It depends heavily on the minus 0.04: if you increase that by a little bit, you might want to go that way instead. Or if you decrease-- I don't remember. Decrease, exactly. And if you increase it, you might get something else-- it might actually be good to terminate. So those details matter a little bit, but I think you've got the general idea. And I especially like that you commented that you want to stay away from the red one, because if you look at these different paths--
735 00:33:33,960 --> 00:33:35,670 You go up there and there-- 736 00:33:35,670 --> 00:33:37,170 they have the same number of states, 737 00:33:37,170 --> 00:33:39,212 but there's less chance you end up in the red box 738 00:33:39,212 --> 00:33:42,030 if you take the upper route. 739 00:33:42,030 --> 00:33:43,200 Great. 740 00:33:43,200 --> 00:33:44,805 So we have an example of a policy 741 00:33:44,805 --> 00:33:46,680 and we have an example of a decision process. 742 00:33:46,680 --> 00:33:49,560 And things are working out so far. 743 00:33:49,560 --> 00:33:52,560 But how do we do this? 744 00:33:52,560 --> 00:33:56,110 As far as the class goes, this was a blackbox experiment. 745 00:33:56,110 --> 00:33:58,710 I don't know anything about how you figured that out. 746 00:33:58,710 --> 00:34:00,757 So reinforcement learning is about that-- 747 00:34:00,757 --> 00:34:02,340 reinforcement learning is try and come 748 00:34:02,340 --> 00:34:06,690 up with a policy in a rigorous way, hopefully-- ideally. 749 00:34:06,690 --> 00:34:09,728 So that would be the next topic here. 750 00:34:09,728 --> 00:34:12,270 Up until this point, are there any questions that you've been 751 00:34:12,270 --> 00:34:15,170 dying to ask, but haven't? 752 00:34:15,170 --> 00:34:18,389 AUDIENCE: I'm curious how much behavioral biases could 753 00:34:18,389 --> 00:34:20,699 play into the first Markov assumption? 754 00:34:20,699 --> 00:34:22,679 So for example, if you're a clinician who's 755 00:34:22,679 --> 00:34:24,179 been working for 30 years and you're 756 00:34:24,179 --> 00:34:26,298 just really used to giving a certain treatment. 757 00:34:26,298 --> 00:34:27,840 An action that you gave in the past-- 758 00:34:27,840 --> 00:34:30,800 that habit might influence an action in the future. 759 00:34:30,800 --> 00:34:33,420 And if that is a worry, how one might 760 00:34:33,420 --> 00:34:36,510 think about addressing it. 761 00:34:36,510 --> 00:34:38,460 FREDRIK D. JOHANSSON: Interesting. 762 00:34:38,460 --> 00:34:40,830 I guess it depends a little bit on how it manifests, 763 00:34:40,830 --> 00:34:45,239 in that if it also influenced your most recent action, maybe 764 00:34:45,239 --> 00:34:47,880 you have an observation of that already in some sense. 765 00:34:51,929 --> 00:34:53,130 It's a very broad question. 766 00:34:53,130 --> 00:34:54,690 What effect will that have? 767 00:34:54,690 --> 00:34:56,639 Did you have something specific in mind? 768 00:34:56,639 --> 00:34:58,222 AUDIENCE: I guess I was just wondering 769 00:34:58,222 --> 00:35:01,833 if it violated that assumption, that an action of the past 770 00:35:01,833 --> 00:35:02,750 influenced an action-- 771 00:35:02,750 --> 00:35:04,167 FREDRIK D. JOHANSSON: Interesting. 772 00:35:04,167 --> 00:35:09,450 So I guess my response there is that the action didn't really 773 00:35:09,450 --> 00:35:11,160 depend on the choice of action before, 774 00:35:11,160 --> 00:35:13,140 because the policy remained the same. 775 00:35:13,140 --> 00:35:16,230 You could have a bias towards an action without that 776 00:35:16,230 --> 00:35:19,200 being dependent on what you gave as action before, 777 00:35:19,200 --> 00:35:20,700 if you know what I mean. 778 00:35:20,700 --> 00:35:22,787 Say my probability of giving action one is 1, 779 00:35:22,787 --> 00:35:24,870 then it doesn't matter that I give it in the past. 780 00:35:24,870 --> 00:35:27,450 My policy is still the same. 781 00:35:27,450 --> 00:35:29,290 So, not necessarily. 
782 00:35:29,290 --> 00:35:30,720 It could have other consequences. 783 00:35:30,720 --> 00:35:34,740 We might have reason to come back to that question later. 784 00:35:34,740 --> 00:35:35,700 Yup. 785 00:35:35,700 --> 00:35:39,330 AUDIENCE: Just practically, I would 786 00:35:39,330 --> 00:35:42,820 think that a doctor would want to be consistent. 787 00:35:42,820 --> 00:35:44,720 And so you wouldn't, for example, 788 00:35:44,720 --> 00:35:46,240 want to put somebody on a ventilator 789 00:35:46,240 --> 00:35:49,423 and then immediately take them off and then immediately put 790 00:35:49,423 --> 00:35:50,810 them back on again. 791 00:35:50,810 --> 00:35:52,990 So that would be an example where 792 00:35:52,990 --> 00:35:55,115 the past action influences what you're going to do. 793 00:35:55,115 --> 00:35:56,740 FREDRIK D. JOHANSSON: Completely, yeah. 794 00:35:56,740 --> 00:35:58,300 I think that's a great example. 795 00:35:58,300 --> 00:36:03,340 And what you would hope is that the state variable in that case 796 00:36:03,340 --> 00:36:06,020 includes some notion of treatment history. 797 00:36:06,020 --> 00:36:08,740 That's what your job is then. 798 00:36:08,740 --> 00:36:11,230 So that's why state can be somewhat misleading 799 00:36:11,230 --> 00:36:13,030 as a term-- at least for me, I'm not 800 00:36:13,030 --> 00:36:17,190 American or English-speaking. 801 00:36:17,190 --> 00:36:20,560 But yeah, I think of it as too instantaneous sometimes. 802 00:36:20,560 --> 00:36:22,660 So we'll move into reinforcement learning now. 803 00:36:22,660 --> 00:36:28,540 And what I had you do on the last slide-- 804 00:36:28,540 --> 00:36:31,990 well, I don't know which method you use, but most likely 805 00:36:31,990 --> 00:36:33,160 the middle one. 806 00:36:33,160 --> 00:36:36,730 There are three very common paradigms 807 00:36:36,730 --> 00:36:38,290 for reinforcement learning. 808 00:36:38,290 --> 00:36:43,630 And they are essentially divided by what they focus on modeling. 809 00:36:43,630 --> 00:36:46,900 Unsurprisingly, model-based RL focused on-- 810 00:36:46,900 --> 00:36:49,360 well, it has some sort of model in it, at least. 811 00:36:52,702 --> 00:36:54,160 What you mean by model in this case 812 00:36:54,160 --> 00:36:55,720 is a model of the transitions. 813 00:36:55,720 --> 00:36:59,290 So what state will I end up in, given the action in the state 814 00:36:59,290 --> 00:37:01,160 I'm in at the moment? 815 00:37:01,160 --> 00:37:03,550 So model-based RL tries to essentially create a model 816 00:37:03,550 --> 00:37:05,746 for the environment or of the environment. 817 00:37:08,740 --> 00:37:11,770 There are several examples of model-based RL. 818 00:37:11,770 --> 00:37:13,300 One of them is G-computation, which 819 00:37:13,300 --> 00:37:15,910 comes out of the statistic literature, if you like. 820 00:37:15,910 --> 00:37:19,095 And MDPs are essentially-- 821 00:37:19,095 --> 00:37:20,470 that's a Markov decision process, 822 00:37:20,470 --> 00:37:22,780 which is essentially trying to estimate the whole distribution 823 00:37:22,780 --> 00:37:23,947 that we talked about before. 824 00:37:28,620 --> 00:37:30,580 There are various ups and downsides of this. 825 00:37:30,580 --> 00:37:33,100 We won't have time to go into all of these paradigms today. 826 00:37:33,100 --> 00:37:37,835 We will actually focus only on value-based RL today. 827 00:37:37,835 --> 00:37:39,460 Yeah, you can ask me offline if you are 828 00:37:39,460 --> 00:37:41,500 interested in model-based RL. 
829 00:37:41,500 --> 00:37:44,320 The rightmost one here is policy-based RL, 830 00:37:44,320 --> 00:37:46,720 where you essentially focus only on modeling 831 00:37:46,720 --> 00:37:52,570 the policy that was used in the data that you observed. 832 00:37:52,570 --> 00:37:56,590 And the policy that you want to essentially arrive at. 833 00:37:56,590 --> 00:37:58,390 So you're optimizing a policy and you 834 00:37:58,390 --> 00:38:01,570 are estimating a policy that was used in the past. 835 00:38:01,570 --> 00:38:04,990 And the middle one focuses on neither of those 836 00:38:04,990 --> 00:38:07,720 and focuses on only estimating the return-- 837 00:38:07,720 --> 00:38:11,080 that was the G. Or the reward function as a function 838 00:38:11,080 --> 00:38:13,102 of your actions and states. 839 00:38:13,102 --> 00:38:14,560 And it's interesting to me that you 840 00:38:14,560 --> 00:38:17,260 can pick any of the variables-- 841 00:38:17,260 --> 00:38:18,822 A, S, and R-- and model those. 842 00:38:18,822 --> 00:38:21,280 And you can arrive at something reasonable in reinforcement 843 00:38:21,280 --> 00:38:22,710 learning. 844 00:38:22,710 --> 00:38:24,330 This one is particularly interesting, 845 00:38:24,330 --> 00:38:29,010 because it doesn't try to understand 846 00:38:29,010 --> 00:38:32,632 how do you arrive at a certain return based 847 00:38:32,632 --> 00:38:33,840 on the actions in the states? 848 00:38:33,840 --> 00:38:35,880 It's just optimize the policy directly. 849 00:38:35,880 --> 00:38:37,110 And it has some obvious-- 850 00:38:37,110 --> 00:38:42,010 well, not obvious, but it has some downsides, not doing that. 851 00:38:42,010 --> 00:38:44,980 OK, anyway, we're going to focus on value-based RL. 852 00:38:44,980 --> 00:38:49,090 And the very dominant instantiation of value-based RL 853 00:38:49,090 --> 00:38:49,920 is Q-learning. 854 00:38:49,920 --> 00:38:51,870 I'm sure you've heard of it. 855 00:38:51,870 --> 00:38:54,990 It is what drove the success stories that I showed before, 856 00:38:54,990 --> 00:38:58,120 the goal in the StarCraft and everything. 857 00:38:58,120 --> 00:39:01,170 G-estimation is another example of this, which, again, has come 858 00:39:01,170 --> 00:39:02,670 from the statistic literature. 859 00:39:02,670 --> 00:39:06,600 But we'll focus on Q-learning today. 860 00:39:06,600 --> 00:39:10,210 So Q-learning is an example of dynamic programming, 861 00:39:10,210 --> 00:39:10,843 in some sense. 862 00:39:10,843 --> 00:39:12,260 That's how it's usually explained. 863 00:39:12,260 --> 00:39:14,718 And I just wanted to check-- how many have heard the phrase 864 00:39:14,718 --> 00:39:15,930 dynamic programming before? 865 00:39:15,930 --> 00:39:18,030 OK, great. 866 00:39:18,030 --> 00:39:21,360 So I won't go into details of dynamic programming in general. 867 00:39:21,360 --> 00:39:25,950 But the general idea is one of recursion. 868 00:39:25,950 --> 00:39:30,120 In this case, you know something about what 869 00:39:30,120 --> 00:39:31,500 is a good terminal state. 870 00:39:31,500 --> 00:39:33,500 And then you want to figure out how to get there 871 00:39:33,500 --> 00:39:35,730 and how to get to the state before that and the state 872 00:39:35,730 --> 00:39:37,470 before that and so on. 873 00:39:37,470 --> 00:39:39,888 That is the recursion that we're talking about. 874 00:39:39,888 --> 00:39:42,180 The end state that is the best here is fairly obvious-- 875 00:39:42,180 --> 00:39:43,350 that is the plus 1 here. 
876 00:39:47,130 --> 00:39:50,940 The only way to get there is by stopping here first, 877 00:39:50,940 --> 00:39:55,170 because you can't move from here since it's a terminal state. 878 00:39:55,170 --> 00:39:57,010 Your only bet is that one. 879 00:39:57,010 --> 00:39:59,550 And then we can ask what is the best way to get to 3, 1? 880 00:39:59,550 --> 00:40:02,100 How do we get to the state before the best state? 881 00:40:02,100 --> 00:40:06,370 Well, we can say that one way is go from here. 882 00:40:06,370 --> 00:40:07,470 And one way from here. 883 00:40:07,470 --> 00:40:09,300 And as we got from the audience before, 884 00:40:09,300 --> 00:40:12,600 this is a slightly worse way to get there then 885 00:40:12,600 --> 00:40:15,810 from there, because here we have a possibility of ending up 886 00:40:15,810 --> 00:40:17,950 in minus 1. 887 00:40:17,950 --> 00:40:20,923 So then we recurse further and essentially, 888 00:40:20,923 --> 00:40:22,840 we end up with something like this that says-- 889 00:40:25,860 --> 00:40:30,160 or what I tried to illustrate here is that the green boxes-- 890 00:40:30,160 --> 00:40:33,340 I'm sorry for any colorblind members of the audience, 891 00:40:33,340 --> 00:40:35,870 because this was a poor choice of mine. 892 00:40:35,870 --> 00:40:38,110 Anyway, this bottom side here is mostly red 893 00:40:38,110 --> 00:40:39,412 and this is mostly green. 894 00:40:39,412 --> 00:40:41,620 And you can follow the green color here, essentially, 895 00:40:41,620 --> 00:40:45,460 to get to the best end state. 896 00:40:45,460 --> 00:40:47,950 And what I used here to color this in 897 00:40:47,950 --> 00:40:50,860 is this idea of knowing how good a state is, 898 00:40:50,860 --> 00:40:54,550 depending on how good the state after that state is. 899 00:40:54,550 --> 00:40:57,960 So I knew that plus 1 is a good end state over there. 900 00:40:57,960 --> 00:41:03,870 And that led me to recurse backwards, essentially. 901 00:41:03,870 --> 00:41:07,520 So the question, then, is how do we 902 00:41:07,520 --> 00:41:10,210 know that that state over there is a good one? 903 00:41:10,210 --> 00:41:11,960 When we have it visualized in front of us, 904 00:41:11,960 --> 00:41:13,340 it's very easy to see. 905 00:41:13,340 --> 00:41:16,230 And it's very easy because we know that plus 1 is a terminal 906 00:41:16,230 --> 00:41:16,730 state here. 907 00:41:16,730 --> 00:41:20,180 It ends there, so those are the only states we 908 00:41:20,180 --> 00:41:21,680 need to consider in this case. 909 00:41:21,680 --> 00:41:25,370 But more in general, how do we learn 910 00:41:25,370 --> 00:41:28,520 what is the value of a state? 911 00:41:28,520 --> 00:41:32,730 That will be the purpose of Q-learning. 912 00:41:32,730 --> 00:41:35,040 If we have an idea of what is a good state, 913 00:41:35,040 --> 00:41:40,260 we can always do that recursion that I explained very briefly. 914 00:41:40,260 --> 00:41:42,750 You find a state that has the high value 915 00:41:42,750 --> 00:41:44,250 and you figure out how to get there. 916 00:41:47,930 --> 00:41:53,070 So we're going to have to define now what I mean by value. 917 00:41:53,070 --> 00:41:56,698 I've used that word a few different times. 918 00:41:56,698 --> 00:41:58,740 I say recall here, but I don't know if I actually 919 00:41:58,740 --> 00:41:59,830 had it on a slide before. 920 00:41:59,830 --> 00:42:02,332 So let's just say this is the definition of value 921 00:42:02,332 --> 00:42:03,540 that we will be working with. 
922 00:42:07,175 --> 00:42:09,050 I think I had it on a slide before, actually. 923 00:42:09,050 --> 00:42:11,050 This is the expected return. 924 00:42:11,050 --> 00:42:12,670 Remember, this G here was the sum 925 00:42:12,670 --> 00:42:17,470 of rewards going into the future, starting at time, t. 926 00:42:17,470 --> 00:42:20,680 And the value, then, of this state 927 00:42:20,680 --> 00:42:22,150 is the expectation of such returns. 928 00:42:25,690 --> 00:42:28,000 Before, I said that the value of an policy 929 00:42:28,000 --> 00:42:32,140 was the expectation of returns, period. 930 00:42:32,140 --> 00:42:34,900 And the value of a state and the policy 931 00:42:34,900 --> 00:42:38,330 is the value of that return starting in a certain state. 932 00:42:41,220 --> 00:42:44,010 We can stratify this further if we like and say 933 00:42:44,010 --> 00:42:47,090 that the value of a state action pair 934 00:42:47,090 --> 00:42:50,270 is the expected return, starting in a certain state 935 00:42:50,270 --> 00:42:51,580 and taking an action, a. 936 00:42:51,580 --> 00:42:55,990 And after that, following the policy, pi. 937 00:42:55,990 --> 00:42:59,260 This would be the so-called Q value 938 00:42:59,260 --> 00:43:01,000 of a state-action pair-- s, a. 939 00:43:03,650 --> 00:43:07,540 And this is where Q-learning gets its name. 940 00:43:07,540 --> 00:43:12,100 So Q-learning attempts to estimate the Q function-- 941 00:43:12,100 --> 00:43:14,797 the expected return starting in a state, s, and taking action, 942 00:43:14,797 --> 00:43:15,297 a-- 943 00:43:17,920 --> 00:43:18,510 from data. 944 00:43:22,090 --> 00:43:23,740 The Q-learning is also associated 945 00:43:23,740 --> 00:43:26,050 with a deterministic policy. 946 00:43:26,050 --> 00:43:28,840 So the policy and the Q function go together 947 00:43:28,840 --> 00:43:30,740 in this specific way. 948 00:43:30,740 --> 00:43:33,040 If we have a Q function, Q, that tries 949 00:43:33,040 --> 00:43:37,330 to estimate the value of a policy, pi, the pi itself is 950 00:43:37,330 --> 00:43:40,750 the arg max according to that Q. It sounds a little recursive, 951 00:43:40,750 --> 00:43:44,010 but hopefully it will be OK. 952 00:43:44,010 --> 00:43:46,090 Maybe it's more obvious if we look here. 953 00:43:46,090 --> 00:43:48,790 So Q, I said before, was the value of starting an s, 954 00:43:48,790 --> 00:43:53,560 taking action, a, and then following policy, pi. 955 00:43:53,560 --> 00:43:57,670 This is defined by the decision process itself. 956 00:43:57,670 --> 00:44:02,363 The best pi, the best policy, is the one that has the highest Q. 957 00:44:02,363 --> 00:44:03,780 And this is what we call a Q-star. 958 00:44:06,328 --> 00:44:08,120 Well, that is not what we call Q-star, that 959 00:44:08,120 --> 00:44:10,030 is what we call little q-star. 960 00:44:10,030 --> 00:44:12,530 Q-star, the best estimate of this, is obviously the thing 961 00:44:12,530 --> 00:44:13,320 itself. 962 00:44:13,320 --> 00:44:15,440 So if you can find a good function that 963 00:44:15,440 --> 00:44:17,780 assigns a value to a state-action pair, 964 00:44:17,780 --> 00:44:19,310 the best such function you can get 965 00:44:19,310 --> 00:44:21,290 is the one that is equal to little q-star. 966 00:44:24,270 --> 00:44:26,670 I hope that wasn't too confusing. 967 00:44:26,670 --> 00:44:30,300 I'll show on the next slide why that might be reasonable. 
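In symbols, the quantities just described can be written roughly as follows; this is a reconstruction of the slide's notation, and whether the return G_t is discounted depends on how G was defined earlier:

    V^{\pi}(s)    = \mathbb{E}_{\pi}[ G_t \mid S_t = s ]             % value of state s under policy pi
    Q^{\pi}(s, a) = \mathbb{E}_{\pi}[ G_t \mid S_t = s, A_t = a ]     % take action a now, follow pi afterwards
    \pi_Q(s)      = \arg\max_a Q(s, a)                                % the deterministic policy associated with a Q function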
968 00:44:30,300 --> 00:44:34,190 So Q-learning is based on a general idea 969 00:44:34,190 --> 00:44:38,950 from dynamic programming, which is the so-called Bellman 970 00:44:38,950 --> 00:44:39,450 equation. 971 00:44:39,450 --> 00:44:39,950 There we go. 972 00:44:46,680 --> 00:44:49,230 This is an instantiation of Bellman optimality, which 973 00:44:49,230 --> 00:44:55,950 says that the best state-action value 974 00:44:55,950 --> 00:44:58,670 function has the property that it 975 00:44:58,670 --> 00:45:03,210 is equal to the immediate reward of taking action, a, in state, 976 00:45:03,210 --> 00:45:07,290 s, plus this, which is the maximum Q 977 00:45:07,290 --> 00:45:08,420 value for the next state. 978 00:45:08,420 --> 00:45:10,170 So we're going to stare at this for a bit, 979 00:45:10,170 --> 00:45:14,500 because there's a bit here to digest. 980 00:45:14,500 --> 00:45:19,390 Remember, q-star assigns a value to any state-action pair. 981 00:45:19,390 --> 00:45:22,160 So we have q-star here, we have q-star here. 982 00:45:22,160 --> 00:45:25,030 This thing here is supposed to represent the value going 983 00:45:25,030 --> 00:45:29,260 forward in time after I've made this choice, action, a, 984 00:45:29,260 --> 00:45:29,920 in state, s. 985 00:45:33,470 --> 00:45:36,790 If I have a good idea of how good it is to take action, 986 00:45:36,790 --> 00:45:39,830 a, in state, s, it should both incorporate the immediate 987 00:45:39,830 --> 00:45:41,600 reward that I get-- that's RT-- 988 00:45:41,600 --> 00:45:44,220 and how good that choice was going forward. 989 00:45:44,220 --> 00:45:46,590 So think about mechanical ventilation, as I said before. 990 00:45:46,590 --> 00:45:48,507 If we put a patient on mechanical ventilation, 991 00:45:48,507 --> 00:45:50,720 we have to do a bunch of other things after that. 992 00:45:50,720 --> 00:45:53,930 If none of those other things lead to a good outcome, 993 00:45:53,930 --> 00:45:56,630 this part will be low. 994 00:45:56,630 --> 00:45:59,620 Even if the immediate return is good. 995 00:45:59,620 --> 00:46:03,710 So for the optimal q-star, this equality holds. 996 00:46:03,710 --> 00:46:05,840 We know that-- we can prove that. 997 00:46:05,840 --> 00:46:07,820 So the question is how do we find this thing? 998 00:46:07,820 --> 00:46:09,290 How do we find q-star? 999 00:46:09,290 --> 00:46:11,930 Because q-star is not only the thing that gives you 1000 00:46:11,930 --> 00:46:13,070 the optimal policy-- 1001 00:46:13,070 --> 00:46:15,987 it also satisfies this equality. 1002 00:46:15,987 --> 00:46:17,570 This is not true for every Q function, 1003 00:46:17,570 --> 00:46:18,987 but it's true for the optimal one. 1004 00:46:21,460 --> 00:46:22,409 Questions? 1005 00:46:26,990 --> 00:46:29,750 If you haven't seen this before, it might be a little tough 1006 00:46:29,750 --> 00:46:32,130 to digest. 1007 00:46:32,130 --> 00:46:33,130 Is the notation clear? 1008 00:46:33,130 --> 00:46:34,750 Essentially, here you have the state 1009 00:46:34,750 --> 00:46:36,780 that you are arriving at the next time. 1010 00:46:36,780 --> 00:46:40,990 A prime is the parameter of this here, or the argument to this. 1011 00:46:40,990 --> 00:46:44,508 You're taking the best possible q-star value in the 1012 00:46:44,508 --> 00:46:45,800 state that you arrive at after. 1013 00:46:45,800 --> 00:46:46,425 Yeah, go ahead. 1014 00:46:46,425 --> 00:46:49,105 AUDIENCE: Can you instantiate the example you have on the board? 1015 00:46:49,105 --> 00:46:50,188 FREDRIK D.
JOHANSSON: Yes. 1016 00:46:50,188 --> 00:46:52,840 Actually, I might do a full example of Q-learning 1017 00:46:52,840 --> 00:46:53,530 in a second. 1018 00:46:53,530 --> 00:46:54,760 Yes, I will. 1019 00:46:54,760 --> 00:46:56,134 I'll get to that example then. 1020 00:47:00,453 --> 00:47:02,120 Yeah, I was debating whether to do that. 1021 00:47:02,120 --> 00:47:04,120 It might take some time, but it could be useful. 1022 00:47:04,120 --> 00:47:04,930 So where are we? 1023 00:47:09,510 --> 00:47:12,930 So what I showed you before-- the Bellman equality. 1024 00:47:12,930 --> 00:47:14,880 We know that this holds for the optimal thing. 1025 00:47:14,880 --> 00:47:18,330 And if there is an equality that is true at an optimum, 1026 00:47:18,330 --> 00:47:21,900 one general idea in optimization is this so-called fixed-point 1027 00:47:21,900 --> 00:47:26,230 iteration that you can do to arrive there. 1028 00:47:26,230 --> 00:47:29,610 And that's essentially what we will do to get to a good Q. 1029 00:47:29,610 --> 00:47:31,980 So a nice thing about Q-learning is 1030 00:47:31,980 --> 00:47:36,750 that if your state and action spaces are small and discrete, 1031 00:47:36,750 --> 00:47:39,190 you can represent the Q function as a table. 1032 00:47:39,190 --> 00:47:40,690 So all you have to keep track of is, 1033 00:47:40,690 --> 00:47:44,970 how good is a certain action in a certain state? 1034 00:47:44,970 --> 00:47:47,555 Or all actions in all states, rather? 1035 00:47:47,555 --> 00:47:48,680 So that's what we did here. 1036 00:47:48,680 --> 00:47:51,330 This is a table. 1037 00:47:51,330 --> 00:47:54,990 I've described to you the policy here, but what we'll do next 1038 00:47:54,990 --> 00:47:58,560 is to describe the value of each action. 1039 00:47:58,560 --> 00:48:02,370 So you can think of a value of taking the right one, bottom, 1040 00:48:02,370 --> 00:48:04,080 top, and left, essentially. 1041 00:48:04,080 --> 00:48:08,600 Those will be the values that we need to consider. 1042 00:48:08,600 --> 00:48:10,930 And so what Q-learning can do with discrete states is 1043 00:48:10,930 --> 00:48:14,110 to essentially start from somewhere, 1044 00:48:14,110 --> 00:48:16,450 start from some idea of what Q is-- could be random, 1045 00:48:16,450 --> 00:48:17,710 could be 0. 1046 00:48:17,710 --> 00:48:20,890 And then repeat the following fixed-point iteration, 1047 00:48:20,890 --> 00:48:25,000 where you update your former idea of what 1048 00:48:25,000 --> 00:48:27,820 Q should be, with its current value 1049 00:48:27,820 --> 00:48:32,830 plus essentially a mixture of the immediate reward for taking 1050 00:48:32,830 --> 00:48:35,980 action, At, in that state, and the future reward, 1051 00:48:35,980 --> 00:48:38,810 as judged by your current estimate of the Q function. 1052 00:48:38,810 --> 00:48:40,390 So we'll do that now in practice. 1053 00:48:40,390 --> 00:48:41,200 Yeah. 1054 00:48:41,200 --> 00:48:42,825 AUDIENCE: Throughout this, where are we 1055 00:48:42,825 --> 00:48:44,350 getting the transition probabilities 1056 00:48:44,350 --> 00:48:45,517 or the behavior of the game? 1057 00:48:45,517 --> 00:48:47,892 FREDRIK D. JOHANSSON: So they're not used here, actually. 1058 00:48:47,892 --> 00:48:50,030 Value-based RL-- I didn't say that explicitly, 1059 00:48:50,030 --> 00:48:53,240 but it doesn't rely on knowing the transition probabilities.
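In code, the tabular update just described looks roughly like this minimal sketch; the dict-based representation of Q, the value of alpha, and the terminal-state handling are illustrative assumptions rather than anything from the slides:

    # One tabular Q-learning update for an observed transition (s, a, r, s_next).
    # It is a sample-based fixed-point iteration on the Bellman equality
    #   q*(s, a) = E[ R + gamma * max_{a'} q*(S', a') ].
    def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0, terminal=False):
        # Q: dict mapping (state, action) -> current estimate, initialized to 0 (or randomly).
        # Best value attainable from the next state; zero if the episode ended there.
        future = 0.0 if terminal else max(Q[(s_next, a2)] for a2 in actions)
        # Nudge the old estimate toward the one-step target r + gamma * max_a' Q(s', a').
        Q[(s, a)] += alpha * (r + gamma * future - Q[(s, a)])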
1060 00:48:53,240 --> 00:48:56,290 What you might ask is where do we get the S and the As 1061 00:48:56,290 --> 00:48:58,060 and the Rs from? 1062 00:48:58,060 --> 00:48:59,560 And we'll get to that. 1063 00:48:59,560 --> 00:49:00,780 How do we estimate these? 1064 00:49:00,780 --> 00:49:03,130 We'll get to that later. 1065 00:49:03,130 --> 00:49:05,720 Good question, though. 1066 00:49:05,720 --> 00:49:07,770 I'm going to throw a very messy slide at you. 1067 00:49:07,770 --> 00:49:09,780 Here you go. 1068 00:49:09,780 --> 00:49:11,620 A lot of numbers. 1069 00:49:11,620 --> 00:49:14,595 So what I've done now here is a more exhaustive version 1070 00:49:14,595 --> 00:49:15,720 of what I put on the board. 1071 00:49:15,720 --> 00:49:20,070 For each little triangle here represents the Q value 1072 00:49:20,070 --> 00:49:21,300 for the state-action pair. 1073 00:49:21,300 --> 00:49:23,930 So this triangle is, again, for the action right 1074 00:49:23,930 --> 00:49:24,930 if you're in this state. 1075 00:49:27,870 --> 00:49:31,470 So what I've put on the first slide here 1076 00:49:31,470 --> 00:49:38,770 is the immediate reward of each action. 1077 00:49:38,770 --> 00:49:42,960 So we know that any step will cost us minus 0.04. 1078 00:49:42,960 --> 00:49:44,710 So that's why there's a lot of those here. 1079 00:49:44,710 --> 00:49:49,250 These white boxes here are not possible actions. 1080 00:49:49,250 --> 00:49:51,350 Up here, you have a 0.96, because it's 1081 00:49:51,350 --> 00:49:54,170 1, which is the immediate reward of going right here, 1082 00:49:54,170 --> 00:49:56,330 minus 0.04. 1083 00:49:56,330 --> 00:50:01,220 These two are minus 1.04 for the same reason-- 1084 00:50:01,220 --> 00:50:03,910 because you arrive in minus 1. 1085 00:50:03,910 --> 00:50:07,370 OK, so that's the first step and the second step done. 1086 00:50:07,370 --> 00:50:09,470 We initialize Qs to be 0. 1087 00:50:09,470 --> 00:50:12,770 And then we picked these two parameters of the problem, 1088 00:50:12,770 --> 00:50:14,430 alpha and gamma, to be 1. 1089 00:50:14,430 --> 00:50:18,170 And then we did the first iteration of Q-learning, 1090 00:50:18,170 --> 00:50:21,560 where we set the Q to be the old version of Q, 1091 00:50:21,560 --> 00:50:25,850 which was 0, plus alpha times this thing here. 1092 00:50:25,850 --> 00:50:28,340 So Q was 0, that means that this is also 0. 1093 00:50:28,340 --> 00:50:31,820 So the only thing we need to look at is this thing here. 1094 00:50:31,820 --> 00:50:35,300 This also is 0, because the Qs for all states 1095 00:50:35,300 --> 00:50:37,780 were 0, so the only thing we end up with is R. 1096 00:50:37,780 --> 00:50:39,530 And that's what populated this table here. 1097 00:50:44,000 --> 00:50:47,900 Next timestep-- I'm doing Q-learning now 1098 00:50:47,900 --> 00:50:51,380 in a way where I update all the state-action pairs at once. 1099 00:50:51,380 --> 00:50:52,130 How can I do that? 1100 00:50:52,130 --> 00:50:54,547 Well, it depends on the question I got there, essentially. 1101 00:50:54,547 --> 00:50:55,550 What data do I observe? 1102 00:50:55,550 --> 00:50:59,800 Or how do I get to know the rewards of the S&A pairs? 1103 00:50:59,800 --> 00:51:02,530 We'll come back to that. 1104 00:51:02,530 --> 00:51:09,360 So in the next step, I have to update everything again. 
1105 00:51:09,360 --> 00:51:12,390 So it's the previous Q value, which was minus 0.04 1106 00:51:12,390 --> 00:51:16,590 for a lot of things, then plus the immediate reward, which 1107 00:51:16,590 --> 00:51:17,460 was this RT. 1108 00:51:17,460 --> 00:51:19,260 And I have to keep going. 1109 00:51:19,260 --> 00:51:23,100 So the dominant thing for the table this time 1110 00:51:23,100 --> 00:51:27,240 was that the best Q value for almost all of these boxes 1111 00:51:27,240 --> 00:51:29,580 was minus 0.04. 1112 00:51:29,580 --> 00:51:31,770 So essentially I will add the immediate reward 1113 00:51:31,770 --> 00:51:33,940 plus that almost everywhere. 1114 00:51:33,940 --> 00:51:37,240 What is interesting, though, is that here, the best Q value 1115 00:51:37,240 --> 00:51:38,590 was 0.96. 1116 00:51:38,590 --> 00:51:41,050 And it will remain so. 1117 00:51:41,050 --> 00:51:44,020 That means that the best Q value for the adjacent states-- 1118 00:51:44,020 --> 00:51:49,840 we look at this max here and get 0.96 out. 1119 00:51:49,840 --> 00:51:52,840 And then add the immediate reward. 1120 00:51:52,840 --> 00:51:56,230 Getting to here gives me 0.96 minus 0.04 1121 00:51:56,230 --> 00:51:58,920 for the immediate reward. 1122 00:51:58,920 --> 00:52:02,890 And now we can figure out what will happen next. 1123 00:52:02,890 --> 00:52:09,730 These values will spread out as you go further away 1124 00:52:09,730 --> 00:52:10,602 from the plus 1. 1125 00:52:10,602 --> 00:52:12,560 I don't think we should go through all of this, 1126 00:52:12,560 --> 00:52:14,260 but you get a sense, essentially, 1127 00:52:14,260 --> 00:52:19,000 how information is moved from the plus 1 and away. 1128 00:52:19,000 --> 00:52:20,830 And I'm sure that's how you solved 1129 00:52:20,830 --> 00:52:24,220 it yourself, in your head. 1130 00:52:24,220 --> 00:52:26,110 But this makes it clear why you can do that, 1131 00:52:26,110 --> 00:52:28,720 even if you don't know where the terminal states are 1132 00:52:28,720 --> 00:52:32,710 or where the value of the state-action pairs are. 1133 00:52:35,320 --> 00:52:37,070 AUDIENCE: Doesn't this calculation 1134 00:52:37,070 --> 00:52:40,250 assume that if you want to move in a certain direction, 1135 00:52:40,250 --> 00:52:41,870 you will move in that direction? 1136 00:52:41,870 --> 00:52:42,990 FREDRIK D. JOHANSSON: Yes. 1137 00:52:42,990 --> 00:52:43,490 Sorry. 1138 00:52:43,490 --> 00:52:44,710 Thanks for reminding me. 1139 00:52:44,710 --> 00:52:46,520 That should have been in the slide, yes. 1140 00:52:46,520 --> 00:52:47,020 Thank you. 1141 00:52:49,892 --> 00:52:51,350 I'm going to skip the rest of this. 1142 00:52:51,350 --> 00:52:52,030 I hope you forgive me. 1143 00:52:52,030 --> 00:52:53,390 We can talk more about it later. 1144 00:52:58,138 --> 00:52:59,680 Thanks for reminding me, Pete, there, 1145 00:52:59,680 --> 00:53:01,430 that one of the things I exploited here 1146 00:53:01,430 --> 00:53:05,000 was that I assume just deterministic transitions. 1147 00:53:05,000 --> 00:53:07,160 Another thing that I relied very heavily on here 1148 00:53:07,160 --> 00:53:10,165 is that I can represent this Q function as a table. 1149 00:53:10,165 --> 00:53:12,290 I drew all these boxes and I filled the numbers in. 1150 00:53:12,290 --> 00:53:13,440 That's easy enough. 1151 00:53:13,440 --> 00:53:15,830 But what if I have thousands of states and thousands 1152 00:53:15,830 --> 00:53:17,510 of actions? 1153 00:53:17,510 --> 00:53:18,992 That's a large table. 
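For reference, the small example just walked through can be reproduced in a few lines of code. This is a minimal sketch assuming the 3-by-4 grid from the slides, deterministic moves, a step reward of minus 0.04, terminal rewards of plus 1 and minus 1, and alpha = gamma = 1; the coordinate encoding is an illustrative choice:

    import itertools

    ROWS, COLS = 3, 4                                  # 3-by-4 grid, row 0 at the top
    WALL = (1, 1)                                      # the blocked square
    TERMINALS = {(0, 3): +1.0, (1, 3): -1.0}           # the +1 and -1 end states
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def step(s, a):
        """Deterministic move; every step costs 0.04, entering a terminal adds +/-1."""
        r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
        if not (0 <= r < ROWS and 0 <= c < COLS) or (r, c) == WALL:
            r, c = s                                   # bumping into the wall or the edge: stay put
        return (r, c), TERMINALS.get((r, c), 0.0) - 0.04

    states = [s for s in itertools.product(range(ROWS), range(COLS))
              if s != WALL and s not in TERMINALS]
    Q = {(s, a): 0.0 for s in states for a in ACTIONS}

    for sweep in range(10):                            # values propagate outward from the +1 corner
        new_Q = {}
        for (s, a) in Q:
            s2, r = step(s, a)
            best_next = 0.0 if s2 in TERMINALS else max(Q[(s2, a2)] for a2 in ACTIONS)
            new_Q[(s, a)] = r + best_next              # alpha = gamma = 1 collapses the update to this
        Q = new_Q

After the first sweep this gives 0.96 and minus 1.04 next to the terminal states and minus 0.04 everywhere else, and later sweeps spread those values out, exactly as on the slides.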
1154 00:53:18,992 --> 00:53:21,450 And not only is it a large table for me to keep in memory-- 1155 00:53:21,450 --> 00:53:24,080 it's also very bad for me statistically. 1156 00:53:24,080 --> 00:53:26,900 If I want to observe anything about a state-action pair, 1157 00:53:26,900 --> 00:53:28,790 I have to do that action in that state. 1158 00:53:28,790 --> 00:53:31,340 And if you think about treating patients in a hospital, 1159 00:53:31,340 --> 00:53:33,440 you're not going to try everything in every state, 1160 00:53:33,440 --> 00:53:34,100 usually. 1161 00:53:34,100 --> 00:53:37,490 You're also not going to have infinite numbers of patients. 1162 00:53:37,490 --> 00:53:40,220 So how do you figure out what is the immediate reward 1163 00:53:40,220 --> 00:53:44,160 of taking a certain action in a certain state? 1164 00:53:44,160 --> 00:53:47,720 And this is where function approximation comes in. 1165 00:53:47,720 --> 00:53:51,940 Essentially, if you can't represent your Q function as a table, 1166 00:53:51,940 --> 00:53:57,610 either for statistical reasons or for memory reasons, 1167 00:53:57,610 --> 00:54:01,820 let's say, you might want to approximate the Q function 1168 00:54:01,820 --> 00:54:06,150 with a parametric or with a non-parametric function. 1169 00:54:06,150 --> 00:54:07,680 And this is exactly what we can do. 1170 00:54:07,680 --> 00:54:11,260 So we can draw now an analogy to what we did last week. 1171 00:54:11,260 --> 00:54:14,490 I'm going to come back to this, but essentially 1172 00:54:14,490 --> 00:54:19,500 instead of doing this fixed-point iteration that we 1173 00:54:19,500 --> 00:54:23,070 did before, we will try and look for a function Q theta that 1174 00:54:23,070 --> 00:54:29,122 is equal to R plus gamma max Q. 1175 00:54:29,122 --> 00:54:31,080 Remember before, we had the Bellman equality? 1176 00:54:31,080 --> 00:54:38,880 We said that q-star S, A is equal to R S, A, 1177 00:54:38,880 --> 00:54:58,880 let's say, plus gamma max A prime q-star S prime A prime, 1178 00:54:58,880 --> 00:55:01,380 where S prime is the state we get to after taking action 1179 00:55:01,380 --> 00:55:04,230 A in state S. So the only thing I've done here 1180 00:55:04,230 --> 00:55:09,300 is to take this equality and make it instead a loss function 1181 00:55:09,300 --> 00:55:13,140 on the violation of this equality. 1182 00:55:13,140 --> 00:55:15,080 So by minimizing this quantity, I 1183 00:55:15,080 --> 00:55:17,270 will find something that has approximately 1184 00:55:17,270 --> 00:55:20,880 the Bellman equality that we talked about before. 1185 00:55:20,880 --> 00:55:23,490 This is the idea of fitted Q-learning, where 1186 00:55:23,490 --> 00:55:28,270 you substitute the tabular representation 1187 00:55:28,270 --> 00:55:30,542 with a function approximation, essentially. 1188 00:55:30,542 --> 00:55:32,250 So just to make this a bit more concrete, 1189 00:55:32,250 --> 00:55:33,625 we can think about the case where 1190 00:55:33,625 --> 00:55:35,920 we have only a single step. 1191 00:55:35,920 --> 00:55:38,760 There's only a single action to make, 1192 00:55:38,760 --> 00:55:41,720 which means that there is no future part of this equation 1193 00:55:41,720 --> 00:55:42,220 here. 1194 00:55:42,220 --> 00:55:44,970 This part goes away, because there's only one 1195 00:55:44,970 --> 00:55:46,920 stage in our trajectory. 1196 00:55:46,920 --> 00:55:48,510 So we have only the immediate reward. 1197 00:55:48,510 --> 00:55:51,080 We have only the Q function.
1198 00:55:51,080 --> 00:55:56,720 Now, this is exactly a regression equation in the way 1199 00:55:56,720 --> 00:55:59,860 that you've seen it when estimating potential outcomes. 1200 00:55:59,860 --> 00:56:07,060 RT here represents the outcome of doing 1201 00:56:07,060 --> 00:56:10,060 action A and state S. And Q here will 1202 00:56:10,060 --> 00:56:11,800 be our estimate of this RT. 1203 00:56:15,620 --> 00:56:18,230 Again, I've said this before-- if we have a single time 1204 00:56:18,230 --> 00:56:22,820 point in our process, the problem reduces 1205 00:56:22,820 --> 00:56:24,490 to estimating potential outcomes, 1206 00:56:24,490 --> 00:56:26,270 just the way we saw it last time. 1207 00:56:26,270 --> 00:56:30,410 We have curves that correspond outcomes 1208 00:56:30,410 --> 00:56:31,730 under different actions. 1209 00:56:31,730 --> 00:56:33,710 And we can do regression adjustment, 1210 00:56:33,710 --> 00:56:37,010 trying to find an F such that this quantity is small 1211 00:56:37,010 --> 00:56:39,740 so that we can model each different potential outcomes. 1212 00:56:39,740 --> 00:56:42,632 And that's exactly what happens with the fitted Q iteration 1213 00:56:42,632 --> 00:56:44,090 if you have a single timestep, too. 1214 00:56:47,670 --> 00:56:51,060 So to make it even more concrete, 1215 00:56:51,060 --> 00:56:55,860 we can say that there's some target value, G hat, which 1216 00:56:55,860 --> 00:57:00,690 represents the immediate reward and the future rewards that is 1217 00:57:00,690 --> 00:57:01,980 the target of our regression. 1218 00:57:01,980 --> 00:57:03,897 And we're fitting some function to that value. 1219 00:57:10,010 --> 00:57:15,010 So the question we got before was 1220 00:57:15,010 --> 00:57:16,970 how do I know the transition matrix? 1221 00:57:16,970 --> 00:57:20,590 How do I get any information about this thing? 1222 00:57:20,590 --> 00:57:22,480 I say here on the slide that, OK, 1223 00:57:22,480 --> 00:57:25,737 we have some target that's R plus future Q values. 1224 00:57:25,737 --> 00:57:27,820 We have some prediction and we have an expectation 1225 00:57:27,820 --> 00:57:29,470 of our transitions here. 1226 00:57:29,470 --> 00:57:33,510 But how do I evaluate this thing? 1227 00:57:33,510 --> 00:57:37,887 The transitions I have to get from somewhere, right? 1228 00:57:37,887 --> 00:57:39,970 And another way to say that is what are the inputs 1229 00:57:39,970 --> 00:57:41,470 and the outputs of our regression? 1230 00:57:41,470 --> 00:57:44,800 Because when we estimate potential outcomes, 1231 00:57:44,800 --> 00:57:48,070 we have a very clear idea of this. 1232 00:57:48,070 --> 00:57:53,080 We know that y, the outcome itself, is a target. 1233 00:57:53,080 --> 00:57:57,670 And the input is the covariates, x. 1234 00:57:57,670 --> 00:58:01,510 But here, we have a moving target, because this Q hat, 1235 00:58:01,510 --> 00:58:03,250 it has to come from somewhere, too. 1236 00:58:03,250 --> 00:58:06,260 This is something that we estimate as well. 1237 00:58:06,260 --> 00:58:10,630 So usually what happens is that we alternate between updating 1238 00:58:10,630 --> 00:58:12,970 this target, Q, and Q theta. 1239 00:58:12,970 --> 00:58:15,880 So essentially, we copy Q theta to become our new Q hat 1240 00:58:15,880 --> 00:58:18,010 and we iterate this somehow. 1241 00:58:18,010 --> 00:58:22,100 But I still haven't told you how to evaluate this expectation. 1242 00:58:22,100 --> 00:58:26,080 So usually in RL, there are a few different ways to do this. 
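Before getting to where those tuples come from, here is a minimal sketch of the alternation just described, fitting Q theta against a target built from the previous fit over a fixed batch of observed transitions (s, a, r, s_next, done). The featurization, the scikit-learn regressor, and the number of iterations are illustrative assumptions, not something prescribed in the lecture:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def fitted_q_iteration(transitions, actions, n_iters=20, gamma=1.0):
        # transitions: list of (s, a, r, s_next, done); s and s_next are feature vectors,
        # and actions is a list of numeric action codes.
        S = np.array([t[0] for t in transitions], dtype=float)
        A = np.array([t[1] for t in transitions], dtype=float).reshape(-1, 1)
        R = np.array([t[2] for t in transitions], dtype=float)
        S2 = np.array([t[3] for t in transitions], dtype=float)
        done = np.array([t[4] for t in transitions], dtype=bool)

        X = np.hstack([S, A])              # regression input: state and action together
        q = None
        for _ in range(n_iters):
            if q is None:
                target = R                 # first pass: regress on the immediate reward only
            else:
                # Moving target: plug the previous fit back in for the next-state value,
                # i.e. target = r + gamma * max_a' Q_hat(s', a'), with zero beyond terminal states.
                next_vals = np.max(np.stack(
                    [q.predict(np.hstack([S2, np.full((len(S2), 1), a)])) for a in actions],
                    axis=1), axis=1)
                target = R + gamma * np.where(done, 0.0, next_vals)
            q = RandomForestRegressor(n_estimators=100).fit(X, target)   # new Q theta
        return q

The fitted q can then be queried on a state paired with each candidate action, and the arg max action taken, mirroring the policy associated with a Q function earlier.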
1243 00:58:26,080 --> 00:58:33,190 And either depending on where you coming from, essentially, 1244 00:58:33,190 --> 00:58:35,680 these are varyingly viable. 1245 00:58:35,680 --> 00:58:41,820 So if we look back at this thing here, 1246 00:58:41,820 --> 00:58:44,250 it relies on having tuples of transitions-- 1247 00:58:44,250 --> 00:58:45,930 the state, the action, the next state, 1248 00:58:45,930 --> 00:58:47,850 and the reward that I got. 1249 00:58:47,850 --> 00:58:50,760 So I have to somehow observe those. 1250 00:58:50,760 --> 00:58:54,420 And I can obtain them in various ways. 1251 00:58:54,420 --> 00:58:56,280 A very common one when it comes to learning 1252 00:58:56,280 --> 00:58:57,660 to play video games, for example, 1253 00:58:57,660 --> 00:59:00,110 is that you do something called on-policy exploration. 1254 00:59:00,110 --> 00:59:02,310 That means that you observe data from the policy 1255 00:59:02,310 --> 00:59:04,200 that you're currently optimizing. 1256 00:59:04,200 --> 00:59:06,360 You just play the game according to the policies 1257 00:59:06,360 --> 00:59:07,650 that you have at the moment. 1258 00:59:07,650 --> 00:59:09,150 And the analogy in health care would 1259 00:59:09,150 --> 00:59:12,480 be that you have some idea of how to treat patients 1260 00:59:12,480 --> 00:59:15,160 and you just do that and see what happens. 1261 00:59:15,160 --> 00:59:17,190 That could be problematic, especially 1262 00:59:17,190 --> 00:59:18,810 if you've got that policy-- 1263 00:59:18,810 --> 00:59:21,540 if you randomly initialized it or if you got it for some 1264 00:59:21,540 --> 00:59:24,700 somewhere very suboptimal. 1265 00:59:24,700 --> 00:59:27,370 A different thing that we're more, perhaps, 1266 00:59:27,370 --> 00:59:30,670 comfortable with in health care, in a restricted setting, 1267 00:59:30,670 --> 00:59:32,890 is the idea of a randomized trial, where, 1268 00:59:32,890 --> 00:59:35,230 instead of trying out some policy that you're currently 1269 00:59:35,230 --> 00:59:38,200 learning, you decide on a population 1270 00:59:38,200 --> 00:59:41,003 where it's OK to flip a coin, essentially, 1271 00:59:41,003 --> 00:59:42,670 between different actions that you have. 1272 00:59:45,655 --> 00:59:47,530 The difference between the sequential setting 1273 00:59:47,530 --> 00:59:49,155 and the one-step setting is that now we 1274 00:59:49,155 --> 00:59:52,000 have to randomize a sequence of actions, which 1275 00:59:52,000 --> 00:59:54,280 is a little bit unlike the clinical trials 1276 00:59:54,280 --> 00:59:56,740 that you have seen before, I think. 1277 00:59:56,740 --> 00:59:58,840 The last one, which is the most studied one 1278 00:59:58,840 --> 01:00:01,750 when it comes to practice, I would say, 1279 01:00:01,750 --> 01:00:05,470 is the one that we talk about this week-- 1280 01:00:05,470 --> 01:00:10,750 is off-policy evaluation or learning, in which case 1281 01:00:10,750 --> 01:00:12,702 you observe health care records, for example. 1282 01:00:12,702 --> 01:00:13,660 You observe registries. 1283 01:00:13,660 --> 01:00:15,730 You observe some data from the health care system 1284 01:00:15,730 --> 01:00:17,680 where patients have already been treated 1285 01:00:17,680 --> 01:00:19,300 and you try to extract a good policy 1286 01:00:19,300 --> 01:00:21,110 based on that information. 1287 01:00:21,110 --> 01:00:24,160 So that means that you see these transitions between state 1288 01:00:24,160 --> 01:00:26,470 and action and the next state and the reward. 
1289 01:00:26,470 --> 01:00:28,710 You see that based on what happened in the past 1290 01:00:28,710 --> 01:00:30,460 and you have to figure out a pattern there 1291 01:00:30,460 --> 01:00:35,250 that helps you come up with a good action or a good policy. 1292 01:00:35,250 --> 01:00:38,330 So we'll focus on that one for now. 1293 01:00:38,330 --> 01:00:44,730 The last part of this talk will be about, essentially, 1294 01:00:44,730 --> 01:00:46,230 what we have to be careful with when 1295 01:00:46,230 --> 01:00:50,153 we learn with off-policy data. 1296 01:00:50,153 --> 01:00:51,570 Any questions up until this point? 1297 01:00:54,580 --> 01:00:55,910 Yeah. 1298 01:00:55,910 --> 01:00:59,150 AUDIENCE: So if [INAUDIBLE] getting there 1299 01:00:59,150 --> 01:01:03,060 for the [INAUDIBLE],, are there any requirements that 1300 01:01:03,060 --> 01:01:06,164 has to be met by [INAUDIBLE],, like how 1301 01:01:06,164 --> 01:01:09,467 we had [INAUDIBLE] and cause inference? 1302 01:01:09,467 --> 01:01:11,300 FREDRIK D. JOHANSSON: Yeah, I'll get to that 1303 01:01:11,300 --> 01:01:13,520 on the next set of slides. 1304 01:01:13,520 --> 01:01:14,980 Thank you. 1305 01:01:14,980 --> 01:01:17,560 Any other questions about the Q-learning part? 1306 01:01:17,560 --> 01:01:19,880 A colleague of mine, Rahul, he said-- 1307 01:01:19,880 --> 01:01:22,130 or maybe he just paraphrased it from someone else. 1308 01:01:22,130 --> 01:01:24,130 But essentially, you have to see RL 1309 01:01:24,130 --> 01:01:27,350 10 times before you get it, or something to that effect. 1310 01:01:27,350 --> 01:01:28,440 I had the same experience. 1311 01:01:28,440 --> 01:01:31,275 So hopefully you have questions for me after. 1312 01:01:31,275 --> 01:01:32,900 AUDIENCE: Human reinforcement learning. 1313 01:01:36,010 --> 01:01:37,800 FREDRIK D. JOHANSSON: Exactly. 1314 01:01:37,800 --> 01:01:41,180 But I think what you should take from the last two sections, 1315 01:01:41,180 --> 01:01:43,305 if not how to do Q-learning in detail, 1316 01:01:43,305 --> 01:01:44,930 because I glossed over a lot of things. 1317 01:01:44,930 --> 01:01:48,230 You should take with you the idea of dynamic programming 1318 01:01:48,230 --> 01:01:49,880 and figuring out, how can I learn 1319 01:01:49,880 --> 01:01:53,330 about what's good early on in my process from what's good late? 1320 01:01:53,330 --> 01:01:56,090 And the idea of moving towards a good state 1321 01:01:56,090 --> 01:01:58,640 and not just arriving there immediately. 1322 01:01:58,640 --> 01:02:01,520 And there are many ways to think about that. 1323 01:02:01,520 --> 01:02:05,780 OK, we'll move on to off-policy learning. 1324 01:02:05,780 --> 01:02:09,470 And again, the set-up here is that we receive trajectories 1325 01:02:09,470 --> 01:02:12,990 of patient states, actions, and rewards from some source. 1326 01:02:12,990 --> 01:02:16,000 We don't know what these sources necessarily-- well, we probably 1327 01:02:16,000 --> 01:02:17,000 know what the source is. 1328 01:02:17,000 --> 01:02:19,400 But we don't know how these actions were performed, 1329 01:02:19,400 --> 01:02:21,740 i.e., we don't know what the policy was that generated 1330 01:02:21,740 --> 01:02:22,670 these trajectories. 1331 01:02:22,670 --> 01:02:24,230 And this is the same set-up as when 1332 01:02:24,230 --> 01:02:28,580 you estimated causal effects last week, to a large extent. 
1333 01:02:28,580 --> 01:02:31,190 We say that the actions are drawn, again, 1334 01:02:31,190 --> 01:02:33,925 according to some behavior policy unknown to us. 1335 01:02:33,925 --> 01:02:35,300 But we want to figure out what is 1336 01:02:35,300 --> 01:02:37,880 the value of a new policy, pi. 1337 01:02:37,880 --> 01:02:40,070 So when I showed you very early on-- 1338 01:02:40,070 --> 01:02:41,360 I wish I had that slide again. 1339 01:02:41,360 --> 01:02:45,740 But essentially, a bunch of patient trajectories and some 1340 01:02:45,740 --> 01:02:46,760 return. 1341 01:02:46,760 --> 01:02:49,400 Patient trajectories, some return. 1342 01:02:49,400 --> 01:02:52,190 The average of those, that's called a value. 1343 01:02:52,190 --> 01:02:55,430 If we have trajectories according to a certain policy, 1344 01:02:55,430 --> 01:02:57,140 that is the value of that policy-- 1345 01:02:57,140 --> 01:02:59,030 the average of these things. 1346 01:02:59,030 --> 01:03:01,760 But when we have trajectories according to one policy 1347 01:03:01,760 --> 01:03:03,980 and want to figure out the value of another one, 1348 01:03:03,980 --> 01:03:06,900 that's the same problem as the covariate adjustment problem 1349 01:03:06,900 --> 01:03:08,400 that you had last week, essentially. 1350 01:03:08,400 --> 01:03:13,320 Or the confounding problem, essentially. 1351 01:03:13,320 --> 01:03:15,740 The trajectories that we draw are 1352 01:03:15,740 --> 01:03:18,860 biased according to the policy of the clinician that 1353 01:03:18,860 --> 01:03:20,000 created them. 1354 01:03:20,000 --> 01:03:22,862 And we want to figure out the value of a different policy. 1355 01:03:22,862 --> 01:03:24,320 So it's the same as the confounding 1356 01:03:24,320 --> 01:03:26,030 problem from the last time. 1357 01:03:26,030 --> 01:03:30,260 And because it is the same as the confounding from last time, 1358 01:03:30,260 --> 01:03:33,160 we know that this is at least as hard as doing that. 1359 01:03:33,160 --> 01:03:36,310 We have confounding-- I already alluded to variance issues. 1360 01:03:36,310 --> 01:03:39,210 And you mentioned overlap or positivity as well. 1361 01:03:39,210 --> 01:03:42,155 And in fact, we need to make the same assumptions but even 1362 01:03:42,155 --> 01:03:44,030 stronger assumptions for this to be possible. 1363 01:03:46,857 --> 01:03:48,190 These are sufficient conditions. 1364 01:03:48,190 --> 01:03:50,410 So, under very certain circumstances, 1365 01:03:50,410 --> 01:03:53,157 you don't need them. 1366 01:03:53,157 --> 01:03:55,240 I should say, these are fairly general assumptions 1367 01:03:55,240 --> 01:03:56,350 that are still strict-- 1368 01:03:56,350 --> 01:03:58,330 that's how I should put it. 1369 01:03:58,330 --> 01:03:59,830 So last time, we looked at something 1370 01:03:59,830 --> 01:04:00,970 called strong ignorability. 1371 01:04:00,970 --> 01:04:02,450 I realized the text is pretty small in here. 1372 01:04:02,450 --> 01:04:03,410 Can you see in the back? 1373 01:04:03,410 --> 01:04:03,910 Is that OK? 1374 01:04:03,910 --> 01:04:05,140 OK, great. 1375 01:04:05,140 --> 01:04:07,840 So strong ignorability said that the potential outcomes-- 1376 01:04:07,840 --> 01:04:11,080 Y0 and Y1-- are conditionally independent of the treatment, 1377 01:04:11,080 --> 01:04:15,940 t, given the set of variables, x, or the variable, x. 1378 01:04:15,940 --> 01:04:18,940 And that's saying that it doesn't matter if we 1379 01:04:18,940 --> 01:04:20,890 know what treatment was given. 
1380 01:04:20,890 --> 01:04:22,930 We can figure out just based on x 1381 01:04:22,930 --> 01:04:25,270 what would happen under either treatment arm, whether 1382 01:04:25,270 --> 01:04:29,610 we should treat this patient with t equals 0 or t equals 1. 1383 01:04:29,610 --> 01:04:30,840 We had an idea of-- 1384 01:04:30,840 --> 01:04:32,340 or an assumption of-- overlap, which 1385 01:04:32,340 --> 01:04:35,970 says that any treatment could be observed 1386 01:04:35,970 --> 01:04:40,682 in any state or any context, x. 1387 01:04:44,250 --> 01:04:45,660 That's what that means. 1388 01:04:45,660 --> 01:04:48,600 And that is only to ensure that we 1389 01:04:48,600 --> 01:04:50,820 can estimate at least a conditional average treatment 1390 01:04:50,820 --> 01:04:54,210 effect at x. 1391 01:04:54,210 --> 01:04:56,958 And if we want to estimate the average treatment 1392 01:04:56,958 --> 01:04:58,500 effect in a population, we would need 1393 01:04:58,500 --> 01:05:02,617 to have that for every x in that population. 1394 01:05:02,617 --> 01:05:04,200 So what happens in the sequential case 1395 01:05:04,200 --> 01:05:08,070 is that we need even stronger assumptions. 1396 01:05:08,070 --> 01:05:10,160 There's some notation I haven't introduced here 1397 01:05:10,160 --> 01:05:11,230 and I apologize for that. 1398 01:05:11,230 --> 01:05:15,508 But there's a bar here over these Ss and As-- 1399 01:05:15,508 --> 01:05:16,800 I don't know if you can see it. 1400 01:05:16,800 --> 01:05:18,660 That usually indicates in this literature 1401 01:05:18,660 --> 01:05:23,100 that you're looking at the sequence, up to the index here. 1402 01:05:23,100 --> 01:05:27,140 So all the states up until t have been observed 1403 01:05:27,140 --> 01:05:29,267 and all the actions up until t minus 1. 1404 01:05:34,790 --> 01:05:39,522 So in order for the best policy to be identifiable-- 1405 01:05:39,522 --> 01:05:41,480 or the value of a policy to be identifiable-- 1406 01:05:41,480 --> 01:05:43,880 we need this strong condition. 1407 01:05:43,880 --> 01:05:46,397 So the return of a policy is independent 1408 01:05:46,397 --> 01:05:48,230 of the current action, given everything that 1409 01:05:48,230 --> 01:05:49,105 happened in the past. 1410 01:05:54,310 --> 01:05:56,060 This is weaker than the Markov assumption, 1411 01:05:56,060 --> 01:05:58,708 to be clear, because there, we said that anything that happens 1412 01:05:58,708 --> 01:06:00,500 in the future is conditionally independent, 1413 01:06:00,500 --> 01:06:02,430 given the current state. 1414 01:06:02,430 --> 01:06:05,480 So this is weaker, because we now 1415 01:06:05,480 --> 01:06:08,957 just need to observe something in the history. 1416 01:06:08,957 --> 01:06:11,040 We need to observe all confounders in the history, 1417 01:06:11,040 --> 01:06:12,180 in this instance. 1418 01:06:12,180 --> 01:06:14,090 We don't need to summarize them in S. 1419 01:06:14,090 --> 01:06:16,010 And we'll get back to this in the next slide. 1420 01:06:16,010 --> 01:06:18,410 Positivity is the really difficult one, though, 1421 01:06:18,410 --> 01:06:20,780 because what we're saying is that at any point 1422 01:06:20,780 --> 01:06:26,120 in the trajectory, any action should be possible in order 1423 01:06:26,120 --> 01:06:28,670 for us to estimate the value of any possible policy. 1424 01:06:28,670 --> 01:06:31,770 And we know that that's not going to be true in practice.
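Written out with the bar notation just introduced, the two sequential assumptions are roughly the following; this is a reconstruction, and the slide may state the independence in terms of potential returns:

    % Sequential ignorability / randomization: given the history of states and past
    % actions, the return going forward is independent of the action taken now
    G_t \perp A_t \mid \bar{S}_t, \bar{A}_{t-1}

    % Sequential positivity: every action has non-zero probability under the behavior
    % policy, for every history that can occur, every action a, and every time point t
    p(A_t = a \mid \bar{S}_t, \bar{A}_{t-1}) > 0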
1425 01:06:31,770 --> 01:06:33,890 We're not going to consider every possible action 1426 01:06:33,890 --> 01:06:37,820 at every possible point in the health care setting. 1427 01:06:37,820 --> 01:06:39,570 There's just no way. 1428 01:06:39,570 --> 01:06:42,740 So what that tells us is that we can't 1429 01:06:42,740 --> 01:06:45,440 estimate the value of every possible policy. 1430 01:06:45,440 --> 01:06:47,900 We can only estimate the value of policies 1431 01:06:47,900 --> 01:06:54,660 that are consistent with the support that we do have. 1432 01:06:54,660 --> 01:06:58,050 If we never see action 4 at time 3, 1433 01:06:58,050 --> 01:07:01,130 there's no way we can learn about a policy that does that-- 1434 01:07:01,130 --> 01:07:02,880 that takes action 4 at time 3. 1435 01:07:02,880 --> 01:07:04,150 That's what I'm trying to say. 1436 01:07:04,150 --> 01:07:09,950 So in some sense, this is stronger, 1437 01:07:09,950 --> 01:07:13,742 just because of how sequential settings work. 1438 01:07:13,742 --> 01:07:15,950 It's more about the application domain than anything, 1439 01:07:15,950 --> 01:07:17,770 I would say. 1440 01:07:17,770 --> 01:07:19,643 In the next set of slides, we'll focus on 1441 01:07:19,643 --> 01:07:21,810 sequential randomization or sequential ignorability, 1442 01:07:21,810 --> 01:07:23,520 as it's sometimes called. 1443 01:07:23,520 --> 01:07:25,940 And tomorrow, we'll talk a little bit 1444 01:07:25,940 --> 01:07:28,820 about the statistics involved in or resulting 1445 01:07:28,820 --> 01:07:32,180 from the positivity assumption and things 1446 01:07:32,180 --> 01:07:33,740 like importance weighting, et cetera. 1447 01:07:33,740 --> 01:07:34,310 Did I say tomorrow? 1448 01:07:34,310 --> 01:07:35,018 I meant Thursday. 1449 01:07:38,350 --> 01:07:42,490 So last recap on the potential outcome story. 1450 01:07:42,490 --> 01:07:43,330 This is a slide-- 1451 01:07:43,330 --> 01:07:44,788 I'm not sure if he showed this one, 1452 01:07:44,788 --> 01:07:47,470 but it's one that we used in a lot of talks. 1453 01:07:47,470 --> 01:07:50,260 And it, again, just serves to illustrate the idea 1454 01:07:50,260 --> 01:07:52,000 of a one-timestep decision. 1455 01:07:52,000 --> 01:07:53,520 So we have here, Anna. 1456 01:07:53,520 --> 01:07:54,320 A patient comes in. 1457 01:07:54,320 --> 01:07:59,050 She has high blood sugar and some other properties. 1458 01:07:59,050 --> 01:08:01,870 And we're debating whether to give her medication A or B. 1459 01:08:01,870 --> 01:08:03,910 And to do that, we want to figure out 1460 01:08:03,910 --> 01:08:06,970 what would be her blood sugar under these different choices 1461 01:08:06,970 --> 01:08:09,170 a few months down the line? 1462 01:08:09,170 --> 01:08:12,010 So I'm just using this here to introduce you 1463 01:08:12,010 --> 01:08:13,240 to the patient, Anna. 1464 01:08:13,240 --> 01:08:15,750 And we're going to talk about Anna a little bit more. 1465 01:08:15,750 --> 01:08:19,899 So treating Anna once, we can represent as this causal graph 1466 01:08:19,899 --> 01:08:22,410 that you've seen a lot of times now. 1467 01:08:22,410 --> 01:08:24,670 We had some treatment, A, we had some state, S, 1468 01:08:24,670 --> 01:08:27,970 and some outcome, R. We want to figure out the effect of this A 1469 01:08:27,970 --> 01:08:30,040 on the outcome, R. 
1470 01:08:30,040 --> 01:08:31,689 Ignorability in this case just says 1471 01:08:31,689 --> 01:08:36,189 that the potential outcomes under each action, A, 1472 01:08:36,189 --> 01:08:40,930 is conditionally independent of A, given S. 1473 01:08:40,930 --> 01:08:46,090 And so we know that ignorability and overlap is 1474 01:08:46,090 --> 01:08:50,180 sufficient conditions for identification of this effect. 1475 01:08:50,180 --> 01:08:53,370 But what happens now if we add another time point? 1476 01:08:53,370 --> 01:08:56,340 OK, so in this case, if I have no extra arrows here-- 1477 01:08:56,340 --> 01:08:58,470 I just have completely independent time points-- 1478 01:08:58,470 --> 01:09:01,040 ignorability clearly still holds. 1479 01:09:01,040 --> 01:09:04,130 There's no links going from A to R, there's no from S 1480 01:09:04,130 --> 01:09:05,859 to R, et cetera. 1481 01:09:05,859 --> 01:09:07,109 So ignorability is still fine. 1482 01:09:15,260 --> 01:09:19,609 If Anna's health status in the future depends on the actions 1483 01:09:19,609 --> 01:09:26,850 that I take now, here, then the situation 1484 01:09:26,850 --> 01:09:28,060 is a little bit different. 1485 01:09:28,060 --> 01:09:32,640 So this is now not in the completely independent actions 1486 01:09:32,640 --> 01:09:34,710 that I make, but the actions here 1487 01:09:34,710 --> 01:09:36,729 influence the state in the future. 1488 01:09:36,729 --> 01:09:38,160 So we've seen this. 1489 01:09:38,160 --> 01:09:42,210 This is a Markov decision process, as you've seen before. 1490 01:09:42,210 --> 01:09:45,240 This is very likely in practice. 1491 01:09:45,240 --> 01:09:47,460 Also, if Anna, for example, is diabetic, 1492 01:09:47,460 --> 01:09:49,620 as we saw in the example that I mentioned, 1493 01:09:49,620 --> 01:09:52,890 it's likely that she will remain so. 1494 01:09:52,890 --> 01:09:56,290 This previous state will influence the future state. 1495 01:09:56,290 --> 01:09:58,630 These things seem very reasonable, right? 1496 01:09:58,630 --> 01:10:01,840 But now I'm trying to argue about the sequential 1497 01:10:01,840 --> 01:10:03,010 ignorability assumption. 1498 01:10:03,010 --> 01:10:04,210 How can we break that? 1499 01:10:04,210 --> 01:10:05,980 How can we break ignorability when it 1500 01:10:05,980 --> 01:10:07,390 comes to the sequential, say? 1501 01:10:15,110 --> 01:10:17,700 If you have an action here-- 1502 01:10:17,700 --> 01:10:21,415 so the outcome at a later point depends on an earlier choice. 1503 01:10:21,415 --> 01:10:22,790 That might certainly be the case, 1504 01:10:22,790 --> 01:10:25,320 because we could have a delayed effect of something. 1505 01:10:25,320 --> 01:10:28,460 So if we measure, say, a lab value which 1506 01:10:28,460 --> 01:10:31,220 could be in the right range or not, 1507 01:10:31,220 --> 01:10:32,960 it could very well depend on medication 1508 01:10:32,960 --> 01:10:36,140 we gave a long time ago. 1509 01:10:36,140 --> 01:10:37,640 And it's also likely that the reward 1510 01:10:37,640 --> 01:10:41,840 could depend on a state which is much earlier, depending 1511 01:10:41,840 --> 01:10:44,150 on what we include in that state variable. 1512 01:10:44,150 --> 01:10:46,400 We already have an example, I think, from the audience 1513 01:10:46,400 --> 01:10:47,147 on that. 1514 01:10:47,147 --> 01:10:49,730 So actually, ignorability should have a big red cross over it, 1515 01:10:49,730 --> 01:10:50,980 because it doesn't hold there. 
1516 01:10:50,980 --> 01:10:53,373 And it's luckily on the next slide. 1517 01:10:53,373 --> 01:10:54,790 Because there are even more arrows 1518 01:10:54,790 --> 01:10:58,345 that we can have, conceivably, in the medical setting. 1519 01:10:58,345 --> 01:10:59,720 The example that we got from Pete 1520 01:10:59,720 --> 01:11:01,303 before was, essentially, that if we've 1521 01:11:01,303 --> 01:11:04,870 tried an action previously, we might not want to try it again. 1522 01:11:04,870 --> 01:11:07,082 Or if we knew that something worked previously, 1523 01:11:07,082 --> 01:11:08,290 we might want to do it again. 1524 01:11:08,290 --> 01:11:09,790 So if we had a good reward here, we 1525 01:11:09,790 --> 01:11:12,810 might want to do the same thing twice. 1526 01:11:12,810 --> 01:11:15,370 And this arrow here says that if we 1527 01:11:15,370 --> 01:11:18,940 know that a patient had a symptom earlier on, 1528 01:11:18,940 --> 01:11:21,190 we might want to base our actions on it later. 1529 01:11:21,190 --> 01:11:23,885 We know that the patient had an allergic reaction 1530 01:11:23,885 --> 01:11:25,010 at some point, for example. 1531 01:11:25,010 --> 01:11:27,958 We might not want to use that medication at a later time. 1532 01:11:27,958 --> 01:11:30,250 AUDIENCE: But you can always put everything in a state. 1533 01:11:30,250 --> 01:11:31,705 FREDRIK D. JOHANSSON: Exactly. 1534 01:11:31,705 --> 01:11:33,580 So this depends on what you put in the state. 1535 01:11:36,970 --> 01:11:38,980 This is an example where I should introduce 1536 01:11:38,980 --> 01:11:41,740 these arrows to show that, if I haven't 1537 01:11:41,740 --> 01:11:46,030 got that information here, then I introduce this dependence. 1538 01:11:46,030 --> 01:11:48,040 So if I don't have the information 1539 01:11:48,040 --> 01:11:54,660 about the allergic reaction or some symptom before in here, 1540 01:11:54,660 --> 01:11:58,220 then I have to do something else. 1541 01:11:58,220 --> 01:12:00,400 So exactly that is the point. 1542 01:12:00,400 --> 01:12:03,410 If I can summarize history in some good way-- 1543 01:12:03,410 --> 01:12:08,570 if I can compress all of these four variables 1544 01:12:08,570 --> 01:12:12,248 into some variable, H, that stands for the history, 1545 01:12:12,248 --> 01:12:14,540 then I have ignorability, with respect to that history, 1546 01:12:14,540 --> 01:12:23,030 H. This is one solution, and it introduces a new problem, 1547 01:12:23,030 --> 01:12:28,720 because history is usually a really large thing. 1548 01:12:28,720 --> 01:12:30,720 We know that history grows with time, obviously. 1549 01:12:30,720 --> 01:12:32,660 But usually we don't observe patients 1550 01:12:32,660 --> 01:12:34,260 for the same number of time points. 1551 01:12:34,260 --> 01:12:36,710 So how do we represent that for a program? 1552 01:12:36,710 --> 01:12:38,850 How do we represent that to a learning algorithm? 1553 01:12:38,850 --> 01:12:41,270 That's something we have to deal with. 1554 01:12:41,270 --> 01:12:43,260 You can pad history with 0s, et cetera, 1555 01:12:43,260 --> 01:12:45,050 but if you keep every timestep and repeat 1556 01:12:45,050 --> 01:12:46,800 every variable in every timestep, 1557 01:12:46,800 --> 01:12:49,070 you get a very large object. 1558 01:12:49,070 --> 01:12:51,050 That might introduce statistical problems, 1559 01:12:51,050 --> 01:12:52,967 because now you have much more variance if you 1560 01:12:52,967 --> 01:12:54,900 have new variables, et cetera.
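As a concrete illustration of the zero-padding option just mentioned, and of the fixed-length window that comes up next, here is a minimal sketch; the maximum horizon and the per-visit feature vectors are illustrative assumptions:

    import numpy as np

    def history_features(visits, t_max=20, window=None):
        # visits: list of per-timestep feature vectors observed so far (at least one visit).
        # Keep only the last `window` steps if a window is given, else the whole history.
        steps = visits[-window:] if window else visits
        horizon = window if window else t_max
        d = len(steps[0])                          # number of variables per timestep
        out = np.zeros(horizon * d)                # one slot per (timestep, variable) pair
        flat = np.concatenate(steps)[-horizon * d:]
        out[-len(flat):] = flat                    # right-align so the most recent visit is last
        return out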
1561 01:12:54,900 --> 01:12:57,330 So one thing that people do is that they 1562 01:12:57,330 --> 01:12:59,595 look some amount of time backwards-- so 1563 01:12:59,595 --> 01:13:01,470 instead of just looking at one timestep back, 1564 01:13:01,470 --> 01:13:03,330 you now look at a length-k window. 1565 01:13:03,330 --> 01:13:08,120 And your state essentially grows by a factor of k. 1566 01:13:08,120 --> 01:13:10,050 Another alternative is to try and learn 1567 01:13:10,050 --> 01:13:11,100 a summary function-- 1568 01:13:11,100 --> 01:13:13,290 learn some function that is relevant for predicting 1569 01:13:13,290 --> 01:13:16,050 the outcome, that takes all of the history into account 1570 01:13:16,050 --> 01:13:20,490 but has a smaller representation than just t times the variables 1571 01:13:20,490 --> 01:13:22,250 that you have. 1572 01:13:22,250 --> 01:13:25,340 But this is something that needs to happen, usually. 1573 01:13:29,070 --> 01:13:30,570 With most health care data, in practice, 1574 01:13:30,570 --> 01:13:31,790 you have to make choices about this. 1575 01:13:31,790 --> 01:13:33,740 I just want to stress that that's something you really 1576 01:13:33,740 --> 01:13:34,250 can't avoid. 1577 01:13:37,960 --> 01:13:40,460 The last point I want to make is that unobserved confounding 1578 01:13:40,460 --> 01:13:44,480 is also a problem that is not avoidable just 1579 01:13:44,480 --> 01:13:46,850 by summarizing history. 1580 01:13:46,850 --> 01:13:48,890 We can introduce new confounding-- 1581 01:13:48,890 --> 01:13:51,230 that is a problem if we don't summarize history well-- 1582 01:13:51,230 --> 01:13:53,212 but we can also have unobserved confounders, 1583 01:13:53,212 --> 01:13:54,920 just like we can in the one-step setting. 1584 01:13:58,220 --> 01:14:01,460 One example is if we have an unobserved confounder 1585 01:14:01,460 --> 01:14:03,390 in the same way as we did before: 1586 01:14:03,390 --> 01:14:08,300 it impacts both the action at time 1 1587 01:14:08,300 --> 01:14:09,590 and the reward at time 1. 1588 01:14:09,590 --> 01:14:11,150 But of course, now we're in the sequential setting, 1589 01:14:11,150 --> 01:14:13,525 so the confounding structure could be much more complicated. 1590 01:14:13,525 --> 01:14:17,000 We could have a confounder that influences an early action 1591 01:14:17,000 --> 01:14:18,398 and a late reward. 1592 01:14:18,398 --> 01:14:19,940 So it might be a little harder for us 1593 01:14:19,940 --> 01:14:24,083 to characterize the set of potential confounders. 1594 01:14:24,083 --> 01:14:25,500 So I just wanted to point that out 1595 01:14:25,500 --> 01:14:27,870 and to reinforce that this is only harder 1596 01:14:27,870 --> 01:14:29,520 than the one-step setting. 1597 01:14:29,520 --> 01:14:30,900 So we're wrapping up now. 1598 01:14:30,900 --> 01:14:37,050 I just want to end on a point about the games 1599 01:14:37,050 --> 01:14:38,820 that we looked at before. 1600 01:14:38,820 --> 01:14:43,170 One of the big reasons that these algorithms were 1601 01:14:43,170 --> 01:14:45,390 so successful in playing games was 1602 01:14:45,390 --> 01:14:48,030 that we have full observability in these settings. 1603 01:14:48,030 --> 01:14:53,520 We know everything from the game board itself-- 1604 01:14:53,520 --> 01:14:55,470 when it comes to Go, at least. 1605 01:14:55,470 --> 01:14:57,650 We can debate that when it comes to the video games. 1606 01:14:57,650 --> 01:15:01,650 But in Go, we have complete observability of the board. 
1607 01:15:01,650 --> 01:15:04,320 Everything we need to know for an optimal decision 1608 01:15:04,320 --> 01:15:05,970 is there at any time point. 1609 01:15:09,712 --> 01:15:11,670 Not only can we observe it through the history, 1610 01:15:11,670 --> 01:15:14,980 but in the case of Go, you don't even need to look at history. 1611 01:15:14,980 --> 01:15:17,430 We certainly have Markov dynamics with respect 1612 01:15:17,430 --> 01:15:18,810 to the board itself. 1613 01:15:18,810 --> 01:15:22,140 You don't ever have to remember what a move was earlier on, 1614 01:15:22,140 --> 01:15:24,390 unless you want to read into your opponent, I suppose. 1615 01:15:24,390 --> 01:15:27,420 But that's a game-theoretic notion 1616 01:15:27,420 --> 01:15:29,890 we're not going to get into here. 1617 01:15:29,890 --> 01:15:33,450 But more importantly, we can explore the dynamics 1618 01:15:33,450 --> 01:15:35,190 of these systems almost limitlessly, 1619 01:15:35,190 --> 01:15:38,040 just by simulation and self-play. 1620 01:15:38,040 --> 01:15:40,415 And that's true regardless of whether you have full observability 1621 01:15:40,415 --> 01:15:42,123 or not-- like in StarCraft, you might not 1622 01:15:42,123 --> 01:15:43,170 have full observability. 1623 01:15:43,170 --> 01:15:46,780 But you can try things out endlessly. 1624 01:15:46,780 --> 01:15:49,390 And contrast that with having, I don't know, 1625 01:15:49,390 --> 01:15:52,750 700 patients with rheumatoid arthritis or something 1626 01:15:52,750 --> 01:15:53,420 like that. 1627 01:15:53,420 --> 01:15:56,100 Those are the samples you have. 1628 01:15:56,100 --> 01:15:59,270 You're not going to get new ones. 1629 01:15:59,270 --> 01:16:02,150 So that is an enormous obstacle for us 1630 01:16:02,150 --> 01:16:04,780 to overcome if we want to do this in a good way. 1631 01:16:04,780 --> 01:16:07,640 The current algorithms are really 1632 01:16:07,640 --> 01:16:09,980 inefficient with the data that they use. 1633 01:16:09,980 --> 01:16:14,150 And that's why this limitless exploration or simulation 1634 01:16:14,150 --> 01:16:17,720 has been so important for these games. 1635 01:16:17,720 --> 01:16:19,310 And that's also why the games are 1636 01:16:19,310 --> 01:16:21,716 the success stories of this field. 1637 01:16:21,716 --> 01:16:24,690 A last point is that, typically, for the settings 1638 01:16:24,690 --> 01:16:28,390 that I put here, we have no noise, essentially. 1639 01:16:28,390 --> 01:16:31,650 We get perfect observations of actions and states 1640 01:16:31,650 --> 01:16:33,370 and outcomes and everything like that. 1641 01:16:33,370 --> 01:16:36,010 And that's rarely true in any real-world application. 1642 01:16:36,010 --> 01:16:36,510 All right. 1643 01:16:36,510 --> 01:16:37,590 I'm going to wrap up. 1644 01:16:37,590 --> 01:16:42,960 Tomorrow-- nope, Thursday-- David is 1645 01:16:42,960 --> 01:16:45,612 going to talk more explicitly 1646 01:16:45,612 --> 01:16:47,820 about what it will take if we want to do this 1647 01:16:47,820 --> 01:16:48,230 properly in health care. 1648 01:16:48,230 --> 01:16:50,760 We're going to have a great discussion, I'm sure, as well. 1649 01:16:50,760 --> 01:16:53,190 So don't mind the slide. 1650 01:16:53,190 --> 01:16:53,930 It's Thursday. 1651 01:16:53,930 --> 01:16:54,430 All right. 1652 01:16:54,430 --> 01:16:55,350 Thanks a lot. 1653 01:16:55,350 --> 01:16:59,900 [APPLAUSE]