PROFESSOR: OK, so the last topic for the class is interpretability. As you know, modern machine learning models are justifiably reputed to be very difficult to understand. So if I give you something like the GPT-2 model, which we talked about in natural language processing, and I tell you that it has 1.5 billion parameters, and then you say, why is it working? Clearly the answer is not because these particular parameters have these particular values. There is no way to understand that. And so the topic today is something that we raised a little bit in the lecture on fairness, where one of the issues there was also that if you can't understand the model, you can't tell if the model has baked-in prejudices by examining it. And so today we're going to look at different methods that people have developed to try to overcome this problem of inscrutable models.

So there is a very interesting bit of history. How many of you know of George Miller's 7 plus or minus 2 result? Only a few. So Miller was a psychologist at Harvard, I think, in the 1950s. And he wrote this paper in 1956 called "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information." It's quite an interesting paper. So he started off with something that I had forgotten. I read this paper many, many years ago. And I'd forgotten that he starts off with the question of how many different things can you sense? How many different levels of things can you sense? So if I put headphones on you and I ask you to tell me, on a scale of 1 to n, how loud is the sound that I'm playing in your headphones, it turns out people get confused when you get beyond about five, six, seven different levels of intensity. And similarly, if I give you a bunch of colors and I ask you to tell me where the boundaries are between different colors, people seem to come up with 7 plus or minus 2 as the number of colors that they can distinguish. And so there is a long psychological literature on this.
And then Miller went on to do experiments where he asked people to memorize lists of things. And what he discovered is, again, that you could memorize a list of about 7 plus or minus 2 things. And beyond that, you couldn't remember the list anymore. So this tells us something about the cognitive capacity of the human mind. And it suggests that if I give you an explanation that has 20 things in it, you're unlikely to be able to fathom it, because you can't keep all the moving parts in your mind at one time. Now, it's a tricky result, because he does point out, even in 1956, that if you chunk things into bigger chunks, you can remember seven of those, even if they're much bigger. And so people who are very good at memorizing things, for example, make up patterns. And they remember those patterns, which then allow them to actually remember more primitive objects. So, you know, we still don't really understand how memory works. But this is just an interesting observation, and I think it plays into the question of how do you explain things in a complicated model? Because it suggests that you can't explain too many different things, because people won't understand what you're talking about.

OK. So what leads to complex models? Well, as I say, overfitting certainly leads to complex models. I remember in the 1970s, when we started working on expert systems in healthcare, I made a very bad faux pas. I went to the first joint conference between statisticians and artificial intelligence researchers. And the statisticians were all about understanding the variance and understanding statistical significance and so on. And I was all about trying to model details of what was going on in an individual patient. And in some discussion after my talk, somebody challenged me.
And I said, well, what we AI people are really doing is fitting what you guys think is the noise, because we're trying to make a lot more detailed refinements in our theories and our models than what the typical statistical model does. And of course, I was roundly booed out of the hall. And people shunned me for the rest of the conference because I had done something really stupid by admitting that I was fitting noise. And of course, I didn't really believe that I was fitting noise. I believed that what I was fitting was what the average statistician just chalks up to noise. And we're interested in more details of the mechanisms.

So overfitting we have a pretty good handle on through regularization. You've seen lots of examples of regularization throughout the course. And people keep coming up with interesting ideas for how to apply regularization in order to simplify models or make them fit some preconception of what the model ought to look like before you start learning it from data. But the problem is that there really is true complexity to these models, whether or not you're fitting noise. The world is a complicated place. Human beings were not designed. They evolved. And so there's all kinds of bizarre stuff left over from our evolutionary heritage. And so it is just complex. It's hard to understand in a simple way how to make predictions that are useful when the world really is complex.

So what do we do in order to try to deal with this? Well, one approach is to make up what I call just-so stories that give a simplified explanation of how a complicated thing actually works. So how many of you have read these stories when you were a kid? Nobody? My God. OK. Must be a generational thing. So Rudyard Kipling was a famous author.
And he wrote this series of just-so stories, things like How the Lion Got His Mane and How the Camel Got His Hump and so on. And of course, they're all total bull, right? I mean, it's not a Darwinian evolutionary explanation of why male lions have manes. It's just some made-up story. But they're really cute stories. And I enjoyed them as a kid. And maybe you would have, too, if your parents had read them to you.

So I use this as a kind of pejorative, because what the people who follow this line of investigation do is they take some very complicated model. They make a local approximation to it that says, this is not an approximation to the entire model, but it's an approximation to the model in the vicinity of a particular case. And then they explain that simplified model. And I'll show you some examples of that through the lecture today.

And the other approach, which I'll also show you some examples of, is that you simply trade off somewhat lower performance for a model that's simple enough to be able to explain. So things like decision trees and logistic regression and so on typically don't perform quite as well as the best, most sophisticated models, although you've seen plenty of examples in this class where, in fact, they do perform quite well and where they're not outperformed by the fancy models. But in general, you can do a little better by tweaking a fancy model. But then it becomes incomprehensible. And so people are willing to say, OK, I'm going to give up 1% or 2% in performance in order to have a model that I can really understand. And the reason it makes sense is because these models are not self-executing. They're typically used as advice for some human being who makes the ultimate decisions. Your surgeon is not going to look at one of these models that says, take out the guy's left kidney, and say, OK, I guess. They're going to go, well, does that make sense?
And in order to answer the question of, does that make sense, it really helps to know what the model's recommendation is based on. What is its internal logic? And so even an approximation to that is useful.

So there's the need for trust for clinical adoption of ML models. There are two approaches in this paper that I'm going to talk about, where they say, OK, what you'd like to do is to look at case-specific predictions. So there is a particular patient in a particular state, and you want to understand what the model is saying about that patient. And then you also want to have confidence in the model overall. And so you'd like to be able to have an explanatory capability that says, here are some interesting representative cases, and here's how the model views them. Look through them and decide whether you agree with the approach that this model is taking.

Now, remember my critique of randomized controlled trials: people do these trials, they choose the simplest cases, the smallest number of patients that they need in order to reach statistical significance, the shortest amount of follow-up time, et cetera. And then the results of those trials are applied to very different populations. So David talked about the cohort shift as a generalization of that idea. But the same thing happens in these machine learning models that you train on some set of data. The typical publication will then test on some held-out subset of the same data. But that's not a very accurate representation of the real world. If you then try to apply that model to data from a totally different source, the chances are you will have specialized it in some way that you don't appreciate. And the results that you get are not as good as what you got on the held-out test data, because it's more heterogeneous.
I think I mentioned that Jeff Drazen, the editor-in-chief of the New England Journal, had a meeting about a year ago in which he was arguing that the journal shouldn't ever publish a research study unless it's been validated on two independent data sets, because he's tired of publishing studies that wind up getting retracted, not because of any overt badness on the part of the investigators. They've done exactly the kinds of things that you've learned how to do in this class. But when they go to apply that model to a different population, it just doesn't work nearly as well as it did in the published version. And of course, there are all the publication bias issues: if 50 of us do the same experiment, then by random chance some of us are going to get better results than others. And those are the ones that are going to get published, because the people who got poor results don't have anything interesting to report. And so there's that whole issue of publication bias, which is another serious one.

OK. So I wanted to just spend a minute to say, you know, explanation is not a new idea. So in the expert systems era that we talked about a little bit in one of our earlier classes, we talked about the idea that we would take human medical experts and debrief them about what they knew, and then try to encode that in patterns or in rules or in various ways in a computer program in order to reproduce their behavior. So Mycin was one of those programs, [INAUDIBLE] PhD thesis, in 1975. And they published this nice paper that was about the explanation and rule acquisition capabilities of the Mycin system. And as an illustration, they gave some examples of what you could do with the system. So rules, they argued, are quite understandable, because they say if a bunch of conditions hold, then you can draw the following conclusion.
So given that, when the program comes back and says, in light of the site from which the culture was obtained and the method of collection, do you feel that a significant number of organism 1 were obtained? In other words, if you took a sample from somebody's body and you're looking for an infection, do you think you got enough organisms in that sample? And the user says, well, why are you asking me this question? And the answer, in terms of the rules that the system works by, is pretty good. It says it's important to find out whether there's therapeutically significant disease associated with this occurrence of organism 1. We've already established that the culture is not one of those that are normally sterile and that the method of collection is sterile. Therefore, if the organism has been observed in significant numbers, then there's strongly suggestive evidence that there's therapeutically significant disease associated with this occurrence of the organism. So if you find bugs in a carefully collected sample, and there were enough bugs there, then that suggests that you probably ought to treat this patient. And there's also strongly suggestive evidence that the organism is not a contaminant, because the collection method was sterile.

And you can go on with this and you can say, well, why that? So why that question? And it traces back through its evaluation of these rules, and it says, well, in order to find out the locus of infection, it's already been established that the site of the culture is known and the number of days since the specimen was obtained is less than 7. Therefore, there is therapeutically significant disease associated with this occurrence of the organism. So there's some rule that says if you've got bugs and it happened within the last seven days, the patient probably really does have an infection. And I mean, I've got a lot of examples of this. But you can keep asking why.
You know, this is the two-year-old: but why, daddy? But why? But why? Well, why is it important to find out a locus of infection? And, well, there's a reason, which is that there is a rule that will conclude, for example, that the abdomen is a locus of infection, or the pelvis is a locus of infection of the patient, if you satisfy these criteria. And so this is a kind of rudimentary explanation that comes directly out of the fact that these are rule-based systems, and so you can just play back the rules.

One of the things I like is you can also ask freeform questions. In 1975, natural language processing was not so good, and so this worked about one time in five. But you could walk up to it and type some question. For example, do you ever prescribe carbenicillin for pseudomonas infections? And it says, well, there are three rules in my database of rules that would conclude something relevant to that question. So which one do you want to see? And if you say, I want to see rule 64, it says, well, that rule says if it's known with certainty that the organism is a pseudomonas and the drug under consideration is gentamicin, then a more appropriate therapy would be a combination of gentamicin and carbenicillin. Again, this is medical knowledge as of 1975. But my guess is the real underlying reason is that there probably were pseudomonas that were resistant to gentamicin by that point, and so they used a combination therapy. Now, notice, by the way, that this explanation capability does not tell you that, right? Because it doesn't actually understand the rationale behind these individual rules. And at the time there was also research, for example by one of my students, on how to do a better job of that by encoding not only the rules or the patterns, but also the rationale behind them, so that the explanations could be more sensible. OK.
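[The mechanism behind that dialogue is simple enough to sketch: because the knowledge base is a set of if/then rules, answering a WHY question just means playing back the rule whose premises are currently being evaluated. The following is a minimal illustration, not Mycin's actual code; the rule name is hypothetical and its text is paraphrased from the dialogue above.]

```python
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    conditions: list          # premises, stated in plain language
    conclusion: str

# A paraphrase of the rule behind the "significant numbers" question; the name is made up.
SIGNIFICANCE_RULE = Rule(
    name="RULE-SIG",
    conditions=[
        "the culture is not one of those that are normally sterile",
        "the method of collection is sterile",
        "the organism has been observed in significant numbers",   # the condition being asked about
    ],
    conclusion=("there is strongly suggestive evidence of therapeutically significant "
                "disease associated with this occurrence of the organism"),
)

def why(rule: Rule, pending: str) -> str:
    """Answer a user's WHY by playing the current rule back in if/then form."""
    established = [c for c in rule.conditions if c != pending]
    return ("This is important in order to determine whether " + rule.conclusion + ".\n"
            "It has already been established that " + " and ".join(established) + ".\n"
            "Therefore, if " + pending + ", then " + rule.conclusion + " [" + rule.name + "].")

print(why(SIGNIFICANCE_RULE, "the organism has been observed in significant numbers"))
```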
Well, the granddaddy of the standard just-so story approach to explanation of complex models today comes from this paper and a system called LIME: Local Interpretable Model-agnostic Explanations. And just to give you an illustration, you have some complicated model, and it's trying to explain why the doctor or the human being made a certain decision, or why the model made a certain decision. And so it says, well, here are the data we have about the patient. We know that the patient is sneezing. And we know their weight and their headache and their age and the fact that they have no fatigue. And so the explainer says, well, why did the model decide this patient has the flu? Well, positives are sneeze and headache, and a negative is no fatigue. So it goes into this complicated model and it says, well, I can't explain all the numerology that happens in that neural network or Bayesian network or whatever network it's using. But I can specify that it looks like these are the most important positive and negative contributors. Yeah?

AUDIENCE: Is this for notes only, or is it for all types of data?

PROFESSOR: I'll show you some other kinds of data in a minute. I think they originally worked it out for notes, but it was also used for images and other kinds of data as well. OK.

And the argument they make is that this approach also helps to detect data leakage. For example, in one of their experiments, the headers of the data had information in them that correlated highly with the result. I can't remember if it was these guys, but somebody was assigning study IDs to each case. And they did it in a stupid way, so that all the small numbers corresponded to people who had the disease and the big numbers corresponded to the people who didn't. And of course, the most parsimonious predictive model just used the ID number and said, OK, I got it.
So this would help you identify that, because if you see that the best predictor is the ID number, then you would say, hmm, there's something a little fishy going on here.

Well, so here's an example where this kind of capability is very useful. This was from a newsgroup, and they were trying to decide whether a post was about Christianity or atheism. Now, look at these two models. So there's algorithm 1 and algorithm 2, or model 1 and model 2. And when you explain a particular case using model 1, it says, well, the words that I consider important are "God," "mean," "anyone," "this," "Koresh," and "through." Does anybody remember who David Koresh was? He was some cult leader who, I can't remember if he killed a bunch of people or bad things happened. Oh, I think he was the guy in Waco, Texas, where the FBI and the ATF went in and set their place on fire and a whole bunch of people died. So the prediction in this case is atheism. And you notice that "God" and "Koresh" and "mean" are negatives, and "anyone," "this," and "through" are positives. And you go, I don't know, is that good?

But then you look at algorithm 2 and you say, this also made the correct prediction, which is that this particular article is about atheism. But the positives were the words "by" and "in," not terribly specific. And the negatives were things like "NNTP." You know what that is? That's the Network News Transfer Protocol. It's some technical thing, along with "posting" and "host." So this is probably metadata that got into the header of the articles or something. So it happened that in this case, algorithm 2 turned out to be more accurate than algorithm 1 on their held-out test data, but not for any good reason. And so the explanation capability allows you to clue in on the fact that even though this thing is getting the right answers, it's not for sensible reasons. OK.

So what would you like from an explanation?
Well, they say you'd like it to be interpretable. So it should provide qualitative understanding of the relationship between the input variables and the response. But they also say that that's going to depend on the audience. It requires sparsity, for the George Miller argument that I was making before: you can't keep too many things in mind. And the features themselves that you're explaining must make sense. So for example, if I say, well, the reason it decided that is because the eigenvector for the first principal component was the following, that's not going to mean much to most people.

And then they also say, well, it ought to have local fidelity. So it must correspond to how the model behaves in the vicinity of the particular instance that you're trying to explain. And their third criterion, which I think is a little iffier, is that it must be model-agnostic. In other words, you can't take advantage of anything you know that is specific about the structure of the model, the way you trained it, anything like that. It has to be a general-purpose explainer that works on any kind of complicated model. Yeah?

AUDIENCE: What is the reasoning for that?

PROFESSOR: I think their reasoning for why they insist on this is because they don't want to have to write a separate explainer for each possible model. So it's much more efficient if you can get this done. But I actually question whether this is always a good idea or not. But nevertheless, this is one of their assumptions. OK.

So here's the setup that they use. They say, all right, x is a vector in some D-dimensional space that defines your original data. And what we're going to do in order to make the data explainable (in order to make the data, not the model, explainable) is we're going to define a new set of variables, x prime, that are all binary and that live in some space of dimension D prime that is probably lower than D.
So we're simplifying the data that we're going to explain about this model. Then they say, OK, we're going to build an explanation model, g, where g is drawn from a class of interpretable models. So what's an interpretable model? Well, they don't tell you, but they say examples might be linear models, additive scores, decision trees, falling rule lists, which we'll see later in the lecture. And the domain of this is the simplified input data, the binary variables in D prime dimensions. And the model complexity is going to be some measure like the depth of the decision tree, the number of non-zero weights in the logistic regression, the number of clauses in a falling rule list, et cetera. So it's some complexity measure, and you want to minimize complexity.

So then they say, all right, the real model, the hairy, complicated, full-bore model, is f. And that maps the original data space into some probability. And for example, for classification, f is the probability that x belongs to a certain class. And then they also need a proximity measure. So they need to say, we have to have a way of comparing two cases and saying how close they are to each other. And the reason for that is because, remember, they're going to give you an explanation of a particular case, and the most relevant things that will help with that explanation are the ones that are near it in this high-dimensional input space.

So they then define their loss function based on the actual decision algorithm, based on the simplified one, and based on the proximity measure. And they say, well, the best explanation is that g which minimizes this loss function plus the complexity of g. Pretty straightforward. So that's our best model.

Now, the clever idea here is to say, instead of using all of the data that we started with, what we're going to do is to sample the data so that we take more sample points near the point we're interested in explaining.
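[In symbols, the setup just described matches the objective in the LIME paper: the explanation for an instance \(x\) is

\[
\xi(x) \;=\; \arg\min_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g),
\qquad
\mathcal{L}(f, g, \pi_x) \;=\; \sum_{z,\,z'} \pi_x(z)\,\bigl(f(z) - g(z')\bigr)^2,
\]

where \(f\) is the complicated model, \(g\) ranges over the interpretable class \(G\), \(\Omega(g)\) is the complexity measure, and \(\pi_x\) is the proximity kernel around \(x\); the paper uses an exponential kernel \(\pi_x(z) = \exp(-D(x,z)^2/\sigma^2)\) over some distance \(D\). The sampling described next is how that weighted loss gets estimated in practice.]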
We're going to sample in the simplified space that is explainable, and then we'll build that g model, the explanatory model, from that sample of data, where we weight by that proximity function, so the things that are closer will have a larger influence on the model that we learn. And then for each sample we recover, sort of, the closest point in the original representation, and we can calculate what its answer should be. And that becomes the label for that point. And so now we train a simple model to predict the label that the complicated model would have predicted for the point that we've sampled. Yeah?

AUDIENCE: So the proximity measure is [INAUDIBLE]?

PROFESSOR: It's a distance function of some sort. And I'll say more about it in a minute, because one of the critiques of this particular method has to do with how you choose that distance function. But it's basically a similarity.

So here's a nice graphical explanation of what's going on. Suppose that the actual model's decision boundary is between the blue and the pink regions. OK, so it's this god-awful, hairy, complicated decision model. And we're trying to explain why this big red plus wound up in the pink rather than in the blue. So the approach that they take is to say, well, let's sample a bunch of points, weighted by shortest distance. So we do sample a few points out here, but mostly we're sampling points near the point that we're interested in. We then learn a linear boundary between the positive and the negative cases. And that boundary is an approximation to the actual boundary in the more complicated decision model. So now we can give an explanation just like you saw before, which says, well, this is some D prime dimensional space. And so which variables in that D prime dimensional space are the ones that influence where you are on one side or another of this newly computed decision boundary, and to what extent? And that becomes the explanation.
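[Putting those pieces together, here is a small, self-contained sketch of the procedure: perturb the instance in the simplified binary space, label the perturbations with the complicated model, weight them by proximity, and fit a simple linear surrogate. It is illustrative only: the function and parameter names are mine, the kernel and the ridge-plus-top-k step are simplifications of the paper's sparse fitting, and f is assumed here to accept the binary representation directly.]

```python
import numpy as np
from sklearn.linear_model import Ridge

def explain_instance(f, x_binary, num_samples=5000, kernel_width=0.75, top_k=5, seed=0):
    """LIME-style local surrogate: an illustrative sketch, not the authors' implementation.

    f        : callable taking an (n, d') array of 0/1 vectors and returning n predictions
               (assumed to work directly on the simplified representation)
    x_binary : the instance to explain, as a 0/1 vector of length d'
    """
    rng = np.random.default_rng(seed)
    x_binary = np.asarray(x_binary)
    d = x_binary.size

    # 1. Perturb the instance by randomly switching some of its features off.
    mask = rng.integers(0, 2, size=(num_samples, d))
    Z = x_binary * mask
    Z[0] = x_binary                      # keep the unperturbed instance in the sample

    # 2. Proximity weights: an exponential kernel over distance to the instance.
    dist = np.linalg.norm(Z - x_binary, axis=1) / np.sqrt(d)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)

    # 3. Labels come from the complicated model, not from ground truth.
    y = np.asarray(f(Z), dtype=float)

    # 4. Fit a simple weighted linear model; its coefficients are the explanation.
    g = Ridge(alpha=1.0)
    g.fit(Z, y, sample_weight=weights)

    # 5. Report the k features with the largest absolute weight.
    top = np.argsort(-np.abs(g.coef_))[:top_k]
    return [(int(j), float(g.coef_[j])) for j in top]
```

[Weighting the regression by the proximity kernel is what makes the surrogate locally faithful rather than a global approximation.]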
OK? Nice idea. So if you apply this to text classification... yes?

AUDIENCE: I was just going to ask, isn't there a worry that the explanation is just fictitious? Like, we can understand it, but is there reason to believe it, that that's really the true nature of things, that it's linear? You know, it would be like, OK, we know what's going on here. But is that even close to reality?

PROFESSOR: Well, that's why I called it a just-so story, right? Should you believe it? Well, the engineering disciplines have a very long history of approximating extremely complicated phenomena with linear models. Right? I mean, I'm in a department of electrical engineering and computer science. And if I talk to my electrical engineering colleagues, they know that the world is insanely complicated. Nevertheless, most models in electrical engineering are linear models. And they work well enough that people are able to build really complicated things and have them work. So that's not a proof. That's an argument by history or something. But it's true. Linear models are very powerful, especially when you limit them to giving explanations that are local. Notice that this model is a very poor approximation to this decision boundary or this one, right? And so it only works to explain in the neighborhood of the particular example that I've chosen. Right? But it does work OK there. Yeah.

AUDIENCE: [INAUDIBLE] very well there? [INAUDIBLE] middle of the red space then the--

PROFESSOR: Well, they did. So they sample all over the place. But remember that that proximity function says that this one is less relevant to predicting that decision boundary, because it's far away from the point that I'm interested in. So that's the magic.

AUDIENCE: But here they're trying to explain the deep red cross, right?
PROFESSOR: Yes.

AUDIENCE: And they picked some point in the middle of the red space, maybe. Then all the nearby ones would be red and [INAUDIBLE].

PROFESSOR: Well, but they would-- I mean, suppose they picked this point instead. Then they would sample around this point, and presumably they would find this decision boundary or this one or something like that, and still be able to come up with a coherent explanation.

OK, so in the case of text, you've seen this example already. It's pretty simple. For their proximity function, they use cosine distance. So it's a bag-of-words model, and they just calculate cosine distance between different examples by how much overlap there is between the words that they use and the frequency of the words that they use. And then they choose k, the number of words to show, just as a preference. So it's sort of a hyperparameter. They say, you know, I'm interested in looking at the top five words or the top 10 words that are either positively or negatively an influence on the decision, but not the top 10,000 words, because I don't know what to do with 10,000 words.

Now, what's interesting is you can also then apply the same idea to image interpretation. So here is a dog playing a guitar. And they say, how do we interpret this? And so this is one of these labeling tasks where you'd like to label this picture as a Labrador or maybe as an acoustic guitar. But for some reason, some labels also decide that it's an electric guitar. And so they say, well, what counts in favor of or against each of these? And the approach they take is a relatively straightforward one. They say, let's define a superpixel as a region of pixels within an image that have roughly the same intensity. So if you've ever used Photoshop, the magic selection tool can be adjusted to say, find a region around this point where all the intensities are within some delta of the point that I've picked. And so it'll outline some region of the picture. And what they do is they break up the entire image into these regions, and then they treat those regions as if they were the words in the word-style explanation.
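[Concretely, each superpixel then becomes one binary feature: a perturbed sample "turns off" some regions, for example by painting them a neutral color, and that perturbed image is what gets fed to the complicated classifier. A small sketch; the segmentation call, the mean-color fill, and the names are my illustrative choices, not necessarily the authors' exact pipeline.]

```python
import numpy as np
from skimage.segmentation import slic  # one common way to get superpixels

def mask_superpixels(image, z_binary, segments):
    """Render the image corresponding to one binary perturbation z'.

    image    : (H, W, 3) array
    segments : integer superpixel label per pixel, e.g. from slic(image, n_segments=50)
    z_binary : one 0/1 entry per superpixel label; 0 means "turn this region off"
    Turned-off regions are filled with the image's mean color, which is one plausible
    choice, not necessarily what any particular implementation uses.
    """
    out = image.astype(float)
    fill = out.reshape(-1, out.shape[-1]).mean(axis=0)
    for label in np.unique(segments):
        if z_binary[label] == 0:
            out[segments == label] = fill
    return out

# Hypothetical usage: segment once, then feed masked variants to the classifier.
# segments = slic(image, n_segments=50, compactness=10)
# z = np.ones(segments.max() + 1, dtype=int); z[3] = 0   # switch off superpixel 3
# perturbed = mask_superpixels(image, z, segments)
```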
So they say, well, this looks like an electric guitar to the algorithm. And this looks like an acoustic guitar. And this looks like a Labrador. So some of that makes sense. I mean, you know, that dog's face does kind of look like a Lab. This does look kind of like part of the body and part of the fretwork of a guitar. I have no idea what this stuff is or why this contributes to it being a dog. But such is the nature of these models. But at least it is telling you why it believes these various things.

So then the last thing they do is to say, well, OK, that helps you understand a particular example that the model is applied to. But how do you convince yourself that the model itself is reasonable? And so they say, well, the best technique we know is to show you a bunch of examples. But we want those examples to kind of cover the gamut of places that you might be interested in. And so they say, let's create an explanation matrix, where these are the cases and these are the various features, you know, the top words or the top pixel elements or something, and then we'll fill in the element of the matrix that tells me how strongly this feature is correlated or anti-correlated with the classification for that model. And then it becomes a kind of set-covering issue: find a set of cases that gives me the best coverage of explanations across that set of features. And then with that, I can convince myself that the model is reasonable. So they have this thing called the submodular pick algorithm. And you know, probably if you're interested, you should read the paper. But what they're doing is essentially a kind of greedy search that asks, what should I add next in order to get the best coverage in that space of features by documents?
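[A compact sketch of that greedy selection, in the spirit of the paper's submodular pick; the variable names and the square-root importance score are simplifications of mine, not the paper's exact code.]

```python
import numpy as np

def submodular_pick(W, budget):
    """Greedy coverage over an explanation matrix.

    W[i, j] : weight of feature j in the explanation of instance i
    budget  : how many representative instances to show the user
    Returns the indices of the chosen instances.
    """
    W = np.abs(np.asarray(W, dtype=float))
    importance = np.sqrt(W.sum(axis=0))          # global importance of each feature
    chosen = []
    covered = np.zeros(W.shape[1], dtype=bool)
    for _ in range(budget):
        # pick the instance whose explanation newly covers the most important features
        gains = [importance[(~covered) & (W[i] > 0)].sum() if i not in chosen else -1.0
                 for i in range(W.shape[0])]
        best = int(np.argmax(gains))
        chosen.append(best)
        covered |= W[best] > 0
    return chosen
```

[Each step adds the instance whose explanation touches the most not-yet-covered, globally important features, which is what gives a small set of examples broad coverage.]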
And then they did a bunch of experiments where they said, OK, let's compare the results of these explanations of these simplified models on two sentiment analysis tasks of 2,000 instances each. With bag-of-words features, they looked at decision trees, logistic regression, nearest neighbors, an SVM with a radial basis function kernel, and random forests that use word2vec embeddings (highly non-explainable) with 1,000 trees, and k equal to 10. So they chose 10 features to explain for each of these models. They then did a side calculation that said, what are the 10 most suggestive features for each case? And then they asked, does that covering algorithm identify those features correctly?

And so what they show here is that their method, LIME, does better in every case than random sampling (that's not very surprising), or greedy sampling, or Parzen sampling, which I don't know the details of. But in any case, what this graph is showing is that they're recovering the features that they decided were important in each of these cases. So their recall is up around 90, 90-plus percent. So in fact, the algorithm is identifying the right cases to give you a broad coverage across all the important features that matter in classifying these cases.
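[The evaluation just described boils down to computing recall of the "gold" important features against the features each explanation surfaces. A trivial sketch, under the assumption that both are just sets of feature indices:]

```python
def explanation_recall(gold_features, explained_features):
    """Fraction of the truly important features that the explanation recovered."""
    gold = set(gold_features)
    return len(gold & set(explained_features)) / len(gold)

# e.g. explanation_recall([3, 17, 42, 7], [17, 3, 99, 42]) -> 0.75
```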
775 00:43:13,450 --> 00:43:17,260 So this is like the picture I showed you of the Christianity 776 00:43:17,260 --> 00:43:24,190 versus atheism algorithm, where presumably if you were 777 00:43:24,190 --> 00:43:28,120 a Mechanical Turker and somebody showed you an algorithm that 778 00:43:28,120 --> 00:43:32,860 has very high accuracy but that depends on things like finding 779 00:43:32,860 --> 00:43:38,080 the word NNTP in a classifier for atheism 780 00:43:38,080 --> 00:43:41,860 versus Christianity, you would say, well, maybe that algorithm 781 00:43:41,860 --> 00:43:43,900 isn't going to generalize very well, 782 00:43:43,900 --> 00:43:47,650 because it's depending on something random that 783 00:43:47,650 --> 00:43:50,770 may be correlated with this particular data set. 784 00:43:50,770 --> 00:43:52,840 But if I try it on a different data set, 785 00:43:52,840 --> 00:43:55,060 it's unlikely to work. 786 00:43:55,060 --> 00:43:58,100 So that was one of the tasks. 787 00:43:58,100 --> 00:44:02,260 And then they asked them to identify features 788 00:44:02,260 --> 00:44:05,440 like that that looked bad. 789 00:44:05,440 --> 00:44:12,580 They then ran this Christianity versus atheism test 790 00:44:12,580 --> 00:44:17,560 and had a separate test set of about 800 additional web 791 00:44:17,560 --> 00:44:21,340 pages from this website. 792 00:44:21,340 --> 00:44:24,910 The underlying model was a support vector machine 793 00:44:24,910 --> 00:44:29,320 with RBF kernels trained on the 20 newsgroup data-- 794 00:44:29,320 --> 00:44:31,330 I don't know if you know that data set, 795 00:44:31,330 --> 00:44:35,680 but it's a well-known, publicly available data set. 796 00:44:35,680 --> 00:44:40,890 They got 100 Mechanical Turkers and they said, OK, we're 797 00:44:40,890 --> 00:44:44,100 going to present each of them six documents 798 00:44:44,100 --> 00:44:50,370 and six features per document in order to ask them to make this choice. 799 00:44:50,370 --> 00:44:55,080 And then they did an auxiliary experiment in which they said, 800 00:44:55,080 --> 00:45:01,260 if you see words that are no good in this experiment, just 801 00:45:01,260 --> 00:45:02,790 strike them out. 802 00:45:02,790 --> 00:45:06,090 And that will tell us which of the features 803 00:45:06,090 --> 00:45:12,170 were bad in this method. 804 00:45:12,170 --> 00:45:18,340 And what they found was that the human subjects choosing 805 00:45:18,340 --> 00:45:22,840 between two classifiers were pretty 806 00:45:22,840 --> 00:45:28,150 good at figuring out which was the better classifier. 807 00:45:28,150 --> 00:45:32,360 Now, this is better by their judgment. 808 00:45:32,360 --> 00:45:36,440 And so they said, OK, this submodular pick algorithm-- 809 00:45:36,440 --> 00:45:38,920 which is the one that I didn't describe in detail, 810 00:45:38,920 --> 00:45:41,770 but it's this set covering algorithm-- 811 00:45:41,770 --> 00:45:45,760 gives you better results than a random pick algorithm that 812 00:45:45,760 --> 00:45:47,590 just says pick random features. 813 00:45:47,590 --> 00:45:49,240 Again, not totally surprising. 814 00:45:52,150 --> 00:45:54,430 And the other thing that's interesting 815 00:45:54,430 --> 00:45:59,020 is if you do the feature engineering experiment, 816 00:45:59,020 --> 00:46:06,740 it shows that as the Turkers interacted with the system, 817 00:46:06,740 --> 00:46:08,800 the system became better.
818 00:46:08,800 --> 00:46:12,250 So they started off with real world accuracy 819 00:46:12,250 --> 00:46:14,440 of just under 60%. 820 00:46:14,440 --> 00:46:17,740 And using the better of their algorithms, 821 00:46:17,740 --> 00:46:23,360 they reached about 75% after three rounds of interaction. 822 00:46:23,360 --> 00:46:27,320 So the users could say, I don't like this feature. 823 00:46:27,320 --> 00:46:31,570 And then the system would give them better features. 824 00:46:31,570 --> 00:46:34,660 Now, they tried a similar thing with images. 825 00:46:34,660 --> 00:46:38,760 And so this one is a little funny. 826 00:46:38,760 --> 00:46:42,750 So they trained a deliberately lousy classifier 827 00:46:42,750 --> 00:46:45,240 to classify between wolves and huskies. 828 00:46:49,870 --> 00:46:51,370 This is a famous example. 829 00:46:51,370 --> 00:46:56,860 Also it turns out that huskies live in Alaska and so-- 830 00:46:56,860 --> 00:47:01,720 and wolves-- I guess some wolves do, but most wolves don't. 831 00:47:01,720 --> 00:47:04,990 And so on the data set 832 00:47:04,990 --> 00:47:09,520 that was used in that original problem formulation, 833 00:47:09,520 --> 00:47:15,850 there was an extremely accurate classifier that was trained. 834 00:47:15,850 --> 00:47:18,730 And when they went to look to see what it had learned, 835 00:47:18,730 --> 00:47:22,490 basically it had learned to look for snow. 836 00:47:22,490 --> 00:47:26,060 And if it saw snow in the picture, it said it's a husky. 837 00:47:26,060 --> 00:47:29,750 And if it didn't see snow in the picture, it said it's a wolf. 838 00:47:29,750 --> 00:47:32,990 So that turns out to be pretty accurate for the sample 839 00:47:32,990 --> 00:47:34,020 that they had. 840 00:47:34,020 --> 00:47:39,230 But of course, it's not a very sophisticated classification 841 00:47:39,230 --> 00:47:43,160 algorithm because it's possible to put 842 00:47:43,160 --> 00:47:45,590 a wolf in a snowy picture and it's 843 00:47:45,590 --> 00:47:49,580 possible to have your husky indoors with no snow. 844 00:47:49,580 --> 00:47:53,540 And then you're just missing the boat on this classification. 845 00:47:53,540 --> 00:47:58,400 So these guys built a particularly bad classifier 846 00:47:58,400 --> 00:48:01,760 by making sure all the wolves in the training set 847 00:48:01,760 --> 00:48:04,670 had snow in the picture and none of the huskies did. 848 00:48:07,350 --> 00:48:11,340 And then they presented cases to graduate students like you guys 849 00:48:11,340 --> 00:48:14,530 with machine learning backgrounds-- 850 00:48:14,530 --> 00:48:16,830 10 balanced test predictions. 851 00:48:16,830 --> 00:48:19,630 But they put one ringer in each category. 852 00:48:19,630 --> 00:48:23,280 So they put in one husky in snow and one wolf 853 00:48:23,280 --> 00:48:25,260 who was not in snow. 854 00:48:25,260 --> 00:48:29,370 And the comparison was between pre and post experiment 855 00:48:29,370 --> 00:48:31,380 trust and understanding. 856 00:48:31,380 --> 00:48:34,530 And so before the experiment, they 857 00:48:34,530 --> 00:48:37,590 said that 10 of the 27 students said 858 00:48:37,590 --> 00:48:42,480 they trusted this bad model that they trained. 859 00:48:42,480 --> 00:48:46,830 And afterwards, only 3 out of 27 trusted it.
860 00:48:46,830 --> 00:48:50,070 So this is a kind of sociological experiment 861 00:48:50,070 --> 00:48:54,000 that says, yes, we can actually change people's minds 862 00:48:54,000 --> 00:48:57,750 about whether a model is a good or a bad one based 863 00:48:57,750 --> 00:48:59,790 on an experiment. 864 00:48:59,790 --> 00:49:03,780 Before, only 12 out of 27 students 865 00:49:03,780 --> 00:49:08,610 mentioned snow as a potential feature in this classifier, 866 00:49:08,610 --> 00:49:11,770 whereas afterwards almost everybody did. 867 00:49:11,770 --> 00:49:17,160 So again, this tells you that the method is providing 868 00:49:17,160 --> 00:49:20,310 some useful information. 869 00:49:20,310 --> 00:49:26,120 Now this paper set off a lot of work, including 870 00:49:26,120 --> 00:49:27,860 a lot of critiques of the work. 871 00:49:27,860 --> 00:49:31,830 And so this is one particular one from just a few months ago, 872 00:49:31,830 --> 00:49:33,870 the end of December. 873 00:49:33,870 --> 00:49:42,350 And what these guys say is that that distance function, which 874 00:49:42,350 --> 00:49:46,580 includes a sigma, which is sort of the scale of distance 875 00:49:46,580 --> 00:49:49,670 that we're willing to go, is pretty arbitrary. 876 00:49:49,670 --> 00:49:53,780 In the experiments that the original authors did, 877 00:49:53,780 --> 00:49:58,760 they set that distance to 75% of the square root 878 00:49:58,760 --> 00:50:01,316 of the dimensionality of the data set. 879 00:50:01,316 --> 00:50:03,050 And you go, OK. 880 00:50:03,050 --> 00:50:04,820 I mean, that's a number. 881 00:50:04,820 --> 00:50:07,490 But it's not obvious that that's the best 882 00:50:07,490 --> 00:50:10,280 number or the right number. 883 00:50:10,280 --> 00:50:14,720 And so these guys argue that it's 884 00:50:14,720 --> 00:50:17,750 important to tune the size of the neighborhood 885 00:50:17,750 --> 00:50:20,720 according to how far z, the point that you're 886 00:50:20,720 --> 00:50:24,180 trying to explain, is from the boundary. 887 00:50:24,180 --> 00:50:26,430 So if it's close to the boundary, 888 00:50:26,430 --> 00:50:29,540 then you ought to take a smaller region 889 00:50:29,540 --> 00:50:31,640 for your proximity measure. 890 00:50:31,640 --> 00:50:33,350 And if it's far from the boundary, 891 00:50:33,350 --> 00:50:35,210 you can take a larger one-- which addresses the question you guys 892 00:50:35,210 --> 00:50:37,970 were asking about what happens if you 893 00:50:37,970 --> 00:50:39,930 pick a point in the middle. 894 00:50:39,930 --> 00:50:43,070 And so they show some nice examples 895 00:50:43,070 --> 00:50:48,680 of places where, for instance, when you look at explaining 896 00:50:48,680 --> 00:50:52,520 this green point, you get a nice green line that 897 00:50:52,520 --> 00:50:54,680 follows the local boundary. 898 00:50:54,680 --> 00:50:56,690 But explaining the blue point, which 899 00:50:56,690 --> 00:51:01,220 is close to a corner of the actual decision boundary, 900 00:51:01,220 --> 00:51:05,030 you get a line that's not very different from the green one. 901 00:51:05,030 --> 00:51:08,080 And similarly for the red point. 902 00:51:08,080 --> 00:51:10,170 And so they say, well, we really need 903 00:51:10,170 --> 00:51:12,660 to work on that distance function.
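[Just to pin down what is being criticized, here is a minimal sketch of that proximity weighting. The exponential form and the default of 75% of the square root of the dimensionality match what the lecture describes; the function name and the use of plain Euclidean distance are illustrative assumptions.]

```python
import numpy as np

def proximity_weight(x, z, sigma=None):
    """Weight a perturbed sample z by how close it is to the point x being explained.

    Sketch of the kernel under discussion: exp(-d^2 / sigma^2), with sigma
    defaulting to 75% of the square root of the data's dimensionality --
    the "arbitrary" choice the critique is aimed at.
    """
    if sigma is None:
        sigma = 0.75 * np.sqrt(x.shape[0])
    d = np.linalg.norm(x - z)              # spherical: only distance matters, not direction
    return np.exp(-(d ** 2) / sigma ** 2)
```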
904 00:51:12,660 --> 00:51:18,250 And so they come up with a method 905 00:51:18,250 --> 00:51:23,350 that they call LEAFAGE, which basically says, remember, 906 00:51:23,350 --> 00:51:29,380 what LIME did is it sampled nonexistent cases, 907 00:51:29,380 --> 00:51:32,350 simplified nonexistent cases. 908 00:51:32,350 --> 00:51:35,320 But here they're going to sample existing cases. 909 00:51:35,320 --> 00:51:38,440 So they're going to learn from the training-- 910 00:51:38,440 --> 00:51:40,580 the original training set. 911 00:51:40,580 --> 00:51:45,790 But they're going to sample it by proximity to the example 912 00:51:45,790 --> 00:51:49,400 that they're trying to explain. 913 00:51:49,400 --> 00:51:52,790 And they argue that this is a good idea because, for example, 914 00:51:52,790 --> 00:51:56,240 in law, the notion of precedent is 915 00:51:56,240 --> 00:52:00,170 that you get to argue that this case is very similar to some 916 00:52:00,170 --> 00:52:02,990 previously decided case, and therefore it 917 00:52:02,990 --> 00:52:05,060 should be decided the same way. 918 00:52:05,060 --> 00:52:08,780 I mean, Supreme Court arguments are always all about that. 919 00:52:08,780 --> 00:52:11,870 Lower court arguments are sometimes 920 00:52:11,870 --> 00:52:15,540 more driven by what the law actually says. 921 00:52:15,540 --> 00:52:19,820 But case law has been well established in British law, 922 00:52:19,820 --> 00:52:23,510 and then by inheritance in American law, 923 00:52:23,510 --> 00:52:27,200 for many, many centuries. 924 00:52:27,200 --> 00:52:30,230 So they say, well, case-based reasoning normally 925 00:52:30,230 --> 00:52:32,960 involves retrieving a similar case, 926 00:52:32,960 --> 00:52:38,330 adapting it, and then learning that as a new precedent. 927 00:52:38,330 --> 00:52:42,140 And they also argue for contrastive justification, 928 00:52:42,140 --> 00:52:45,410 which is not only why did you choose x, but why 929 00:52:45,410 --> 00:52:49,310 did you choose x rather than y, as giving 930 00:52:49,310 --> 00:52:52,790 a more satisfying and a more insightful 931 00:52:52,790 --> 00:52:56,450 explanation of how some model is working. 932 00:52:56,450 --> 00:52:58,730 So they say, OK, similar setup. 933 00:52:58,730 --> 00:53:02,090 f solves the classification problem 934 00:53:02,090 --> 00:53:06,080 where x is the data and y is some binary class, 935 00:53:06,080 --> 00:53:09,410 you know, 0 or 1, if you like. 936 00:53:09,410 --> 00:53:12,110 The training set is a bunch of x's. 937 00:53:12,110 --> 00:53:16,340 y sub true is the actual answer. y predicted 938 00:53:16,340 --> 00:53:20,930 is what f predicts on that x. 939 00:53:20,930 --> 00:53:26,910 And to explain f of z equals some particular outcome, 940 00:53:26,910 --> 00:53:32,850 you can define the allies of a case 941 00:53:32,850 --> 00:53:36,410 as ones that come up with the same answer. 942 00:53:36,410 --> 00:53:39,290 And you can define the enemies as ones 943 00:53:39,290 --> 00:53:43,560 that come up with a different answer. 944 00:53:43,560 --> 00:53:48,450 So now you're going to sample both the allies and the enemies 945 00:53:48,450 --> 00:53:51,740 according to a new distance function.
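[A minimal sketch of that ally/enemy split, assuming a fitted scikit-learn-style classifier f with a predict method. The exponential similarity used for the sampling weights here is just a placeholder; the paper's actual biased distance function is what the next part of the lecture describes.]

```python
import numpy as np

def allies_and_enemies(f, X_train, z, sigma=1.0):
    """Split the training set by whether the model classifies each case like z.

    f is assumed to have a scikit-learn-style predict(); the proximity
    weighting is a simple stand-in, not the LEAFAGE authors' distance.
    """
    y_pred = f.predict(X_train)               # model's answer on every training case
    y_z = f.predict(z.reshape(1, -1))[0]      # model's answer on the case being explained
    allies = X_train[y_pred == y_z]           # cases classified the same way as z
    enemies = X_train[y_pred != y_z]          # cases classified differently from z

    def weight(X):                            # sample more heavily near z
        d = np.linalg.norm(X - z, axis=1)
        return np.exp(-(d ** 2) / sigma ** 2)

    return (allies, weight(allies)), (enemies, weight(enemies))
```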
946 00:53:51,740 --> 00:53:55,390 And the intuition they had is that the reason 947 00:53:55,390 --> 00:53:59,570 that the distance function in the original LIME work 948 00:53:59,570 --> 00:54:02,090 wasn't working very well is because it 949 00:54:02,090 --> 00:54:04,550 was a spherical distance function 950 00:54:04,550 --> 00:54:06,740 in n dimensional space. 951 00:54:06,740 --> 00:54:09,470 And so they're going to bias it by saying 952 00:54:09,470 --> 00:54:12,560 that the distance, this b, is going 953 00:54:12,560 --> 00:54:17,480 to be some combination of the difference 954 00:54:17,480 --> 00:54:22,490 in the linear predictions plus the distance between the two 955 00:54:22,490 --> 00:54:24,020 points. 956 00:54:24,020 --> 00:54:27,890 And so the contour lines of the first term 957 00:54:27,890 --> 00:54:29,840 are these circular contour lines. 958 00:54:29,840 --> 00:54:31,720 This is what LIME was doing. 959 00:54:31,720 --> 00:54:34,400 The contour lines of the second term 960 00:54:34,400 --> 00:54:37,730 are these linear gradients. 961 00:54:37,730 --> 00:54:42,230 And they add them to get sort of oval-shaped things. 962 00:54:42,230 --> 00:54:46,310 And this is what gives you that desired feature 963 00:54:46,310 --> 00:54:50,060 of being more sensitive to how close this point is 964 00:54:50,060 --> 00:54:53,020 to the decision boundary. 965 00:54:53,020 --> 00:54:58,810 Again, there are a lot of relatively hairy details, which 966 00:54:58,810 --> 00:55:01,690 I'm going to elide in the class today. 967 00:55:01,690 --> 00:55:04,870 But they're definitely in the paper. 968 00:55:04,870 --> 00:55:09,520 So they also did a user study on some very simple prediction 969 00:55:09,520 --> 00:55:10,580 models. 970 00:55:10,580 --> 00:55:14,350 So this was how much is your house worth based on things 971 00:55:14,350 --> 00:55:18,580 like how big is it and what year was it built in 972 00:55:18,580 --> 00:55:22,640 and what's some subjective quality judgment of it? 973 00:55:22,640 --> 00:55:28,330 And so what they show is that you 974 00:55:28,330 --> 00:55:34,540 can find examples that are the allies and the enemies 975 00:55:34,540 --> 00:55:39,070 of this house in order to do the prediction. 976 00:55:39,070 --> 00:55:41,020 So then they apply their algorithm. 977 00:55:41,020 --> 00:55:43,210 And it works. 978 00:55:43,210 --> 00:55:45,120 It gives you better answers. 979 00:55:45,120 --> 00:55:48,230 I'll have to go find that slide somewhere. 980 00:55:48,230 --> 00:55:48,730 All right. 981 00:55:48,730 --> 00:55:57,580 So that's all I'm going to say about this idea of using 982 00:55:57,580 --> 00:56:00,670 simplified models in the local neighborhood 983 00:56:00,670 --> 00:56:05,940 of individual cases in order to explain something. 984 00:56:05,940 --> 00:56:09,040 I wanted to talk about two other topics. 985 00:56:09,040 --> 00:56:12,120 So this was a paper by some of my students 986 00:56:12,120 --> 00:56:17,250 recently in which they're looking at medical images 987 00:56:17,250 --> 00:56:20,460 and trying to generate radiology reports 988 00:56:20,460 --> 00:56:23,010 from those medical images. 989 00:56:23,010 --> 00:56:24,990 I mean, you know, machine learning 990 00:56:24,990 --> 00:56:27,120 can solve all problems.
991 00:56:27,120 --> 00:56:29,510 I give you a collection of images 992 00:56:29,510 --> 00:56:32,040 and a collection of radiology reports, 993 00:56:32,040 --> 00:56:36,810 should be straightforward to build a model that now takes 994 00:56:36,810 --> 00:56:39,810 new radiological images and produces 995 00:56:39,810 --> 00:56:45,130 new radiology reports that are understandable, accurate, et 996 00:56:45,130 --> 00:56:45,760 cetera. 997 00:56:45,760 --> 00:56:47,940 I'm joking, of course. 998 00:56:51,820 --> 00:56:54,830 But the approach they took was kind of interesting. 999 00:56:54,830 --> 00:56:57,980 So they've taken a standard image encoder. 1000 00:56:57,980 --> 00:56:59,920 And then before the pooling layer, 1001 00:56:59,920 --> 00:57:05,820 they take essentially an image embedding from the next 1002 00:57:05,820 --> 00:57:11,430 to last layer of this image encoding algorithm. 1003 00:57:11,430 --> 00:57:16,260 And then they feed that into a word decoder and word 1004 00:57:16,260 --> 00:57:18,030 generator. 1005 00:57:18,030 --> 00:57:21,540 And the idea is to get things that 1006 00:57:21,540 --> 00:57:26,610 appear in the image that correspond to words that appear 1007 00:57:26,610 --> 00:57:32,490 in the report to wind up in the same place in the embedding 1008 00:57:32,490 --> 00:57:34,350 space. 1009 00:57:34,350 --> 00:57:36,340 And so again, there's a lot of hair. 1010 00:57:36,340 --> 00:57:42,030 It's an LSTM-based encoder. 1011 00:57:42,030 --> 00:57:45,330 And it's modeled as a sentence decoder. 1012 00:57:45,330 --> 00:57:47,840 And within that, there is a word decoder, 1013 00:57:47,840 --> 00:57:51,840 and then there's a generator that generates these reports. 1014 00:57:51,840 --> 00:57:54,210 And it uses reinforcement learning. 1015 00:57:54,210 --> 00:57:57,360 And you know, tons of hair. 1016 00:57:57,360 --> 00:58:03,510 But here's what I wanted to show you, which is interesting. 1017 00:58:03,510 --> 00:58:08,570 So the encoder takes a bunch of spatial image features. 1018 00:58:08,570 --> 00:58:13,160 The sentence decoder uses these image features in addition 1019 00:58:13,160 --> 00:58:19,340 to the linguistic features, the word embeddings that 1020 00:58:19,340 --> 00:58:21,290 are fed into it. 1021 00:58:21,290 --> 00:58:28,080 And then for ground truth annotation, 1022 00:58:28,080 --> 00:58:32,010 they also use a remote annotation method, which 1023 00:58:32,010 --> 00:58:36,000 is this CheXpert program, which is a rule-based program out 1024 00:58:36,000 --> 00:58:39,210 of Stanford that reads radiology reports 1025 00:58:39,210 --> 00:58:43,320 and identifies features in the report that it thinks 1026 00:58:43,320 --> 00:58:45,840 are important and correct. 1027 00:58:45,840 --> 00:58:50,250 So it's not always correct, of course. 1028 00:58:50,250 --> 00:58:57,150 But that's used in order to guide the generator. 1029 00:58:57,150 --> 00:59:00,370 So here's an example. 1030 00:59:00,370 --> 00:59:06,250 So this is an image of a chest and the ground truth-- 1031 00:59:06,250 --> 00:59:08,940 so this is the actual radiology report-- 1032 00:59:08,940 --> 00:59:10,950 says cardiomegaly is moderate. 1033 00:59:10,950 --> 00:59:14,080 Bibasilar atelectasis is mild. 1034 00:59:14,080 --> 00:59:16,710 There's no pneumothorax. Lower cervical spinal 1035 00:59:16,710 --> 00:59:18,990 fusion is partially visualized. 1036 00:59:18,990 --> 00:59:22,470 Healed right rib fractures are incidentally noted.
1037 00:59:22,470 --> 00:59:26,220 By the way, I've stared at hundreds of radiological images 1038 00:59:26,220 --> 00:59:27,240 like this. 1039 00:59:27,240 --> 00:59:35,800 I could never figure out that this image says that. 1040 00:59:35,800 --> 00:59:39,610 But that's why radiologists train for many, many years 1041 00:59:39,610 --> 00:59:42,210 to become good at this stuff. 1042 00:59:42,210 --> 00:59:44,450 So there was a previous program done 1043 00:59:44,450 --> 00:59:50,150 by others called TieNet which generates the following report. 1044 00:59:50,150 --> 00:59:52,940 It says AP portable upright view of the chest. 1045 00:59:52,940 --> 00:59:56,330 There's no focal consolidation, effusion, 1046 00:59:56,330 --> 00:59:57,680 or pneumothorax. 1047 00:59:57,680 --> 01:00:01,850 The cardiomediastinal silhouette is normal. 1048 01:00:01,850 --> 01:00:04,860 Imaged osseous structures are intact. 1049 01:00:04,860 --> 01:00:07,310 So if you compare this to that, you 1050 01:00:07,310 --> 01:00:11,240 say, well, if the cardiomediastinal silhouette 1051 01:00:11,240 --> 01:00:19,340 is normal, then where is the lower cervical spinal 1052 01:00:19,340 --> 01:00:23,120 fusion that's partially visualized? Because that's 1053 01:00:23,120 --> 01:00:24,860 along the middle. 1054 01:00:24,860 --> 01:00:27,770 And so these are not quite consistent. 1055 01:00:27,770 --> 01:00:30,920 So the system that these students built 1056 01:00:30,920 --> 01:00:33,760 says there's mild enlargement of the cardiac silhouette. 1057 01:00:33,760 --> 01:00:37,280 There is no pleural effusion or pneumothorax. 1058 01:00:37,280 --> 01:00:40,890 And there's no acute osseous abnormalities. 1059 01:00:40,890 --> 01:00:44,870 So it also missed the healed right rib fractures 1060 01:00:44,870 --> 01:00:46,940 that were incidentally noted. 1061 01:00:46,940 --> 01:00:50,780 But anyway, it's-- you know, the remarkable thing about 1062 01:00:50,780 --> 01:00:54,800 a singing dog is not how well it sings but the fact that it 1063 01:00:54,800 --> 01:00:55,610 sings at all. 1064 01:00:58,360 --> 01:01:00,270 And the reason I included this work 1065 01:01:00,270 --> 01:01:02,630 is not to convince you that this is 1066 01:01:02,630 --> 01:01:07,830 going to replace radiologists anytime soon, 1067 01:01:07,830 --> 01:01:12,030 but that it had an interesting explanation facility. 1068 01:01:12,030 --> 01:01:15,180 And the explanation facility uses 1069 01:01:15,180 --> 01:01:18,570 attention, which is part of its model, 1070 01:01:18,570 --> 01:01:22,800 to say, hey, when we reach some conclusion, 1071 01:01:22,800 --> 01:01:26,130 we can point back into the image and say 1072 01:01:26,130 --> 01:01:28,560 what part of the image corresponds 1073 01:01:28,560 --> 01:01:31,320 to that part of the conclusion. 1074 01:01:31,320 --> 01:01:32,980 And so this is pretty interesting. 1075 01:01:32,980 --> 01:01:37,620 You see, in upright and lateral views of the chest, in red-- 1076 01:01:37,620 --> 01:01:41,870 well, that's kind of the chest in red. 1077 01:01:41,870 --> 01:01:47,250 There's moderate cardiomegaly, so here the green 1078 01:01:47,250 --> 01:01:50,570 certainly shows you where your heart is. 1079 01:01:50,570 --> 01:01:51,820 OK. 1080 01:01:51,820 --> 01:01:55,270 About there and a little bit to the left. 1081 01:01:55,270 --> 01:01:58,150 And there's no pleural effusion or pneumothorax. 1082 01:01:58,150 --> 01:01:59,890 This one is kind of funny. 1083 01:01:59,890 --> 01:02:02,020 That's the blue region.
1084 01:02:02,020 --> 01:02:08,010 So how do you show me that there isn't something? 1085 01:02:08,010 --> 01:02:11,310 And we were surprised, actually, the way 1086 01:02:11,310 --> 01:02:14,070 it showed us that there isn't something 1087 01:02:14,070 --> 01:02:17,640 is to highlight everything outside of anything 1088 01:02:17,640 --> 01:02:20,330 that you might be interested in, which 1089 01:02:20,330 --> 01:02:26,300 is not exactly convincing that there's no pleural effusion. 1090 01:02:26,300 --> 01:02:28,410 And here's another example. 1091 01:02:28,410 --> 01:02:32,220 There is no relevant change, tracheostomy tube in place, 1092 01:02:32,220 --> 01:02:36,360 so that is showing it roughly, a little too wide. 1093 01:02:36,360 --> 01:02:39,630 But it's showing roughly where a tracheostomy tube might be. 1094 01:02:43,860 --> 01:02:47,305 Bilateral pleural effusion and compressive atelectasis. 1095 01:02:47,305 --> 01:02:51,480 Atelectasis is when your lung tissues stick together. 1096 01:02:51,480 --> 01:02:54,920 And so that does often happen in the lower part of the lung. 1097 01:02:54,920 --> 01:02:58,410 And again, the negative shows you everything 1098 01:02:58,410 --> 01:03:02,100 that's not part of the action. 1099 01:03:02,100 --> 01:03:03,172 Yeah? 1100 01:03:03,172 --> 01:03:04,465 AUDIENCE: [INAUDIBLE]. 1101 01:03:08,060 --> 01:03:08,685 PROFESSOR: Yes. 1102 01:03:08,685 --> 01:03:12,917 AUDIENCE: [INAUDIBLE] 1103 01:03:12,917 --> 01:03:13,500 PROFESSOR: No. 1104 01:03:13,500 --> 01:03:15,600 It's trying to predict the whole model-- 1105 01:03:15,600 --> 01:03:16,413 the whole node. 1106 01:03:16,413 --> 01:03:19,080 AUDIENCE: And it's not easier to have, like, one node for, like, 1107 01:03:19,080 --> 01:03:19,883 each [INAUDIBLE]? 1108 01:03:19,883 --> 01:03:20,550 PROFESSOR: Yeah. 1109 01:03:20,550 --> 01:03:22,290 But these guys were ambitious. 1110 01:03:22,290 --> 01:03:28,050 You know, they-- what was it? 1111 01:03:28,050 --> 01:03:31,500 Geoff Hinton said a few years ago that he wouldn't 1112 01:03:31,500 --> 01:03:33,690 want his children to become radiologists 1113 01:03:33,690 --> 01:03:37,650 because that field is going to be replaced by computers. 1114 01:03:37,650 --> 01:03:40,650 I think that was a stupid thing to say, especially 1115 01:03:40,650 --> 01:03:43,320 when you look at the state of the art of how 1116 01:03:43,320 --> 01:03:45,090 well these things work. 1117 01:03:45,090 --> 01:03:47,520 But if that were true, then you would, in fact, 1118 01:03:47,520 --> 01:03:50,820 want something that is able to produce an entire radiology 1119 01:03:50,820 --> 01:03:51,750 report. 1120 01:03:51,750 --> 01:03:53,760 So the motivation is there. 1121 01:03:53,760 --> 01:03:56,010 Now, after this work was done, we 1122 01:03:56,010 --> 01:04:02,020 ran into this interesting paper from Northeastern, which says-- 1123 01:04:02,020 --> 01:04:06,930 but listen guys-- attention is not explanation. 1124 01:04:06,930 --> 01:04:07,750 OK. 1125 01:04:07,750 --> 01:04:10,090 So attention is clearly a mechanism 1126 01:04:10,090 --> 01:04:16,640 that's very useful in all kinds of machine learning methods. 1127 01:04:16,640 --> 01:04:20,110 But you shouldn't confuse it with an explanation.
1128 01:04:20,110 --> 01:04:24,160 So they say, well, it's the assumption 1129 01:04:24,160 --> 01:04:27,400 that the input units that 1130 01:04:27,400 --> 01:04:29,830 are accorded high attention weights are 1131 01:04:29,830 --> 01:04:32,560 responsible for the model outputs. 1132 01:04:32,560 --> 01:04:34,610 And that may not be true. 1133 01:04:34,610 --> 01:04:37,540 And so what they did is they did a bunch of experiments 1134 01:04:37,540 --> 01:04:40,090 where they studied the correlation 1135 01:04:40,090 --> 01:04:48,820 between the attention weights and the gradients of the model 1136 01:04:48,820 --> 01:04:53,230 parameters to see whether, in fact, the words that 1137 01:04:53,230 --> 01:04:56,410 had high attention were the ones that 1138 01:04:56,410 --> 01:05:00,980 were most decisive in making a decision in the model. 1139 01:05:00,980 --> 01:05:04,700 And they found that the evidence of correlation 1140 01:05:04,700 --> 01:05:08,660 between intuitive feature importance measures, including 1141 01:05:08,660 --> 01:05:11,360 gradient and feature erasure approaches-- so this 1142 01:05:11,360 --> 01:05:15,440 is ablation studies-- and learned attention weights is weak. 1143 01:05:15,440 --> 01:05:17,930 And so they did a bunch of experiments. 1144 01:05:17,930 --> 01:05:22,200 There are a lot of controversies about this particular study. 1145 01:05:22,200 --> 01:05:27,800 But what you find is that if you calculate the concordance, 1146 01:05:27,800 --> 01:05:32,750 you know, on different data sets using different models, 1147 01:05:32,750 --> 01:05:37,080 you see that, for example, the concordance is not very high. 1148 01:05:37,080 --> 01:05:40,790 It's less than a half for this data set. 1149 01:05:40,790 --> 01:05:46,000 And you know, some of it is below 0, 1150 01:05:46,000 --> 01:05:48,190 so the opposite, for this data set. 1151 01:05:50,980 --> 01:05:55,690 Interestingly, things like diabetes, 1152 01:05:55,690 --> 01:05:59,890 which come from the MIMIC data, have narrower bounds 1153 01:05:59,890 --> 01:06:01,100 than some of the others. 1154 01:06:01,100 --> 01:06:05,710 So they seem to have a more definitive conclusion, at least 1155 01:06:05,710 --> 01:06:06,415 for the study. 1156 01:06:10,760 --> 01:06:12,450 OK. 1157 01:06:12,450 --> 01:06:17,460 Let me finish off by talking about the opposite idea. 1158 01:06:17,460 --> 01:06:20,130 So rather than building a complicated model 1159 01:06:20,130 --> 01:06:23,100 and then trying to explain it in simple ways, 1160 01:06:23,100 --> 01:06:26,250 what if we just built a simple model? 1161 01:06:26,250 --> 01:06:29,190 And Cynthia Rudin, who's now at Duke, 1162 01:06:29,190 --> 01:06:32,460 used to be at the Sloan School at MIT, 1163 01:06:32,460 --> 01:06:35,890 has been championing this idea for many years. 1164 01:06:35,890 --> 01:06:40,440 And so she has come up with a bunch of different ideas 1165 01:06:40,440 --> 01:06:42,890 for how to build simple models that 1166 01:06:42,890 --> 01:06:45,750 trade off maybe a little bit of accuracy in order 1167 01:06:45,750 --> 01:06:47,580 to be explainable. 1168 01:06:47,580 --> 01:06:51,780 And one of her favorites is this thing called a falling rule 1169 01:06:51,780 --> 01:06:52,560 list. 1170 01:06:52,560 --> 01:06:59,130 So this is an example for a mammographic mass data set.
1171 01:06:59,130 --> 01:07:05,340 So it says, if some lump has an irregular shape 1172 01:07:05,340 --> 01:07:08,250 and the patient is over 60 years old, 1173 01:07:08,250 --> 01:07:13,050 then there's an 85% malignancy risk, 1174 01:07:13,050 --> 01:07:16,500 and there are 230 cases in which that happened. 1175 01:07:19,450 --> 01:07:23,810 If this is not the case, then if the lump has 1176 01:07:23,810 --> 01:07:25,270 a spiculated margin-- 1177 01:07:25,270 --> 01:07:28,330 so it has little spikes coming out of it-- 1178 01:07:28,330 --> 01:07:31,900 and the patient is over 45, then there's 1179 01:07:31,900 --> 01:07:34,930 a 78% chance of malignancy. 1180 01:07:34,930 --> 01:07:38,770 And otherwise, if the margin is kind of fuzzy, the edge of it 1181 01:07:38,770 --> 01:07:42,860 is kind of fuzzy, and the patient is over 60, 1182 01:07:42,860 --> 01:07:46,340 then there's a 69% chance. 1183 01:07:46,340 --> 01:07:48,820 And if it has an irregular shape, 1184 01:07:48,820 --> 01:07:51,590 then there's a 63% chance. 1185 01:07:51,590 --> 01:07:55,040 And if it's lobular and the density is high, 1186 01:07:55,040 --> 01:07:58,010 then there's a 39% chance. 1187 01:07:58,010 --> 01:08:01,060 And if it's round and the patient is over 60, 1188 01:08:01,060 --> 01:08:03,520 then there's a 26% chance. 1189 01:08:03,520 --> 01:08:07,300 Otherwise, there's a 10% chance. 1190 01:08:07,300 --> 01:08:13,420 And the argument is that that description of the model, 1191 01:08:13,420 --> 01:08:16,600 of the decision-making model, is simple enough 1192 01:08:16,600 --> 01:08:20,615 that even doctors can understand it. 1193 01:08:20,615 --> 01:08:21,850 You're supposed to laugh. 1194 01:08:25,029 --> 01:08:26,870 Now, there are still some problems. 1195 01:08:26,870 --> 01:08:29,680 So one of them is-- notice some of these 1196 01:08:29,680 --> 01:08:33,100 are age greater than 60, age greater than 45, 1197 01:08:33,100 --> 01:08:34,930 age greater than 60. 1198 01:08:34,930 --> 01:08:39,460 It's not quite obvious what categories that's defining. 1199 01:08:39,460 --> 01:08:42,700 And in principle, it could be different ages 1200 01:08:42,700 --> 01:08:44,620 in different ones. 1201 01:08:44,620 --> 01:08:46,420 But here's how they build it. 1202 01:08:46,420 --> 01:08:48,850 So this is a very simple model that's 1203 01:08:48,850 --> 01:08:52,609 built by a very complicated process. 1204 01:08:52,609 --> 01:08:56,189 So the simple model is the one I've just shown you. 1205 01:08:56,189 --> 01:08:59,300 There's a Bayesian approach, a Bayesian generative approach, 1206 01:08:59,300 --> 01:09:03,109 where they have a bunch of hyperparameters, falling rule list 1207 01:09:03,109 --> 01:09:04,939 parameters, theta-- 1208 01:09:04,939 --> 01:09:07,010 they calculate a likelihood, which 1209 01:09:07,010 --> 01:09:10,100 is given a particular theta, how likely 1210 01:09:10,100 --> 01:09:14,090 are you to get the answers that are actually in your data given 1211 01:09:14,090 --> 01:09:17,450 the model that you generate? 1212 01:09:17,450 --> 01:09:21,260 And they start with a possible set of if clauses. 1213 01:09:21,260 --> 01:09:25,040 So they do frequent clause mining 1214 01:09:25,040 --> 01:09:29,779 to say what conditions, what binary conditions occur 1215 01:09:29,779 --> 01:09:32,552 frequently together in the database.
1216 01:09:32,552 --> 01:09:34,010 And those are the only ones they're 1217 01:09:34,010 --> 01:09:36,229 going to consider because, of course, 1218 01:09:36,229 --> 01:09:39,229 the number of possible clauses is vast 1219 01:09:39,229 --> 01:09:42,140 and they don't want to have to iterate through those. 1220 01:09:42,140 --> 01:09:46,960 And then for each set of-- for each clause, 1221 01:09:46,960 --> 01:09:51,109 they calculate a risk score which 1222 01:09:51,109 --> 01:09:56,750 is generated by a probability distribution 1223 01:09:56,750 --> 01:10:02,240 under the constraint that the risk score for the next clause 1224 01:10:02,240 --> 01:10:06,020 is lower than or equal to the risk score for the previous clause. 1225 01:10:15,110 --> 01:10:16,370 There are lots of details. 1226 01:10:16,370 --> 01:10:20,570 So there is this frequent itemset mining algorithm. 1227 01:10:20,570 --> 01:10:25,070 It turns out that choosing r sub l 1228 01:10:25,070 --> 01:10:29,480 to be the logs of products of real numbers 1229 01:10:29,480 --> 01:10:32,390 is an important step in order to guarantee 1230 01:10:32,390 --> 01:10:37,460 that monotonicity constraint in a simple way. 1231 01:10:37,460 --> 01:10:40,160 l, the number of clauses, is drawn 1232 01:10:40,160 --> 01:10:42,440 from a Poisson distribution. 1233 01:10:42,440 --> 01:10:44,540 And you give it a kind of scale that 1234 01:10:44,540 --> 01:10:47,300 says roughly how many clauses would you 1235 01:10:47,300 --> 01:10:54,350 be willing to tolerate in your falling rule list? 1236 01:10:54,350 --> 01:10:58,160 And then there's a lot of computational hair 1237 01:10:58,160 --> 01:11:00,350 where they do-- 1238 01:11:00,350 --> 01:11:04,460 they get maximum a posteriori probability estimation 1239 01:11:04,460 --> 01:11:08,600 by using a simulated annealing algorithm. 1240 01:11:08,600 --> 01:11:13,190 So they basically generate some clauses 1241 01:11:13,190 --> 01:11:17,930 and then they use swap, replace, add, and delete operators 1242 01:11:17,930 --> 01:11:21,260 in order to try different variations. 1243 01:11:21,260 --> 01:11:24,600 And they're doing hill climbing in that space. 1244 01:11:24,600 --> 01:11:26,480 There's also some Gibbs sampling, 1245 01:11:26,480 --> 01:11:29,540 because once you have one of these models, 1246 01:11:29,540 --> 01:11:34,060 simply calculating how accurate it is is not straightforward. 1247 01:11:34,060 --> 01:11:36,110 There's not a closed form way of doing it. 1248 01:11:36,110 --> 01:11:40,730 And so they're doing sampling in order to try to generate that. 1249 01:11:40,730 --> 01:11:42,620 So it's a bunch of hair. 1250 01:11:42,620 --> 01:11:45,870 And again, the paper describes it all. 1251 01:11:45,870 --> 01:11:50,320 But what's interesting is that on a 30 day hospital 1252 01:11:50,320 --> 01:11:55,030 readmission data set with about 8,000 patients, 1253 01:11:55,030 --> 01:11:59,920 they used about 34 features, like impaired mental status, 1254 01:11:59,920 --> 01:12:04,540 difficult behavior, chronic pain, feels unsafe, et cetera. 1255 01:12:04,540 --> 01:12:08,950 They mined rules, or clauses, with support in more than 5% 1256 01:12:08,950 --> 01:12:13,150 of the database and no more than two conditions. 1257 01:12:13,150 --> 01:12:16,600 They set the expected length of the decision list 1258 01:12:16,600 --> 01:12:18,820 to be eight clauses.
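[Before the comparison numbers, here is a minimal sketch of what the finished artifact looks like at prediction time, using the mammography rules from a few minutes ago. The dictionary keys are made-up encodings of those conditions, and the risks are the ones read off the slide; the point is simply that the rules are checked in order and the attached risks only ever fall.]

```python
# Sketch of applying a falling rule list, using the mammography example from
# earlier in the lecture. The feature names are hypothetical encodings of the
# slide's conditions.
FALLING_RULES = [
    (lambda p: p["irregular_shape"] and p["age"] > 60,    0.85),
    (lambda p: p["spiculated_margin"] and p["age"] > 45,  0.78),
    (lambda p: p["ill_defined_margin"] and p["age"] > 60, 0.69),
    (lambda p: p["irregular_shape"],                      0.63),
    (lambda p: p["lobular_shape"] and p["high_density"],  0.39),
    (lambda p: p["round_shape"] and p["age"] > 60,        0.26),
]
DEFAULT_RISK = 0.10  # the final "otherwise" clause

def malignancy_risk(patient):
    """Return the risk attached to the first rule that fires, else the default.

    Because the list is falling, the risks are non-increasing as you go down,
    which is the monotonicity constraint the training procedure enforces.
    """
    for condition, risk in FALLING_RULES:
        if condition(patient):
            return risk
    return DEFAULT_RISK

# Example (hypothetical patient):
# malignancy_risk({"irregular_shape": True, "spiculated_margin": False,
#                  "ill_defined_margin": False, "lobular_shape": False,
#                  "high_density": False, "round_shape": False, "age": 72})
# -> 0.85
```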
1259 01:12:18,820 --> 01:12:21,520 And then they compared the decision model 1260 01:12:21,520 --> 01:12:25,600 they got to SVMs, random forests, logistic regression, 1261 01:12:25,600 --> 01:12:29,470 CART, and an inductive logic programming approach. 1262 01:12:29,470 --> 01:12:33,410 And shockingly to me, their method-- 1263 01:12:33,410 --> 01:12:35,440 the falling rule list method-- 1264 01:12:35,440 --> 01:12:41,830 got an AUC of about 0.8, whereas all the others did like 0.79, 1265 01:12:41,830 --> 01:12:47,410 0.75. Logistic regression, as usual, 1266 01:12:47,410 --> 01:12:50,460 outperformed the one they got slightly. 1267 01:12:50,460 --> 01:12:51,250 Right? 1268 01:12:51,250 --> 01:12:54,160 But this is interesting, because their argument 1269 01:12:54,160 --> 01:12:58,180 is that this representation of the model 1270 01:12:58,180 --> 01:13:02,470 is much easier to understand than even a logistic regression 1271 01:13:02,470 --> 01:13:06,700 model for most human users. 1272 01:13:06,700 --> 01:13:09,700 And also, if you look at-- 1273 01:13:09,700 --> 01:13:13,690 these are just various runs and the different models. 1274 01:13:13,690 --> 01:13:18,610 And their model has a pretty decent AUC up here. 1275 01:13:18,610 --> 01:13:22,750 I think the green one is the logistic regression one. 1276 01:13:22,750 --> 01:13:28,870 And it's slightly better because it outperforms their best model 1277 01:13:28,870 --> 01:13:33,160 in the region of low false positive rates, which may 1278 01:13:33,160 --> 01:13:34,480 be where you want to operate. 1279 01:13:34,480 --> 01:13:37,060 So that may actually be a better model. 1280 01:13:42,250 --> 01:13:45,990 So here's their readmission rule list. 1281 01:13:45,990 --> 01:13:49,190 And it says if the patient has bed sores 1282 01:13:49,190 --> 01:13:53,120 and has a history of not showing up for appointments, 1283 01:13:53,120 --> 01:13:55,910 then there's a 33% probability that they'll 1284 01:13:55,910 --> 01:13:59,410 be readmitted within 30 days. 1285 01:13:59,410 --> 01:14:04,820 If-- I think some note says poor prognosis and maximum care, 1286 01:14:04,820 --> 01:14:05,510 et cetera. 1287 01:14:05,510 --> 01:14:08,870 So this is the result that they came up with. 1288 01:14:08,870 --> 01:14:12,650 Now, by the way, we've talked a little bit about 30 day 1289 01:14:12,650 --> 01:14:15,780 readmission predictions. 1290 01:14:15,780 --> 01:14:21,360 And getting over about 70% is not bad in that domain 1291 01:14:21,360 --> 01:14:24,690 because it's just not that easily predictable who's 1292 01:14:24,690 --> 01:14:28,060 going to wind up back in the hospital within 30 days. 1293 01:14:28,060 --> 01:14:31,300 So these models are actually doing quite well, 1294 01:14:31,300 --> 01:14:35,740 and certainly understandable in these terms. 1295 01:14:35,740 --> 01:14:39,750 They also tried on a variety of University 1296 01:14:39,750 --> 01:14:44,470 of California-Irvine machine learning data sets. 1297 01:14:44,470 --> 01:14:47,500 These are just random public data sets. 1298 01:14:47,500 --> 01:14:49,987 And they tried building these falling rule 1299 01:14:49,987 --> 01:14:52,890 list models to make predictions. 1300 01:14:52,890 --> 01:14:56,130 And what you see is that the AUCs are pretty good. 1301 01:14:56,130 --> 01:14:59,700 So on the spam detection data set, 1302 01:14:59,700 --> 01:15:02,820 their system gets about 91. 1303 01:15:02,820 --> 01:15:06,030 Logistic regression, again, gets 97.
1304 01:15:06,030 --> 01:15:11,010 So you know, part of the unfortunate lesson that we 1305 01:15:11,010 --> 01:15:14,460 teach in almost every example in this class 1306 01:15:14,460 --> 01:15:17,550 is that simple models like logistic regression 1307 01:15:17,550 --> 01:15:19,240 often do quite well. 1308 01:15:19,240 --> 01:15:23,040 But remember, here they're optimizing for explainability 1309 01:15:23,040 --> 01:15:27,250 rather than for getting the right answer. 1310 01:15:27,250 --> 01:15:32,310 So they're willing to sacrifice some accuracy in their model 1311 01:15:32,310 --> 01:15:35,160 in order to develop a result that 1312 01:15:35,160 --> 01:15:37,590 is easy to explain to people. 1313 01:15:37,590 --> 01:15:42,150 So again, there are many variations on this type of work 1314 01:15:42,150 --> 01:15:44,910 where people have different notions of what counts 1315 01:15:44,910 --> 01:15:48,740 as a simple, explainable model. 1316 01:15:48,740 --> 01:15:51,020 But that's a very different approach 1317 01:15:51,020 --> 01:15:54,710 than the LIME approach, which says build the hairy model 1318 01:15:54,710 --> 01:16:00,020 and then produce local explanations for why 1319 01:16:00,020 --> 01:16:04,110 it makes certain decisions on particular cases. 1320 01:16:04,110 --> 01:16:04,610 All right. 1321 01:16:04,610 --> 01:16:08,150 I think that's all I'm going to say about explainability. 1322 01:16:08,150 --> 01:16:10,460 This is a very hot topic at the moment, 1323 01:16:10,460 --> 01:16:12,440 and so there are lots of papers. 1324 01:16:12,440 --> 01:16:14,720 I think there's-- I just saw a call for a conference 1325 01:16:14,720 --> 01:16:18,810 on explainable machine learning models. 1326 01:16:18,810 --> 01:16:23,550 So there's more and more work in this area. 1327 01:16:23,550 --> 01:16:28,050 So with that, we come to the end of our course. 1328 01:16:28,050 --> 01:16:29,300 And I just wanted-- 1329 01:16:29,300 --> 01:16:35,120 I just went through the front page of the course website 1330 01:16:35,120 --> 01:16:36,530 and listed all the topics. 1331 01:16:36,530 --> 01:16:41,670 So we've covered quite a lot of stuff, right? 1332 01:16:41,670 --> 01:16:45,070 You know, what makes health care different? 1333 01:16:45,070 --> 01:16:48,510 And we talked about what clinical care is all about 1334 01:16:48,510 --> 01:16:53,070 and what clinical data is like and risk stratification, 1335 01:16:53,070 --> 01:16:56,970 survival modeling, physiological time series, how 1336 01:16:56,970 --> 01:17:00,510 to interpret clinical text in a couple of lectures, 1337 01:17:00,510 --> 01:17:03,240 translating technology into the clinic. 1338 01:17:03,240 --> 01:17:06,450 The italicized ones were guest lectures, so 1339 01:17:06,450 --> 01:17:08,580 machine learning for cardiology and machine 1340 01:17:08,580 --> 01:17:11,010 learning for differential diagnosis, 1341 01:17:11,010 --> 01:17:14,730 machine learning for pathology, for mammography. 1342 01:17:14,730 --> 01:17:17,550 David gave a couple of lectures on causal inference 1343 01:17:17,550 --> 01:17:21,270 and reinforcement learning, and then David and a guest-- 1344 01:17:21,270 --> 01:17:24,270 whom I didn't note here-- 1345 01:17:24,270 --> 01:17:27,030 did disease progression and subtyping.
1346 01:17:27,030 --> 01:17:29,130 We talked about precision medicine 1347 01:17:29,130 --> 01:17:33,270 and the role of genetics, automated clinical workflows, 1348 01:17:33,270 --> 01:17:36,990 the lecture on regulation, and then recently fairness, 1349 01:17:36,990 --> 01:17:40,800 robustness to data set shift, and interpretability. 1350 01:17:40,800 --> 01:17:42,840 So that's quite a lot. 1351 01:17:42,840 --> 01:17:48,810 I think we're-- we the staff are pretty happy with how the class 1352 01:17:48,810 --> 01:17:50,100 has gone. 1353 01:17:50,100 --> 01:17:53,770 It was our first time as this crew teaching it. 1354 01:17:53,770 --> 01:17:56,910 And we hope to do it again. 1355 01:17:56,910 --> 01:18:03,150 I can't stop without giving an immense vote of gratitude 1356 01:18:03,150 --> 01:18:06,060 to Irene and Willy, without whom we 1357 01:18:06,060 --> 01:18:08,976 would have been totally sunk. 1358 01:18:08,976 --> 01:18:12,380 [APPLAUSE] 1359 01:18:16,060 --> 01:18:18,970 And I also want to acknowledge David's vision in putting 1360 01:18:18,970 --> 01:18:20,960 this course together. 1361 01:18:20,960 --> 01:18:25,750 He taught a sort of half-size version of a class like this 1362 01:18:25,750 --> 01:18:27,880 a couple of years ago and thought 1363 01:18:27,880 --> 01:18:31,330 that it would be a good idea to expand it into a full semester 1364 01:18:31,330 --> 01:18:36,610 regular course and got me on board to work with him. 1365 01:18:36,610 --> 01:18:39,440 And I want to thank you all for your hard work. 1366 01:18:39,440 --> 01:18:42,000 And I'm looking forward to--