DAVID SONTAG: So I'll begin today's lecture by giving a brief recap of risk stratification. We didn't get to finish talking about survival modeling on Thursday, so I'll go a little bit more into that, and I'll answer some of the questions that arose during our discussions and on Piazza since. And then for the vast majority of today's lecture we'll be talking about a new topic-- in particular, physiological time series modeling. I'll give two examples of physiological time series modeling-- the first one coming from monitoring patients in intensive care units, and the second one asking a very different type of question: that of diagnosing patients' heart conditions using EKGs. Both of these correspond to readings that you had for today's lecture, and we'll go into much more depth into those papers today, and I'll provide much more color around them.

So just to briefly remind you where we were on Thursday, we talked about how one could formalize risk stratification not as a classification problem of what would happen in, let's say, some predefined time period, but rather as a regression question, or regression task: given what you know about a patient at time zero, predict the time to event. So, for example, here the event might be death, divorce, college graduation. For patient one, that event happened at time step nine. For patient two, that event happened at time step 12. And for patient four, we don't know when that event happened, because it was censored. In particular, after time step seven, we no longer get to view any of the patients' data, and so we don't know when that red dot would be-- some time in the future, or never. So this is what we mean by right-censored data, which is precisely what survival modeling is aiming to solve. Are there questions about this setup first?

AUDIENCE: You flipped the x on--

DAVID SONTAG: Yeah, I realized that. I flipped the x and the o in today's presentation, but that's not relevant.
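To make the setup concrete, here is a minimal sketch (in Python, with made-up numbers that only loosely mirror the slide) of how right-censored data is typically represented: one observed time per patient, plus an indicator saying whether that time is the event time or just the time at which we stopped observing.

```python
import numpy as np

# Toy time-to-event data, loosely mirroring the slide: patient 1's event is
# observed at time 9, patient 2's at time 12, and patient 4 is censored at
# time 7 -- all we know is that its event happens after 7, or possibly never.
observed_time = np.array([9.0, 12.0, 5.0, 7.0])   # patient 3's value is made up
censored = np.array([0, 0, 0, 1])                 # 1 = censored, 0 = event observed
```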
DAVID SONTAG: So f of t is the probability of death, or of the event occurring, at time t. And although on this slide I'm showing it as an unconditional model, in general you should think about this as a conditional density. So you might be conditioning on some covariates or features that you have for that patient at baseline. And very important for survival modeling, and for the next things I'll tell you, is the survival function, denoted as capital S of t. That's simply 1 minus the cumulative density function. So it's the probability that the event time, which is denoted here as capital T, is greater than some little t. So it's this function, which is simply given to you by the integral from t to infinity of the density.

So in pictures, this is the density. On the x-axis is time. The y-axis is the density function. And this black curve is what I'm denoting as f of t. And this white area is capital S of t, the survival probability, or survival function. Yes?

AUDIENCE: So I just want to be clear. So if you were to integrate the entire curve, [INAUDIBLE] by infinity you're going to be [INAUDIBLE]..

DAVID SONTAG: In the way that I've described it up to here, yes, because we're talking about the time to event. But often we might be in scenarios where the event may never occur, and you can formalize that in a couple of different ways. You could put a point mass at infinity, or you could simply say that the integral from 0 to infinity is some quantity less than 1. And the readings that I'm referencing at the very bottom of these slides show you how you can very easily modify all of the frameworks I'm telling you about here to deal with that scenario where the event may never occur. But for the purposes of my presentation, you can assume that the event will always occur at some point.
It's a very minor modification where you, in essence, divide the densities by a constant, which accounts for the fact that they wouldn't integrate to one otherwise.

Now, a key question that has to be solved when trying to use a parametric approach to survival modeling is, what should that f of t look like? What should that density function look like? What I'm showing you here is a table of some very commonly used density functions. In the right-hand column is the density function f of t itself. Lambda denotes some parameter of the model; t is the time. And in the middle column is the survival function. For these particular parametric forms, it's obtained by analytically solving that integral from t to infinity-- this is the analytic solution for it. These go by the common names of exponential, Weibull, log-normal, and so on. And critically, all of these have support only on the positive real numbers, because the event can never occur at a negative time.
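As a small numeric illustration of two rows of that table, here is a sketch of the exponential and Weibull survival functions under one common parameterization (the table on the slide may parameterize them differently):

```python
import numpy as np

# One common parameterization (the slide's table may differ):
#   Exponential: f(t) = lam * exp(-lam * t),   S(t) = exp(-lam * t)
#   Weibull:     S(t) = exp(-(lam * t) ** k),  which reduces to the exponential when k = 1
def exponential_survival(t, lam):
    return np.exp(-lam * t)

def weibull_survival(t, lam, k):
    return np.exp(-(lam * t) ** k)

t = np.linspace(0.0, 10.0, 6)
print(exponential_survival(t, lam=0.3))
print(weibull_survival(t, lam=0.3, k=1.5))
```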
Now, we live in a day and age where we no longer have to make standard parametric assumptions for densities. We could, for example, try to formalize the density as the output of some deep neural network. There are two ways to try to do that. One way would be to say that we're going to model the distribution f of t as one of these parametric forms, where lambda, or whatever the parameters of the distribution are, is given by the output of, let's say, a deep neural network on the covariates x. So that would be one approach. A very different approach would be a non-parametric distribution, where you say, OK, I'm going to define f of t extremely flexibly, not as one of these forms.

With the non-parametric route, one runs into a slightly different challenge, because, as I'll show you on the next slide, to do maximum likelihood estimation of these distributions from censored data, one needs to make use of this survival function, S of t. And so if your f of t is complex, and you don't have a nice analytic solution for S of t, then you're going to have to somehow use a numerical approximation of S of t during learning. So it's definitely possible, but it's going to be a little bit more effort.

So now here's where I'm going to get into maximum likelihood estimation of these distributions, and to define for you the likelihood function, I'm going to break it down into two different settings. The first setting is an observation which is uncensored, meaning we do observe when the event-- death, for example-- occurs. And in that case, the probability of the event is very simple. It's just the probability of the event occurring at capital T-- the random variable T equals little t-- which is just f of t. Done.

However, what happens if, for this data point, you don't observe when the event occurred because of censoring? Well, of course, you could just throw away that data point and not use it in your estimation, but that's precisely what we mentioned at the very beginning of last week's lecture-- the goal of survival modeling is to not do that, because if we did, it would introduce bias into our estimation procedure. So we would like to be able to use the observation that this data point was censored, but the only information we can get from that observation is that capital T, the event time, must have occurred some time larger than the observed time of censoring, which is little t here. So we don't know precisely when capital T was, but we know it's something larger than the observed censoring time, little t. And that, remember, is precisely what the survival function is capturing.

So for a censored observation, we're going to use capital S of t within the likelihood. We can then combine these two, for censored and uncensored data, and what we get is the following likelihood objective-- I'm showing you here the log likelihood objective. Recall from last week that little b sub i simply denotes whether this observation is censored or not. So if bi is 1, it means the time that you're given is the time of the censoring event. And if bi is 0, it means the time you're given is the time that the event occurs. So what we're going to do is sum over all of the data points in your data set, from little i equals 1 to little n, of bi times the log of the probability under the censored model, plus 1 minus bi times the log of the probability under the uncensored model. And so this bi is just going to switch which of these two you're going to use for that given data point.
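Here is a minimal numerical sketch of that objective for the exponential model from the table above; the data and variable names are illustrative, and b = 1 marks a censored observation, matching the convention just described.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy data: observed times and censoring indicators (b = 1 means censored).
t = np.array([9.0, 12.0, 3.0, 7.0, 5.0])
b = np.array([0, 0, 0, 1, 0])

def neg_log_likelihood(lam):
    # Uncensored points contribute log f(t) = log(lam) - lam * t;
    # censored points contribute log S(t) = -lam * t.
    log_f = np.log(lam) - lam * t
    log_s = -lam * t
    return -np.sum(b * log_s + (1 - b) * log_f)

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
print(res.x)                    # numerical maximum likelihood estimate of lambda
print((1 - b).sum() / t.sum())  # closed-form MLE for the exponential case, as a check
```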
So the learning objective for maximum likelihood estimation here is very similar to what you're used to in learning distributions, with the big difference that, for censored data, we're going to use the survival function to estimate its probability. Are there any questions?

And this, of course, could then be optimized via your favorite algorithm, whether it be stochastic gradient descent, a second-order method, and so on. Yep?

AUDIENCE: I have a question about the, kind of, side point. You mentioned that we could use [INAUDIBLE]..

DAVID SONTAG: Yes.

AUDIENCE: And then combine it with the parametric approach.

DAVID SONTAG: Yes.

AUDIENCE: So is it true that we still have the parametric assumption, and we just map the input to the parameters?

DAVID SONTAG: Exactly. That's exactly right. So consider the following picture, where-- this is time, t, and this is f of t. You can imagine that for any one patient you might have a different function, but they might all be of the same parametric form.
So they might be like that, or maybe they're shifted a little bit. So you should think about each of these three curves as being from the same parametric family of distributions, but with different means. And in this case, the mean is given by the output of the deep neural network. So that would be the way it would be used, and then one could just backpropagate in the usual way to do learning. Yep?

AUDIENCE: Can you repeat what b sub i is?

DAVID SONTAG: Excuse me?

AUDIENCE: Could you repeat what b sub i is?

DAVID SONTAG: b sub i is just an indicator of whether the i-th data point was censored or not censored. Yes?

AUDIENCE: So [INAUDIBLE] equal it's more a probability density function [INAUDIBLE].

DAVID SONTAG: Cumulative density function.

AUDIENCE: Yeah, but [INAUDIBLE] probability. No, for the [INAUDIBLE] it's a probability density function.

DAVID SONTAG: Yes, so just to--

AUDIENCE: [INAUDIBLE]

DAVID SONTAG: Excuse me?

AUDIENCE: Will it be any problem to combine those two types there?

DAVID SONTAG: That's a very good question. So the observation was that you have two different types of probabilities used here. In this case, we're using something like the cumulative density, whereas here we're using the probability density function. The question was, are these two on different scales? Does it make sense to combine them in this type of linear fashion with the same weighting? And I think it does make sense. So think about a setting where you have a very small time range. You're not exactly sure when this event occurs-- it's somewhere in this time range. In the setting of the censored data, where that time range could potentially be very large, the log probability your model is providing is somehow going to be much more flat, because you're covering much more probability mass.
And so that observation, I think, is intuitively likely to have a bit of a smaller effect on the overall learning algorithm. For these observations, you know precisely where they are, and so as you deviate from that, you incur the corresponding log loss penalty. But I do think that it makes sense to have them on the same scale. If anyone in the room has done work with [INAUDIBLE] modeling and has a different answer to that, I'd love to hear it. Not today, but maybe someone in the future will answer this question differently. I'm going to move on for now.

So the remaining question that I want to talk about today is how one evaluates survival models. We talked about binary classification a lot in the context of risk stratification at the beginning, and we talked about how area under the ROC curve is one measure of classification performance, but here we're doing something more akin to regression, not classification. A standard measure that's used to quantify performance is known as the C-statistic, or concordance index-- those are one and the same-- and it's defined as follows. It has a very intuitive definition. It sums over pairs of data points that can be compared to one another, and it asks: for the event that actually occurs earlier, does the model assign a larger likelihood of it having happened by that time than it assigns to the event that occurs later? I'm going to first illustrate it with this picture, and then I'll work through the math.

So here's the picture, and then we'll talk about the math. What I'm showing you here is every single observation in your data set, sorted by either the censoring time or the event time. In black, I'm illustrating uncensored data points, and in red, I'm denoting censored data points. Now, here we see that for this data point, the event happened before this data point's censoring event.
Now, since this data point was censored, you could think of its true event time as being some time in the far future. So what we would want is for the model to say that the probability that this event happens by this time is larger than the probability that this event happens by this time, because this one actually occurred first. And these two are comparable to each other. On the other hand, it wouldn't make sense to compare y2 and y4, because both of those were censored data points, and we don't know precisely when they occurred. For example, it could very well have happened that event 2 happened after event 4.

So what I'm showing you here with each of these lines are the pairwise comparisons that are actually possible to make. You can make pairwise comparisons, of course, between any pair of events that actually did occur, and you can make pairwise comparisons between censored events and events that occurred before them. Now, if you look at this formula, it's looking at an indicator involving the survival functions of pairs of data points. And which pairs of data points? Precisely those pairs whose comparisons I'm showing with these blue lines here. So we're going to sum over i such that bi is equal to 0-- remember, that means it is an uncensored data point-- and then we compare yi to all other yj that have a value greater than it, both censored and uncensored.

Now, if your data had no censored data points in it, then you can verify that, in fact, this corresponds-- well, there's one other assumption one has to make, which is this: suppose that your outcome is binary. You might wonder how you get a binary outcome from this. Imagine that your density function looked a little bit like this, where the event could occur either at time 1 or at time 2-- so something like that. If the event can occur at only two times, not a whole range of times, then this is analogous to a binary outcome. And if you have a binary outcome like this and no censoring, then, in fact, the C-statistic is exactly equal to the area under the ROC curve. So that just connects it a little bit back to things we're used to.
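Here is a minimal sketch of that concordance computation, using a generic per-patient risk score in place of the survival-function comparison on the slide; ties and other edge cases are ignored for brevity, and mature implementations (for example, the concordance_index utility in the lifelines package) handle more detail.

```python
import numpy as np

def c_statistic(times, censored, risk_scores):
    # A pair (i, j) is comparable when i's event was observed (censored[i] == 0)
    # and times[i] < times[j]; it counts as concordant when the model assigns
    # the earlier event, i, the higher risk.
    num, den = 0, 0
    n = len(times)
    for i in range(n):
        if censored[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                den += 1
                if risk_scores[i] > risk_scores[j]:
                    num += 1
    return num / den

times = np.array([2.0, 5.0, 7.0, 9.0])
censored = np.array([0, 0, 1, 0])
risk = np.array([0.9, 0.6, 0.4, 0.2])      # hypothetical model outputs
print(c_statistic(times, censored, risk))  # 1.0 here: every comparable pair is concordant
```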
Yep?

AUDIENCE: Just to make sure that I understand. So y1 is going to be, we observed an event, and y2 is going to be, we know that no event occurred until that day?

DAVID SONTAG: Every dot corresponds to one observation, either censored or not.

AUDIENCE: Thank you.

DAVID SONTAG: And they're sorted. In this figure, they're sorted by the time of either the censoring or the event occurring.

So the C-statistic is one way to measure the performance of your survival modeling, but you might remember that when we talked about binary classification, we said that area under the ROC curve by itself is very limiting, and so we should think through other performance metrics of relevance. So here are a few other things that you could do. One thing you could do is use the mean squared error-- so, again, thinking about this as a regression problem. But of course, that only makes sense for uncensored data points. So focus on just the uncensored data points, and look to see how well we're doing at predicting when the event occurs. The second thing one could do, since you have the ability to define the likelihood of an observation, censored or not censored, is hold out data and look at the held-out likelihood, or log likelihood, of that held-out data. And the third thing you could do is, after learning using this survival modeling framework, turn it into a binary classification problem by, for example, artificially choosing time ranges-- like, greater than three months is 1, less than three months is 0.
That would be one crude definition. And then once you've done a reduction to a binary classification problem, you could use all of the existing performance metrics you're used to thinking about for binary classification to evaluate the performance there-- things like positive predictive value, for example. And you could, of course, choose different reductions and get different performance statistics out. So this is just a small subset of ways to try to evaluate survival modeling, but it's a very, very rich literature. And again, at the bottom of these slides I've pointed you to several references that you can go to to learn more.

The final comment I wanted to make is that I've only told you about one estimator in today's lecture, which is known as the likelihood-based estimator. But there is a whole other estimation approach for survival modeling, which is very important to know about, called partial likelihood estimators. And for those of you who have heard of Cox proportional hazards models-- and I know they were discussed in Friday's recitation-- that's an example of a class of model that's commonly used with this partial likelihood estimator.

Now, at a very intuitive level, what this partial likelihood estimator is doing is working with something like the C-statistic. Notice how the C-statistic only looks at the relative orderings of the event occurrences. It doesn't care about exactly when an event occurred. In some sense, there's a constant in this survival function which could be divided out from both sides of this inequality, and it wouldn't affect anything about the statistic. And so one could think about other ways of learning these models by saying, well, we want to learn a survival function such that it gets the ordering correct between data points.
Now, such a survival function wouldn't do a very good job of getting the precise time at which an event occurs-- there's no reason it would. But if your goal were just to figure out the sorted order of patients by risk, so that you're going to do an intervention on the 10 most risky people, then getting that order correct is going to be enough. And that's precisely the intuition behind these partial likelihood estimators-- they focus on something which is a little bit less than the original goal, but in doing so they can have much better statistical complexity, meaning the amount of data they need in order to fit these models well. And again, this is a very rich topic. All I wanted to do is give you a pointer to it so that you can go read more about it if it's of interest to you.

So now, moving on in the recap, one of the most important points that we discussed last week was about non-stationarity. And there was a really interesting question posted to Piazza, which is, how do you actually deal with non-stationarity? I spoke a lot about it existing, and I talked about how to test for it, but I didn't say what to do if you have it. So I thought this was such an interesting question that I would also talk about it a bit during lecture.

The short answer is, if you have to have a solution that you deploy tomorrow, then here's the hack that sometimes works. You take your most recent data, like the last three months' data, and you hope that there's not much non-stationarity within the last three months. You throw out all the historical data, and you just train using the most recent data. That's a bit unsatisfying, because you might now have extremely little data left to learn with, but if you have enough volume, it might be good enough. The really interesting question from a research perspective, though, is how you could optimally use that historical data. So here are three different ways. One way has to do with imputation.
Imagine that the way in which your data was non-stationary was that there were, let's say, periods of time when certain features were just unavailable. I gave you this example last week of laboratory test results across time, and I showed you how there are sometimes really big blocks of time where no lab tests are available, or very few are. Well, luckily we live in a world with high-dimensional data, and what that means is that there's often a lot of redundancy in the data. So what you could imagine doing is imputing the features that you observe to be missing, such that the missingness properties, in fact, aren't changing as much across time after imputation. And if you do that as a pre-processing step, it may allow you to make use of much more of the historical data.

A different approach, which is intimately tied to that, has to do with transforming the data-- instead of imputing it, transforming it into another representation altogether, such that that representation is invariant across time. And here I'm giving you a reference to this paper by Ganin et al. from the Journal of Machine Learning Research 2016, which talks about how to do domain-invariant learning of neural networks, and that's one approach to doing so. And I view those two as being very similar-- imputation and transformation.

A second approach is to re-weight the data to look like the current data. So imagine that you go back in time, and you say, you know what? ICD-10 codes, for some very weird reason-- this is not true, by the way-- ICD-10 codes in this untrue world happened to be used between March and April of 2003. And then they weren't used again until 2015. So instead of throwing away all of the previous data, we're going to recognize that that three-month interval 10 years ago was actually drawn from a very similar distribution as what we're going to be testing on today. So we're going to weight those data points up very much, and down-weight the data points that are less like the ones from today. That's the intuition behind these re-weighting approaches, and we're going to talk much more about that in the context of causal inference-- not because these two have much to do with each other, but because they end up using a very similar technique for dealing with data set shift, or covariate shift.
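The lecture gives the intuition rather than a specific recipe, but one standard way to get such weights is a density-ratio estimate from a classifier trained to distinguish recent from historical examples; here is a minimal sketch with synthetic data and illustrative names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for historical and recent feature matrices.
rng = np.random.default_rng(0)
X_hist = rng.normal(0.0, 1.0, size=(500, 10))
X_recent = rng.normal(0.5, 1.0, size=(100, 10))

# Train a classifier to tell "recent" (1) from "historical" (0) examples.
X = np.vstack([X_hist, X_recent])
z = np.concatenate([np.zeros(len(X_hist)), np.ones(len(X_recent))])
clf = LogisticRegression(max_iter=1000).fit(X, z)

# Importance weights for the historical points: up-weight the ones that look
# like today's data, down-weight the ones that don't.
p_recent = clf.predict_proba(X_hist)[:, 1]
weights = p_recent / (1.0 - p_recent)
weights *= len(weights) / weights.sum()   # normalize to mean 1
# These weights could then be passed as sample_weight when fitting the risk model.
```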
And the final technique that I'll mention is based on online learning algorithms. The idea there is that there might be cut points-- change points-- across time. So maybe the data looks one way up until this change point, then suddenly the data looks really different until this next change point, and then suddenly the data looks very different on into the future. Here I'm showing you a case where there are two change points at which data set shift happens. What these online learning algorithms do is say, OK, suppose we were forced to make predictions throughout this time period, using only the historical data to make predictions at each point in time. Well, if we could somehow recognize that there might be these shifts, we could design algorithms that are going to be robust to those shifts. And then one could try to mathematically analyze those algorithms based on the amount of regret they would have relative to, for example, an algorithm that knew exactly when those changes occurred. And of course, we don't know precisely when those changes occurred. So there's a whole field of algorithms trying to do that, and here I'm just giving one citation to a recent work.

So, to conclude risk stratification-- this is the last slide here. (Maybe ask your question after class.) We've talked about two approaches for formalizing risk stratification-- first as binary classification, and second as regression. And in the regression framework, one has to think about censoring, which is why we call it survival modeling.
Second, in our examples, and again in your homework assignment that's coming up next week, we'll see that often the variables-- the features that are most predictive-- make a lot of sense. In the diabetes case, we saw how patients having comorbidities of diabetes, like hypertension, or patients being obese, were very predictive of patients getting diabetes. So you might ask yourself, is there something causal there? Are those features that are very predictive in fact causing the patient to develop type 2 diabetes-- like, for example, obesity causing diabetes? And this is where I want to caution you. You shouldn't interpret these very predictive features in a causal fashion, particularly not when one starts to work with high-dimensional data, as we do in this course. The reason for that is very subtle, and we'll talk about it in the causal inference lectures, but I just wanted to give you a pointer now that you shouldn't think about it in that way. And you'll understand why in just a few weeks.

And finally, we talked about ways of dealing with missing data. I gave you one feature representation for the diabetes case which was designed to deal with missing data. It asked, was there any diagnosis code 250.01 in the last three months? If there was, you have a 1; if there wasn't, a 0. So it's designed to recognize that you might not have information for some large chunk of time in that window. But that missing data could also be dangerous if the missingness itself leads to non-stationarity, which is then going to result in your test distribution looking different from your training distribution. And that's where approaches based on imputation could actually be very valuable-- not because they improve your predictive accuracy when everything goes right, but because they might improve your predictive accuracy when things go wrong.
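As a small illustration of that combination-- an explicit missingness indicator alongside a simple imputation-- here is a sketch with made-up lab values; the column names and the fill rule are purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up weekly lab values; NaN means the lab was not measured that week.
labs = pd.DataFrame({"a1c": [6.1, np.nan, np.nan, 7.0, np.nan]})

# Keep an explicit indicator of missingness (in the spirit of the
# "was code 250.01 observed in the window?" style of feature), and also
# impute, so downstream models see a complete matrix whose missingness
# pattern depends less on any one era's recording practices.
labs["a1c_missing"] = labs["a1c"].isna().astype(int)
labs["a1c_imputed"] = labs["a1c"].ffill().fillna(labs["a1c"].mean())
print(labs)
```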
And one of your readings for last week's lecture was actually an example of that, where they used a Gaussian process model to impute much of the missing data in a patient's continuous vital signs, and then used a recurrent neural network to predict based on that imputed data. So in that case, there are really two things going on. The first is this robustness to data set shift. But there's a second thing going on as well, which has to do with the trade-off between the amount of data you have and the complexity of the prediction problem. By doing imputation, sometimes you make your problem look a bit simpler, and simpler algorithms might succeed where otherwise they would fail because of not having enough data. And that's something you saw in last week's reading.

So I'm done with risk stratification. I'll take a one-minute breather for everyone in the room, and then we'll start with the main topic of this lecture, which is physiological time-series modeling.

Let's get started. So here's a baby that's not doing very well. This baby is in the intensive care unit. Maybe it was a premature infant. Maybe it's a baby who has some chronic disease. And, of course, the parents are very worried. This baby is getting very close monitoring. It's connected to lots of different probes.
Number one here is illustrating a three-lead ECG, which we'll be talking about much more, and which is measuring how the baby's heart is doing. Over here, number three is something attached to the baby's foot-- it's a pulse oximeter, which is measuring the baby's oxygen saturation, the amount of oxygen in the blood. Number four is a probe which is measuring the baby's temperature, and so on. So we're really taking very close measurements of this baby, because we want to understand how this baby is doing.

We recognize that there might be really sudden changes in the baby's state of health that we want to be able to recognize as early as possible. And so behind the scenes, next to this baby, you'll of course have a huge number of monitors, each of them showing the readout from one of these different signals. This type of data is really prevalent in intensive care units, but you'll also see in today's lecture how some aspects of this data are now starting to make their way into the home as well. For example, EKGs are now available on Apple and Samsung watches to help with the diagnosis of arrhythmias, even for people at home.

And so with this type of data, there are a number of really important use cases to think about. The first one is to recognize that often we're getting really noisy data, and we want to try to infer the true signal. So imagine, for example, the temperature probe. The baby's true temperature might be 98.5, but for whatever reason-- we'll see a few reasons here today-- maybe you're getting an observation of 93. And you don't know: is that actually the baby's true temperature, in which case the baby would be in a lot of trouble, or is that an anomalous reading? We'd like to be able to distinguish between those two things.

In other cases, we're not necessarily interested in fully understanding what's going on with the baby along each of those axes; we just want to use that data for predictive purposes-- for risk stratification, for example. And so the type of machine learning approach that we'll take here will depend on the following three factors. First, do we have labeled data available? For example, do we know the ground truth of what the baby's true temperature was, at least for a few of the babies in the training set? Second, do we have a good mechanistic or statistical model of how this data might evolve across time? We know a lot about hearts, for example.
Cardiology is one of those fields of medicine that's really well studied. There are good simulators of hearts-- how they beat across time, and how that affects the electrical signal measured across the body. And if we have these good mechanistic or statistical models, that can often allow one to trade off not having much labeled data, or just not having much data, period. And it's really the extremes of these three points that I want to illustrate in today's lecture-- what you do when you don't have much data, and what you can do when you have a ton of data. I think that's going to be really informative for us as we go out into the world and have to tackle each of those two settings.

So here's an example of two different babies with very different trajectories. The x-axis here is time in seconds-- I think seconds, maybe minutes. The y-axis here is the baby's heart rate in beats per minute, and you see that in some cases it's really fluctuating a lot up and down, in some cases it's moving in one direction, and in all cases the short-term observations are very different from the long-range trajectories.

So the first problem that I want us to think about is one of trying to understand how we deconvolve the truth of what's going on with, for example, the patient's blood pressure or oxygen from the interventions that are happening to them. On the bottom here, I'm showing examples of interventions. Here, in this oxygen uptake signal, notice how between roughly 1,000 and 2,000 seconds there's suddenly no signal whatsoever. That's an example of what's called dropout. Over here, we see the effect of a different type of intervention, which is due to a probe recalibration. At that time, there was a dropout followed by a sudden change in the values, and that's really happening due to a recalibration step.
759 00:35:52,720 --> 00:35:55,710 And in both of these cases, what's 760 00:35:55,710 --> 00:35:58,132 going on with the individual might be relatively 761 00:35:58,132 --> 00:36:00,090 constant across time, but what's being observed 762 00:36:00,090 --> 00:36:04,240 is dramatically affected by those interventions. 763 00:36:04,240 --> 00:36:06,070 So we want to ask the question, can we 764 00:36:06,070 --> 00:36:08,788 identify those artifactual processes? 765 00:36:08,788 --> 00:36:11,080 Can we identify that these interventions were happening 766 00:36:11,080 --> 00:36:12,080 at those points in time? 767 00:36:15,680 --> 00:36:18,000 And then, if we could identify them, 768 00:36:18,000 --> 00:36:21,120 then we could potentially subtract their effect out. 769 00:36:21,120 --> 00:36:27,210 So we could impute the data, which we know-- now 770 00:36:27,210 --> 00:36:30,390 know to be missing, and then have this much higher quality 771 00:36:30,390 --> 00:36:33,130 signal used for some downstream predictive purpose, 772 00:36:33,130 --> 00:36:34,910 for example. 773 00:36:34,910 --> 00:36:37,510 And the second reason why this can be really important 774 00:36:37,510 --> 00:36:40,660 is to tackle this problem called alarm fatigue. 775 00:36:43,370 --> 00:36:47,030 Alarm fatigue is one of the most important challenges facing 776 00:36:47,030 --> 00:36:48,500 medicine today. 777 00:36:48,500 --> 00:36:52,370 As we get better and better in doing risk stratification, 778 00:36:52,370 --> 00:36:58,700 as we come up with more and more diagnostic tools and tests, 779 00:36:58,700 --> 00:37:02,090 that means these red flags are being raised more and more 780 00:37:02,090 --> 00:37:03,690 often. 781 00:37:03,690 --> 00:37:08,170 And each one of these has some associated false positive rate 782 00:37:08,170 --> 00:37:09,800 for it. 783 00:37:09,800 --> 00:37:13,510 And so the more tests you have-- 784 00:37:13,510 --> 00:37:15,250 suppose the false positive rate is 785 00:37:15,250 --> 00:37:18,160 kept constant-- the more tests you have, the more likely 786 00:37:18,160 --> 00:37:20,140 it is that the union of all of those 787 00:37:20,140 --> 00:37:24,568 is going to be some error. 788 00:37:24,568 --> 00:37:27,540 And so when you're in an intensive care unit, 789 00:37:27,540 --> 00:37:29,500 there are alarms going off all the time. 790 00:37:29,500 --> 00:37:31,630 And something that happens is that nurses end up 791 00:37:31,630 --> 00:37:35,110 starting to ignore those alarms, because so often 792 00:37:35,110 --> 00:37:37,480 those alarms are false positives, 793 00:37:37,480 --> 00:37:39,700 are due to, for example, artifacts 794 00:37:39,700 --> 00:37:41,835 like what I'm showing you here. 795 00:37:41,835 --> 00:37:43,960 And so if we had techniques, such as the ones we'll 796 00:37:43,960 --> 00:37:47,680 talk about right now, which could recognize when, 797 00:37:47,680 --> 00:37:50,470 for example, the sudden drop in a patient's heart rate 798 00:37:50,470 --> 00:37:54,940 is due to an artifact and not due to the patient's true heart 799 00:37:54,940 --> 00:37:56,958 rate dropping-- 800 00:37:56,958 --> 00:37:58,500 if we had enough confidence in that-- 801 00:37:58,500 --> 00:37:59,958 in distinguishing those two things, 802 00:37:59,958 --> 00:38:03,150 then we might not decide to raise that red flag. 803 00:38:03,150 --> 00:38:06,430 And that might reduce the amount of false alarms, 804 00:38:06,430 --> 00:38:09,150 and that then might reduce the amount of alarm fatigue. 
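To make the compounding-false-positives point concrete, under the simplifying assumption that the alarms fire independently with a common false positive rate \(\alpha\), the chance that at least one of \(k\) alarms is spurious is

\[ P(\text{at least one false alarm}) = 1 - (1 - \alpha)^k, \]

so with an illustrative \(\alpha = 0.05\) and \(k = 20\) independent alarm checks, that probability is already \(1 - 0.95^{20} \approx 0.64\).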
805 00:38:09,150 --> 00:38:11,850 And that could have a very big impact on health care. 806 00:38:15,980 --> 00:38:19,150 So the technique which we'll talk about today 807 00:38:19,150 --> 00:38:24,170 goes by the name of switching linear dynamical systems. 808 00:38:24,170 --> 00:38:25,820 Who here has seen a picture like this 809 00:38:25,820 --> 00:38:29,630 on-- this picture on the bottom before. 810 00:38:29,630 --> 00:38:32,173 About half of the room. 811 00:38:32,173 --> 00:38:33,590 So for the other half of the room, 812 00:38:33,590 --> 00:38:36,620 I'm going to give a bit of a recap 813 00:38:36,620 --> 00:38:38,960 into probabilistic modeling. 814 00:38:38,960 --> 00:38:43,830 All of you are now familiar with general probabilities. 815 00:38:43,830 --> 00:38:48,230 So you're used to thinking about, for example, 816 00:38:48,230 --> 00:38:51,230 univariate Gaussian distributions. 817 00:38:51,230 --> 00:38:54,050 We talked about how one could model survival, which 818 00:38:54,050 --> 00:38:57,440 was an example of such a distribution, 819 00:38:57,440 --> 00:38:59,088 but for today's lecture, we're going 820 00:38:59,088 --> 00:39:01,130 to be thinking now about multivariate probability 821 00:39:01,130 --> 00:39:01,820 distributions. 822 00:39:01,820 --> 00:39:05,870 In particular, we'll be thinking about how a patient's state-- 823 00:39:05,870 --> 00:39:08,120 let's say their true blood pressure-- 824 00:39:08,120 --> 00:39:09,990 evolves across time. 825 00:39:09,990 --> 00:39:14,570 And so now we're interested in not just the random variable 826 00:39:14,570 --> 00:39:16,740 at one point in time, but that same random variable 827 00:39:16,740 --> 00:39:18,782 at the second point in time, third point in time, 828 00:39:18,782 --> 00:39:21,488 fourth point in time, fifth point in time, and so on. 829 00:39:21,488 --> 00:39:23,030 So what I'm showing you here is known 830 00:39:23,030 --> 00:39:26,270 as a graphical model, also known as a Bayesian network. 831 00:39:26,270 --> 00:39:29,050 And it's one way of illustrating a multivariate probability 832 00:39:29,050 --> 00:39:31,460 distribution that has particular conditional independence 833 00:39:31,460 --> 00:39:33,490 properties. 834 00:39:33,490 --> 00:39:40,690 Specifically, in this model, one node 835 00:39:40,690 --> 00:39:42,260 corresponds to one random variable. 836 00:39:42,260 --> 00:39:46,840 So this is describing a joint distribution on x1 837 00:39:46,840 --> 00:39:55,117 through x6, y1 through y6. 838 00:39:55,117 --> 00:39:56,700 So it's this multivariate distribution 839 00:39:56,700 --> 00:40:00,570 on 12 random variables. 840 00:40:00,570 --> 00:40:03,600 The fact that this is shaded in simply 841 00:40:03,600 --> 00:40:07,110 denotes that, at test time, when we use these models, typically 842 00:40:07,110 --> 00:40:09,780 these y variables are observed. 843 00:40:09,780 --> 00:40:13,410 Whereas our goal is usually to infer the x variables. 844 00:40:13,410 --> 00:40:16,950 Those are typically unobserved, meaning that our typical task 845 00:40:16,950 --> 00:40:20,340 is one of doing posterior inference to infer 846 00:40:20,340 --> 00:40:22,725 the x's given the y's. 847 00:40:25,470 --> 00:40:28,860 Now, associated with this graph, I already 848 00:40:28,860 --> 00:40:31,740 told you the nodes correspond to random variables. 849 00:40:31,740 --> 00:40:36,330 The graph tells us how is this joint distribution factorized. 
850 00:40:36,330 --> 00:40:41,130 In particular, it's going to be factorized 851 00:40:41,130 --> 00:40:42,240 in the following way-- 852 00:40:42,240 --> 00:40:45,210 as the product over random variables 853 00:40:45,210 --> 00:40:49,000 of the probability of the i-th random variable. 854 00:40:49,000 --> 00:40:51,840 I'm going to use z to just denote a random variable. 855 00:40:51,840 --> 00:40:55,680 Think of z as the union of x and y. 856 00:40:55,680 --> 00:40:59,610 zi conditioned on the parents-- 857 00:40:59,610 --> 00:41:01,800 the values of the parents of zi. 858 00:41:05,820 --> 00:41:10,080 So I'm going to assume this factorization, 859 00:41:10,080 --> 00:41:13,800 and in particular for this graphical model, which 860 00:41:13,800 --> 00:41:15,870 goes by the name of a Markov model, 861 00:41:15,870 --> 00:41:18,810 it has a very specific factorization. 862 00:41:18,810 --> 00:41:22,180 And we're just going to read it off from this definition. 863 00:41:22,180 --> 00:41:26,340 So we're going to go in order-- first x1, then y1, 864 00:41:26,340 --> 00:41:28,410 then x2, then y2, and so on, which 865 00:41:28,410 --> 00:41:36,630 is going based on a root to children 866 00:41:36,630 --> 00:41:39,340 traversal of this graph. 867 00:41:39,340 --> 00:41:44,410 So the first random variable is x1. 868 00:41:44,410 --> 00:41:50,230 Second variable is y1, and what are the parents of y-- 869 00:41:50,230 --> 00:41:51,757 sorry, what are the parents of y1? 870 00:41:51,757 --> 00:41:52,840 Everyone can say out loud. 871 00:41:52,840 --> 00:41:54,070 AUDIENCE: x1. 872 00:41:54,070 --> 00:41:55,090 DAVID SONTAG: x1. 873 00:41:55,090 --> 00:42:01,450 So y1 in this factorization is only going to depend on x1. 874 00:42:01,450 --> 00:42:02,740 Next we have x2. 875 00:42:02,740 --> 00:42:03,940 What are the parents of x2? 876 00:42:03,940 --> 00:42:05,390 Everyone say out loud? 877 00:42:05,390 --> 00:42:06,370 AUDIENCE: x1. 878 00:42:06,370 --> 00:42:07,840 DAVID SONTAG: x1. 879 00:42:07,840 --> 00:42:09,790 Then we have y2. 880 00:42:09,790 --> 00:42:11,633 What are the parents of y2? 881 00:42:11,633 --> 00:42:12,550 Everyone say out loud. 882 00:42:12,550 --> 00:42:14,080 AUDIENCE: x2. 883 00:42:14,080 --> 00:42:16,960 DAVID SONTAG: x2 and so on. 884 00:42:16,960 --> 00:42:20,920 So this joint distribution is going 885 00:42:20,920 --> 00:42:23,560 to have a particularly simple form, which 886 00:42:23,560 --> 00:42:26,280 is given by this factorization shown here. 887 00:42:26,280 --> 00:42:28,420 And this factorization corresponds one to one 888 00:42:28,420 --> 00:42:32,400 with the particular graph in the way that I just told you. 889 00:42:32,400 --> 00:42:35,760 And in this way, we can define a very complex probability 890 00:42:35,760 --> 00:42:39,900 distribution by a number of much simpler conditional probability 891 00:42:39,900 --> 00:42:41,220 distributions. 892 00:42:41,220 --> 00:42:44,740 For example, if each of the random variables were binary, 893 00:42:44,740 --> 00:42:48,840 then to describe probability of y1 given x1, 894 00:42:48,840 --> 00:42:50,250 we only need two numbers. 895 00:42:50,250 --> 00:42:52,840 For each value of x1, either 0 or 1, 896 00:42:52,840 --> 00:42:55,290 we give the probability of y1 equals 1. 897 00:42:55,290 --> 00:42:59,530 And then, of course, the probability of y1 equals 0 is just 1 minus that.
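Written out, the factorization that this chain-structured graph encodes for the six time steps on the slide is

\[ p(x_{1:6}, y_{1:6}) \;=\; p(x_1)\, p(y_1 \mid x_1) \prod_{t=2}^{6} p(x_t \mid x_{t-1})\, p(y_t \mid x_t), \]

with each variable appearing exactly once, conditioned only on its parents in the graph.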
898 00:42:59,530 --> 00:43:02,290 So we can describe that very complicated joint distribution 899 00:43:02,290 --> 00:43:07,200 by a number of much smaller distributions. 900 00:43:07,200 --> 00:43:10,700 Now, the reason why I'm drawing it in this way 901 00:43:10,700 --> 00:43:13,940 is because we're making some really strong assumptions 902 00:43:13,940 --> 00:43:18,020 about the temporal dynamics in this problem. 903 00:43:18,020 --> 00:43:23,360 In particular, the fact that x3 only 904 00:43:23,360 --> 00:43:27,720 has an arrow from x2 and not from x1 905 00:43:27,720 --> 00:43:32,540 implies that x3 is conditionally independent of x1. 906 00:43:32,540 --> 00:43:34,400 If you knew x2's value. 907 00:43:34,400 --> 00:43:37,970 So in some sense, think about this as cutting. 908 00:43:37,970 --> 00:43:40,700 If you're to take x2 out of the model 909 00:43:40,700 --> 00:43:43,040 and remove all edges incident on it, 910 00:43:43,040 --> 00:43:46,490 then x1 and x3 are now separated from one another. 911 00:43:46,490 --> 00:43:48,110 They're independent. 912 00:43:48,110 --> 00:43:51,740 Now, for those of you who do know graphical models, 913 00:43:51,740 --> 00:43:54,770 you'll recognize that that type of independent statement that I 914 00:43:54,770 --> 00:43:56,480 made is only true for Markov models, 915 00:43:56,480 --> 00:43:58,605 and the semantics for Bayesian networks 916 00:43:58,605 --> 00:43:59,730 are a little bit different. 917 00:43:59,730 --> 00:44:02,058 But actually for this model, it's-- they're one 918 00:44:02,058 --> 00:44:02,600 and the same. 919 00:44:05,910 --> 00:44:08,990 So we're going to make the following assumptions 920 00:44:08,990 --> 00:44:12,890 for the conditional distributions shown here. 921 00:44:12,890 --> 00:44:16,850 First, we're going to suppose that xt is given to you 922 00:44:16,850 --> 00:44:19,490 by a Gaussian distribution. 923 00:44:19,490 --> 00:44:23,570 Remember xt-- t is denoting a time step. 924 00:44:23,570 --> 00:44:26,815 Let's say 3-- it only depends in this picture-- 925 00:44:26,815 --> 00:44:28,190 the conditional distribution only 926 00:44:28,190 --> 00:44:30,650 depends on the previous time step's value, x2, 927 00:44:30,650 --> 00:44:32,310 or xt minus 1. 928 00:44:32,310 --> 00:44:34,850 So you'll notice how I'm going to say here 929 00:44:34,850 --> 00:44:36,620 xt is going to distribute as something, 930 00:44:36,620 --> 00:44:38,690 but the only random variables in this something 931 00:44:38,690 --> 00:44:42,680 can be xt minus 1, according to these assumptions. 932 00:44:42,680 --> 00:44:44,180 In particular, we're going to assume 933 00:44:44,180 --> 00:44:47,930 that it's some Gaussian distribution, whose mean is 934 00:44:47,930 --> 00:44:51,020 some linear transformation of xt minus 1, 935 00:44:51,020 --> 00:44:55,240 and which has a fixed covariance matrix q. 936 00:44:55,240 --> 00:45:00,310 So at each step of this process, the next random variable 937 00:45:00,310 --> 00:45:03,700 is some random walk from the previous random variable 938 00:45:03,700 --> 00:45:07,833 where you're moving according to some Gaussian distribution. 939 00:45:07,833 --> 00:45:09,250 In a very similar way, we're going 940 00:45:09,250 --> 00:45:17,410 to assume that yt is drawn also as a Gaussian distribution, 941 00:45:17,410 --> 00:45:20,550 but now depending on xt. 942 00:45:20,550 --> 00:45:24,120 So I want you to think about xt as the true state 943 00:45:24,120 --> 00:45:25,410 of the patient. 
944 00:45:25,410 --> 00:45:28,590 It's a vector that's summarizing their blood 945 00:45:28,590 --> 00:45:31,200 pressure, their oxygen saturation, 946 00:45:31,200 --> 00:45:33,150 a whole bunch of other parameters, 947 00:45:33,150 --> 00:45:35,460 or maybe even just one of those. 948 00:45:35,460 --> 00:45:39,300 And y1 are the observations that you do observe. 949 00:45:39,300 --> 00:45:41,890 So let's say x1 is the patient's true blood pressure. 950 00:45:41,890 --> 00:45:43,980 y1 is the observed blood pressure, 951 00:45:43,980 --> 00:45:47,010 what comes from your monitor. 952 00:45:47,010 --> 00:45:48,660 So then a reasonable assumption would 953 00:45:48,660 --> 00:45:52,350 be that, well, if all this were equal, 954 00:45:52,350 --> 00:45:53,910 if it was a true observation, then 955 00:45:53,910 --> 00:45:55,750 y1 should be very close to x1. 956 00:45:55,750 --> 00:45:58,680 So you might assume that this covariance matrix is-- 957 00:45:58,680 --> 00:46:01,460 the covariance is-- the variance is very, very small. 958 00:46:01,460 --> 00:46:07,280 y1 should be very close to x1 if it's a good observation. 959 00:46:07,280 --> 00:46:10,100 And of course, if it's a noisy observation-- 960 00:46:10,100 --> 00:46:15,680 like, for example, if the probe was disconnected from the baby, 961 00:46:15,680 --> 00:46:19,790 then y1 should have no relationship to x1. 962 00:46:19,790 --> 00:46:23,460 And that dependence on the actual state of the world 963 00:46:23,460 --> 00:46:26,730 I'm denoting here by these superscripts, s of t. 964 00:46:26,730 --> 00:46:28,730 I'm ignoring that right now, and I'll bring that 965 00:46:28,730 --> 00:46:31,910 in in the next slide. 966 00:46:31,910 --> 00:46:36,230 Similarly, the relationship between x2 and x1 967 00:46:36,230 --> 00:46:38,510 should be one which captures some of the dynamics 968 00:46:38,510 --> 00:46:42,140 that I showed in the previous slides, where I showed over 969 00:46:42,140 --> 00:46:46,040 here now this is the patient's true heart rate evolving 970 00:46:46,040 --> 00:46:48,080 across time, let's say. 971 00:46:48,080 --> 00:46:51,800 Notice how, if you look very locally, 972 00:46:51,800 --> 00:46:56,720 it looks like there are some very, very big local dynamics. 973 00:46:56,720 --> 00:46:58,790 Whereas if you look more globally, 974 00:46:58,790 --> 00:47:01,340 again, there's some smoothness, but there are some-- again, 975 00:47:01,340 --> 00:47:03,590 it looks like some random changes across time. 976 00:47:03,590 --> 00:47:10,070 And so those-- that drift has to somehow 977 00:47:10,070 --> 00:47:13,550 be summarized in this model by that A random variable. 978 00:47:13,550 --> 00:47:16,130 And I'll get into more detail about that in just a moment. 979 00:47:18,750 --> 00:47:20,990 So what I just showed you was an example 980 00:47:20,990 --> 00:47:23,360 of a linear dynamical system, but it 981 00:47:23,360 --> 00:47:27,170 was assuming that there were none of these events happening, 982 00:47:27,170 --> 00:47:30,082 none of these artifacts happening. 983 00:47:30,082 --> 00:47:31,540 The actual model that we were going 984 00:47:31,540 --> 00:47:33,040 to want to be able to use then is 985 00:47:33,040 --> 00:47:34,330 going to also incorporate the fact 986 00:47:34,330 --> 00:47:35,320 that there might be artifacts. 
987 00:47:35,320 --> 00:47:36,640 And to model that, we need to introduce 988 00:47:36,640 --> 00:47:38,473 additional random variables corresponding to 989 00:47:38,473 --> 00:47:40,250 whether those artifacts occurred or not. 990 00:47:40,250 --> 00:47:42,290 And so that's now this model. 991 00:47:42,290 --> 00:47:45,370 So I'm going to let these S's-- 992 00:47:45,370 --> 00:47:47,850 these are other random variables, 993 00:47:47,850 --> 00:47:51,310 which are denoting artifactual events. 994 00:47:51,310 --> 00:47:52,970 They are also evolving with time. 995 00:47:52,970 --> 00:47:55,420 For example, if there's an artifactual event 996 00:47:55,420 --> 00:47:57,875 at three seconds, maybe there's also an artifactual event 997 00:47:57,875 --> 00:47:58,720 at four seconds. 998 00:47:58,720 --> 00:48:00,887 And we'd like to model the relationship between those. 999 00:48:00,887 --> 00:48:02,600 That's why you have these arrows. 1000 00:48:02,600 --> 00:48:08,180 And then the way that we interpret the observations 1001 00:48:08,180 --> 00:48:12,620 that we do get depends on both the true value 1002 00:48:12,620 --> 00:48:14,340 of what's going on with the patient 1003 00:48:14,340 --> 00:48:17,612 and whether there was an artifactual event or not. 1004 00:48:17,612 --> 00:48:19,070 And you'll notice that there's also 1005 00:48:19,070 --> 00:48:20,780 an edge going from the artifactual events 1006 00:48:20,780 --> 00:48:23,270 to the true values to note the fact 1007 00:48:23,270 --> 00:48:27,680 that those interventions might actually 1008 00:48:27,680 --> 00:48:29,030 be affecting the patient. 1009 00:48:29,030 --> 00:48:31,040 For example, if you give them a medication 1010 00:48:31,040 --> 00:48:36,800 to change their blood pressure, then that procedure 1011 00:48:36,800 --> 00:48:39,895 is going to affect the next time step's value of the patient's 1012 00:48:39,895 --> 00:48:40,520 blood pressure. 1013 00:48:44,360 --> 00:48:47,917 So when one wants to learn this model, 1014 00:48:47,917 --> 00:48:49,750 you have to ask yourself, what types of data 1015 00:48:49,750 --> 00:48:51,167 do you have available? 1016 00:48:54,370 --> 00:48:59,680 Unfortunately, it's very hard to get data on both the ground 1017 00:48:59,680 --> 00:49:02,210 truth, what's going on with the patient, 1018 00:49:02,210 --> 00:49:06,530 and whether these artifacts truly occurred or not. 1019 00:49:06,530 --> 00:49:09,530 Instead, what we actually have are just these observations. 1020 00:49:09,530 --> 00:49:13,450 We get these very noisy blood pressure draws across time. 1021 00:49:13,450 --> 00:49:16,500 So what this paper does is it uses a maximum likelihood 1022 00:49:16,500 --> 00:49:18,797 estimation approach, where it recognizes 1023 00:49:18,797 --> 00:49:20,880 that we're going to be learning from missing data. 1024 00:49:20,880 --> 00:49:23,940 We're going to explicitly think of these x's and the s's 1025 00:49:23,940 --> 00:49:25,875 as latent variables. 1026 00:49:25,875 --> 00:49:27,990 And we're going to maximize the likelihood 1027 00:49:27,990 --> 00:49:31,820 of the whole entire model, marginalizing over x and s. 1028 00:49:31,820 --> 00:49:34,485 So just maximizing the marginal likelihood over the y's. 1029 00:49:37,240 --> 00:49:39,740 Now, for those of you who have studied unsupervised learning 1030 00:49:39,740 --> 00:49:43,570 before, you might recognize that as a very hard learning 1031 00:49:43,570 --> 00:49:44,070 problem.
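Before getting into why that learning problem is hard, it may help to see the generative story written down concretely. Here is a minimal simulation sketch of a switching linear dynamical system in Python; all parameter values (the artifact-state transition probabilities, the drift, the noise scales) are made up for illustration and are not taken from the paper, and the single on/off artifact switch is a simplification of the paper's richer set of artifact types.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 200
# Discrete artifact state s_t: 0 = normal, 1 = artifact (e.g., probe disconnected).
# Illustrative Markov transition probabilities (not from the paper).
P_s = np.array([[0.97, 0.03],
                [0.20, 0.80]])

a, q = 1.0, 0.5             # x_t = a * x_{t-1} + noise: a slowly drifting random walk
r_good, r_bad = 0.2, 5.0    # observation noise scale when normal vs. during an artifact

s = np.zeros(T, dtype=int)
x = np.zeros(T)
y = np.zeros(T)
x[0] = 80.0                 # e.g., a "true" heart rate in beats per minute
y[0] = x[0] + rng.normal(0, r_good)

for t in range(1, T):
    s[t] = rng.choice(2, p=P_s[s[t - 1]])       # artifact state evolves as a Markov chain
    x[t] = a * x[t - 1] + rng.normal(0, q)      # true state: Gaussian random walk
    if s[t] == 0:
        y[t] = x[t] + rng.normal(0, r_good)     # good reading: y_t stays close to x_t
    else:
        y[t] = rng.normal(0, r_bad)             # artifact: reading unrelated to x_t

# Only y is observed at training time; EM has to maximize the marginal likelihood
# of y while treating x and s as latent variables to be inferred.
```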
1032 00:49:44,070 --> 00:49:47,780 In fact, it's-- that likelihood is non-convex. 1033 00:49:47,780 --> 00:49:51,990 And one could imagine all sorts of a heuristics for learning, 1034 00:49:51,990 --> 00:49:55,460 such as gradient descent, or, as this paper uses, 1035 00:49:55,460 --> 00:49:59,180 expectation maximization, and because of that non-convexity, 1036 00:49:59,180 --> 00:50:00,750 each of these algorithms typically 1037 00:50:00,750 --> 00:50:04,040 will only reach a local maxima of the likelihood. 1038 00:50:04,040 --> 00:50:08,420 So this paper uses EM, which intuitively iterates 1039 00:50:08,420 --> 00:50:14,420 between inferring those missing variables-- so imputing the x's 1040 00:50:14,420 --> 00:50:17,210 and the s's given the current model, 1041 00:50:17,210 --> 00:50:20,300 and doing posterior inference to infer the missing 1042 00:50:20,300 --> 00:50:22,760 variables given the observed variables, using 1043 00:50:22,760 --> 00:50:24,140 the current model. 1044 00:50:24,140 --> 00:50:27,020 And then, once you've imputed those variables, 1045 00:50:27,020 --> 00:50:28,910 attempting to refit the model. 1046 00:50:28,910 --> 00:50:30,920 So that's called the m-step for maximization, 1047 00:50:30,920 --> 00:50:32,900 which updates the model and just iterates between those two 1048 00:50:32,900 --> 00:50:33,400 things. 1049 00:50:33,400 --> 00:50:36,590 That's one learning algorithm which 1050 00:50:36,590 --> 00:50:39,650 is guaranteed to reach a local maxima of the likelihood 1051 00:50:39,650 --> 00:50:42,830 under some regularity assumptions. 1052 00:50:42,830 --> 00:50:44,690 And so this paper uses that algorithm, 1053 00:50:44,690 --> 00:50:46,520 but you need to be asking yourself, 1054 00:50:46,520 --> 00:50:50,270 if all you ever observe are the y's, 1055 00:50:50,270 --> 00:50:54,830 then will this algorithm ever recover anything 1056 00:50:54,830 --> 00:50:56,600 close to the true model? 1057 00:50:56,600 --> 00:50:58,310 For example, there might be large amounts 1058 00:50:58,310 --> 00:51:00,080 of non-identifiability here. 1059 00:51:00,080 --> 00:51:04,490 It could be that you could swap the meaning 1060 00:51:04,490 --> 00:51:10,170 of the s's, and you'd get a similar likelihood on the y's. 1061 00:51:10,170 --> 00:51:14,010 That's where bringing in domain knowledge becomes critical. 1062 00:51:14,010 --> 00:51:17,670 So this is going to be an example where we have no label 1063 00:51:17,670 --> 00:51:22,948 data or very little label data. 1064 00:51:22,948 --> 00:51:24,740 And we're going to do unsupervised learning 1065 00:51:24,740 --> 00:51:26,282 of this model, but we're going to use 1066 00:51:26,282 --> 00:51:28,790 a ton of domain knowledge in order to constrain 1067 00:51:28,790 --> 00:51:31,050 the model as much as possible. 1068 00:51:31,050 --> 00:51:33,490 So what is that domain knowledge? 1069 00:51:33,490 --> 00:51:37,730 Well, first we're going to use the fact 1070 00:51:37,730 --> 00:51:47,200 that we know that a true heart rate evolves in a fashion that 1071 00:51:47,200 --> 00:51:53,530 can be very well modeled by an autoregressive process. 1072 00:51:53,530 --> 00:51:56,260 So the autoregressive process that's used in this paper 1073 00:51:56,260 --> 00:51:58,630 is used to model the normal heart rate dynamics. 1074 00:51:58,630 --> 00:52:01,060 In a moment, I'll tell you how to model the abnormal heart 1075 00:52:01,060 --> 00:52:03,370 rate observations. 
1076 00:52:03,370 --> 00:52:05,530 And intuitively-- I'll first go over the intuition, 1077 00:52:05,530 --> 00:52:06,850 then I'll give you the math. 1078 00:52:06,850 --> 00:52:08,650 Intuitively what it does is it recognizes 1079 00:52:08,650 --> 00:52:14,060 that this complicated signal can be decomposed into two pieces. 1080 00:52:14,060 --> 00:52:18,020 The first piece shown here is called a baseline signal, 1081 00:52:18,020 --> 00:52:20,315 and that, if you squint your eyes 1082 00:52:20,315 --> 00:52:22,700 and you sort or ignore the very local fluctuations, 1083 00:52:22,700 --> 00:52:24,860 this is what you get out. 1084 00:52:24,860 --> 00:52:27,230 And then you can look at the residual 1085 00:52:27,230 --> 00:52:32,330 of subtracting this signal, subtracting this baseline 1086 00:52:32,330 --> 00:52:33,710 from the signal. 1087 00:52:33,710 --> 00:52:36,250 And what you get out looks like this. 1088 00:52:36,250 --> 00:52:39,770 Notice here it's around 0 mean. 1089 00:52:39,770 --> 00:52:42,585 So it's a 0 mean signal with some random fluctuations, 1090 00:52:42,585 --> 00:52:44,210 and the fluctuations are happening here 1091 00:52:44,210 --> 00:52:47,210 at a much faster rate than-- 1092 00:52:47,210 --> 00:52:49,830 and for the original baseline. 1093 00:52:49,830 --> 00:52:56,910 And so the sum of bt and this residual is a very-- 1094 00:52:56,910 --> 00:53:00,200 it looks-- is exactly equal to the true heart rate. 1095 00:53:00,200 --> 00:53:03,290 And each of these two things we can model very well. 1096 00:53:03,290 --> 00:53:08,210 This we can model by a random walk with-- 1097 00:53:08,210 --> 00:53:10,970 which goes very slowly, and this we 1098 00:53:10,970 --> 00:53:15,297 can model by a random walk which goes very quickly. 1099 00:53:15,297 --> 00:53:17,630 And that is exactly what I'm now going to show over here 1100 00:53:17,630 --> 00:53:19,180 on the left hand side. 1101 00:53:19,180 --> 00:53:22,880 bt, this baseline signal, we're going 1102 00:53:22,880 --> 00:53:26,540 to model as a Gaussian distribution, which 1103 00:53:26,540 --> 00:53:29,600 is parameterized as a function of not just bt minus 1, 1104 00:53:29,600 --> 00:53:32,480 but also bt minus 2, and bt minus 3. 1105 00:53:32,480 --> 00:53:34,940 And so we're going to be taking a weighted average 1106 00:53:34,940 --> 00:53:39,560 of the previous few time steps, where we're smoothing out, 1107 00:53:39,560 --> 00:53:45,220 in essence, the observation-- the previous few observations. 1108 00:53:45,220 --> 00:53:47,970 If you were to-- 1109 00:53:47,970 --> 00:53:50,310 if you're being a keen observer, you'll 1110 00:53:50,310 --> 00:53:53,790 notice that this is no longer a Markov model. 1111 00:54:04,870 --> 00:54:11,460 For example, if this p1 and p2 are equal to 2, 1112 00:54:11,460 --> 00:54:14,790 this then corresponds to a second order Markov model, 1113 00:54:14,790 --> 00:54:18,600 because each random variable depends on the previous two 1114 00:54:18,600 --> 00:54:24,530 time steps of the Markov chain. 1115 00:54:24,530 --> 00:54:31,790 And so after-- so you would model now bt by this process, 1116 00:54:31,790 --> 00:54:34,880 and you would probably be averaging 1117 00:54:34,880 --> 00:54:36,920 over a large number of previous time steps 1118 00:54:36,920 --> 00:54:39,020 to get this smooth property. 
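In symbols, a generic form for the baseline piece just described, with the weights and noise level left abstract, is the autoregressive (higher-order Markov) model

\[ b_t \;\sim\; \mathcal{N}\!\Big(\sum_{k=1}^{p_1} \beta_k\, b_{t-k},\; \sigma_b^2\Big), \]

and the fast-moving residual \(x_t - b_t\) gets an analogous autoregressive model with its own lag \(p_2\), as described next.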
1119 00:54:39,020 --> 00:54:45,620 And then you'd model xt minus bt by this autoregressive process, 1120 00:54:45,620 --> 00:54:47,780 where you might, for example, just 1121 00:54:47,780 --> 00:54:50,313 be looking at just the previous couple of time steps. 1122 00:54:50,313 --> 00:54:51,980 And you recognize that you're just doing 1123 00:54:51,980 --> 00:54:55,600 much more random fluctuations. 1124 00:54:55,600 --> 00:54:59,480 And then-- so that's how one would now model normal heart 1125 00:54:59,480 --> 00:55:00,650 rate dynamics. 1126 00:55:00,650 --> 00:55:02,900 And again, it's just-- 1127 00:55:02,900 --> 00:55:04,730 this is an example of a statistical model. 1128 00:55:04,730 --> 00:55:06,110 There is no mechanistic knowledge 1129 00:55:06,110 --> 00:55:08,540 of hearts being used here, but we 1130 00:55:08,540 --> 00:55:13,710 can fit the data of normal hearts pretty well using this. 1131 00:55:13,710 --> 00:55:15,960 But the next question and the most interesting one 1132 00:55:15,960 --> 00:55:20,510 is, how does one now model artifactual events? 1133 00:55:20,510 --> 00:55:26,120 So for that, that's where some mechanistic knowledge comes in. 1134 00:55:26,120 --> 00:55:30,180 So one models that the probe dropouts 1135 00:55:30,180 --> 00:55:35,120 are given by recognizing that, if a probe 1136 00:55:35,120 --> 00:55:39,020 is removed from the baby, then there should no longer be-- 1137 00:55:39,020 --> 00:55:41,253 or at least if you-- after a small amount of time, 1138 00:55:41,253 --> 00:55:42,920 there should no longer be any dependence 1139 00:55:42,920 --> 00:55:44,450 on the true value of the baby. 1140 00:55:44,450 --> 00:55:48,080 For example, the blood pressure, once the blood pressure probe 1141 00:55:48,080 --> 00:55:50,870 is removed, is no longer related to the baby's true blood 1142 00:55:50,870 --> 00:55:52,910 pressure. 1143 00:55:52,910 --> 00:55:57,130 But there might be some delay to that lack of dependence. 1144 00:55:57,130 --> 00:55:59,450 And so-- and that is going to be encoded in some domain 1145 00:55:59,450 --> 00:55:59,950 knowledge. 1146 00:55:59,950 --> 00:56:01,840 So for example, in the temperature probe, 1147 00:56:01,840 --> 00:56:04,480 when you remove the temperature probe from the baby, 1148 00:56:04,480 --> 00:56:07,682 it starts heating up again-- or it starts cooling, so 1149 00:56:07,682 --> 00:56:09,640 assuming that the ambient temperature is cooler 1150 00:56:09,640 --> 00:56:11,280 than the baby's temperature. 1151 00:56:11,280 --> 00:56:12,790 So you take it off the baby. 1152 00:56:12,790 --> 00:56:14,170 It starts cooling down. 1153 00:56:14,170 --> 00:56:15,692 How fast does it cool down? 1154 00:56:15,692 --> 00:56:17,400 Well, you could assume that it cools down 1155 00:56:17,400 --> 00:56:20,320 with some exponential decay from the baby's temperature. 1156 00:56:20,320 --> 00:56:22,750 And this is something that is very reasonable, 1157 00:56:22,750 --> 00:56:24,490 and you could imagine, maybe if you 1158 00:56:24,490 --> 00:56:26,530 had label data for just a few of the babies, 1159 00:56:26,530 --> 00:56:28,780 you could try to fit the parameters of the exponential 1160 00:56:28,780 --> 00:56:30,840 very quickly. 
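One natural way to write down that exponential decay, with the ambient temperature and the time constant as assumed symbols rather than anything taken from the paper, is: after the probe comes off at time \(t_0\),

\[ \mathbb{E}\big[y_t \mid \text{probe off at } t_0\big] \;=\; T_{\text{ambient}} + \big(x_{t_0} - T_{\text{ambient}}\big)\, e^{-(t - t_0)/\tau}, \]

where \(\tau\) is the decay constant one could fit from a handful of labeled traces.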
1161 00:56:30,840 --> 00:56:33,160 And in this way, now, we parameterize the conditional 1162 00:56:33,160 --> 00:56:39,040 distribution of the temperature probe, given both the state 1163 00:56:39,040 --> 00:56:42,220 and whether the artifact occurred or not, 1164 00:56:42,220 --> 00:56:45,710 using this very simple exponential decay. 1165 00:56:45,710 --> 00:56:49,957 And in this paper, they give a very similar type of-- 1166 00:56:49,957 --> 00:56:51,790 they make similar types of-- analogous types 1167 00:56:51,790 --> 00:56:54,588 of assumptions for all of the other artifactual probes. 1168 00:56:54,588 --> 00:56:56,380 You should think about this as constraining 1169 00:56:56,380 --> 00:56:58,757 these conditional distributions I showed you here. 1170 00:56:58,757 --> 00:57:01,090 They're no longer allowed to be arbitrary distributions, 1171 00:57:01,090 --> 00:57:03,910 and so that, when one does now expectation maximization 1172 00:57:03,910 --> 00:57:06,573 to try to maximize the marginal likelihood of the data, 1173 00:57:06,573 --> 00:57:07,990 you've now constrained it in a way 1174 00:57:07,990 --> 00:57:10,073 that hopefully moves you toward identifiability 1175 00:57:10,073 --> 00:57:11,310 of the learning problem. 1176 00:57:11,310 --> 00:57:13,330 It makes all of the difference in learning here. 1177 00:57:18,130 --> 00:57:21,730 So in this paper, their evaluation 1178 00:57:21,730 --> 00:57:23,830 did a little bit of fine tuning for each baby. 1179 00:57:23,830 --> 00:57:26,650 In particular, they assumed that the first 30 minutes 1180 00:57:26,650 --> 00:57:31,150 near the start consist of normal dynamics 1181 00:57:31,150 --> 00:57:33,190 so that there are no artifacts. 1182 00:57:33,190 --> 00:57:34,750 That's, of course, a big assumption, 1183 00:57:34,750 --> 00:57:39,100 but they use that to try to fine tune the dynamics model 1184 00:57:39,100 --> 00:57:43,540 separately for each baby. 1185 00:57:43,540 --> 00:57:45,070 And then they looked at the ability 1186 00:57:45,070 --> 00:57:47,357 to try to identify artifactual processes. 1187 00:57:47,357 --> 00:57:49,690 Now, I want to go a little bit slowly through this plot, 1188 00:57:49,690 --> 00:57:52,350 because it's quite interesting. 1189 00:57:52,350 --> 00:57:57,990 So what I'm showing you here is a ROC curve 1190 00:57:57,990 --> 00:58:00,292 of the ability to predict each of the four 1191 00:58:00,292 --> 00:58:01,500 different types of artifacts. 1192 00:58:01,500 --> 00:58:03,810 For example, at any one point in time, 1193 00:58:03,810 --> 00:58:05,990 was there a blood sample being taken or not? 1194 00:58:05,990 --> 00:58:07,890 At any one point in time, was there 1195 00:58:07,890 --> 00:58:12,270 a disconnect of the core temperature probe? 1196 00:58:12,270 --> 00:58:13,770 And to evaluate it, they're assuming 1197 00:58:13,770 --> 00:58:18,850 that they have some label data for evaluation purposes only. 1198 00:58:18,850 --> 00:58:22,110 And of course, you want to be at the very far top left corner 1199 00:58:22,110 --> 00:58:23,866 up here. 1200 00:58:23,866 --> 00:58:27,820 And what we're showing here are three different curves-- 1201 00:58:27,820 --> 00:58:31,120 the very faint dotted line, which 1202 00:58:31,120 --> 00:58:34,780 I'm going to trace out with my cursor, is the baseline. 1203 00:58:34,780 --> 00:58:39,068 Think of that as a much worse algorithm. 1204 00:58:41,640 --> 00:58:42,140 Sorry.
1205 00:58:42,140 --> 00:58:44,523 That's that line over there. 1206 00:58:44,523 --> 00:58:45,190 Everyone see it? 1207 00:58:49,030 --> 00:58:52,110 And this approach corresponds to the other two lines. 1208 00:58:52,110 --> 00:58:54,800 Now, what's differentiating those other two lines 1209 00:58:54,800 --> 00:58:57,940 corresponds to the particular type of approximate inference 1210 00:58:57,940 --> 00:59:00,120 algorithm that's used. 1211 00:59:00,120 --> 00:59:05,640 To do this posterior inference, to infer 1212 00:59:05,640 --> 00:59:10,290 the true value of the x's given your noisy observations 1213 00:59:10,290 --> 00:59:14,160 in the model given here is actually a very hard inference 1214 00:59:14,160 --> 00:59:15,920 problem. 1215 00:59:15,920 --> 00:59:18,330 Mathematically, I think one can show 1216 00:59:18,330 --> 00:59:21,692 that it's an NP-hard computational problem. 1217 00:59:21,692 --> 00:59:23,650 And so they have to approximate it in some way, 1218 00:59:23,650 --> 00:59:26,010 and they use two different approximations here. 1219 00:59:26,010 --> 00:59:28,400 The first approximation is based on what they're 1220 00:59:28,400 --> 00:59:31,110 calling a Gaussian sum approximation, 1221 00:59:31,110 --> 00:59:33,420 and it's a deterministic approximation. 1222 00:59:33,420 --> 00:59:37,240 The second approximation is based on a Monte Carlo method. 1223 00:59:37,240 --> 00:59:40,290 And what you see here is that the Gaussian sum approximation 1224 00:59:40,290 --> 00:59:41,970 is actually dramatically better. 1225 00:59:41,970 --> 00:59:43,920 So for example, in this blood sample one, 1226 00:59:43,920 --> 00:59:48,750 the ROC curve looks like this for the Gaussian sum 1227 00:59:48,750 --> 00:59:49,640 approximation. 1228 00:59:49,640 --> 00:59:51,390 Whereas for the Monte Carlo approximation, 1229 00:59:51,390 --> 00:59:54,510 it's actually significantly lower. 1230 00:59:54,510 --> 00:59:56,400 And this is just to point out that, even 1231 00:59:56,400 --> 01:00:03,660 in this setting, where we have very little data 1232 01:00:03,660 --> 01:00:06,780 and we're using a lot of domain knowledge, the actual details 1233 01:00:06,780 --> 01:00:09,053 of how one does the math-- in particular, 1234 01:00:09,053 --> 01:00:10,470 the approximate inference-- can make 1235 01:00:10,470 --> 01:00:13,047 a really big difference in the performance of this system. 1236 01:00:13,047 --> 01:00:14,880 And so it's something that one should really 1237 01:00:14,880 --> 01:00:16,047 think deeply about, as well. 1238 01:00:18,666 --> 01:00:21,700 I'm going to skip that slide, and then just mention 1239 01:00:21,700 --> 01:00:23,170 very briefly this one. 1240 01:00:23,170 --> 01:00:28,640 This is showing an inference of the events. 1241 01:00:28,640 --> 01:00:34,600 So here I'm showing you three different observations. 1242 01:00:34,600 --> 01:00:39,130 And on the bottom here, I'm showing the prediction 1243 01:00:39,130 --> 01:00:43,950 of when artifact-- two different artifactual events happened. 1244 01:00:43,950 --> 01:00:46,020 And these predictions were actually quite good, 1245 01:00:46,020 --> 01:00:48,180 using this model. 1246 01:00:48,180 --> 01:00:52,210 So I'm done with that first example, and-- 1247 01:00:52,210 --> 01:00:55,380 just to recap the important points 1248 01:00:55,380 --> 01:01:01,300 of that example, it was that we had almost no label data.
1249 01:01:01,300 --> 01:01:05,470 We're tackling this problem using a cleverly chosen 1250 01:01:05,470 --> 01:01:08,780 statistical model with some domain knowledge built in, 1251 01:01:08,780 --> 01:01:12,040 and that can go really far. 1252 01:01:12,040 --> 01:01:14,500 So now we'll shift gears to talk about a different type 1253 01:01:14,500 --> 01:01:18,340 of problem involving physiological data, 1254 01:01:18,340 --> 01:01:22,570 and that's of detecting atrial fibrillation. 1255 01:01:22,570 --> 01:01:26,280 So what I'm showing you here is an AliveCore device. 1256 01:01:26,280 --> 01:01:27,850 I own one of these. 1257 01:01:27,850 --> 01:01:30,540 So if you want to drop by my E25 545 office, 1258 01:01:30,540 --> 01:01:32,860 you can-- you can play around with it. 1259 01:01:32,860 --> 01:01:35,930 And if you attach it to your mobile phone, 1260 01:01:35,930 --> 01:01:43,800 it'll show you your electric conductance through your heart 1261 01:01:43,800 --> 01:01:46,710 as measured through your two fingers 1262 01:01:46,710 --> 01:01:48,670 touching this device shown over here. 1263 01:01:48,670 --> 01:01:51,270 And from that, one can try to detect whether the patient has 1264 01:01:51,270 --> 01:01:52,990 atrial fibrillation. 1265 01:01:52,990 --> 01:01:54,941 So what is atrial fibrillation? 1266 01:01:58,617 --> 01:01:59,200 Good question. 1267 01:01:59,200 --> 01:02:00,284 It's [INAUDIBLE]. 1268 01:02:04,240 --> 01:02:10,270 So this is from the American Heart Association. 1269 01:02:10,270 --> 01:02:13,810 They defined atrial fibrillation as a quivering or irregular 1270 01:02:13,810 --> 01:02:16,450 heartbeat, also known as arrhythmia. 1271 01:02:16,450 --> 01:02:18,220 And one of the big challenges is that it 1272 01:02:18,220 --> 01:02:21,030 could lead to blood clot, stroke, heart failure, and so 1273 01:02:21,030 --> 01:02:21,530 on. 1274 01:02:21,530 --> 01:02:23,980 So here is how a patient might describe 1275 01:02:23,980 --> 01:02:26,020 having atrial fibrillation. 1276 01:02:26,020 --> 01:02:28,180 My heart flip-flops, skips beats, 1277 01:02:28,180 --> 01:02:31,150 feels like it's banging against my chest wall, 1278 01:02:31,150 --> 01:02:33,790 particularly when I'm carrying stuff up my stairs 1279 01:02:33,790 --> 01:02:35,542 or bending down. 1280 01:02:35,542 --> 01:02:37,250 Now let's try to look at a picture of it. 1281 01:02:48,040 --> 01:02:55,330 So this is a normal heartbeat. 1282 01:02:55,330 --> 01:02:59,860 Hearts move-- pumping like this. 1283 01:02:59,860 --> 01:03:03,130 And if you were to look at the signal 1284 01:03:03,130 --> 01:03:04,810 output of the EKG of a normal heartbeat, 1285 01:03:04,810 --> 01:03:05,620 it would look like this. 1286 01:03:05,620 --> 01:03:07,735 And it's roughly corresponding to the different-- 1287 01:03:07,735 --> 01:03:09,840 the signal is corresponding to different cycles 1288 01:03:09,840 --> 01:03:12,420 of the heartbeat. 1289 01:03:12,420 --> 01:03:15,000 Now for a patient who has atrial fibrillation, 1290 01:03:15,000 --> 01:03:16,290 it looks more like this. 1291 01:03:21,650 --> 01:03:25,677 So much more obviously abnormal, at least in this figure. 1292 01:03:25,677 --> 01:03:27,510 And if you look at the corresponding signal, 1293 01:03:27,510 --> 01:03:29,382 it also looks very different. 1294 01:03:29,382 --> 01:03:31,590 So this is just to give you some intuition about what 1295 01:03:31,590 --> 01:03:33,577 I mean by atrial fibrillation. 
1296 01:03:36,990 --> 01:03:39,930 So what we're going to try to do now is to detect it. 1297 01:03:39,930 --> 01:03:44,090 So we're going to take data like that 1298 01:03:44,090 --> 01:03:48,580 and try to classify it into a number of different categories. 1299 01:03:48,580 --> 01:03:52,630 Now this is something which has been studied for decades, 1300 01:03:52,630 --> 01:03:57,430 and last year, 2017, there was a competition 1301 01:03:57,430 --> 01:04:01,450 run by Professor Roger Mark, who is here 1302 01:04:01,450 --> 01:04:04,390 at MIT, which is trying to see, well, how could-- 1303 01:04:04,390 --> 01:04:06,460 how good are we at trying to figure out 1304 01:04:06,460 --> 01:04:09,940 which patients have different types of heart rhythms 1305 01:04:09,940 --> 01:04:11,780 based on data that looks like this? 1306 01:04:11,780 --> 01:04:13,300 So this is a normal rhythm, which 1307 01:04:13,300 --> 01:04:16,700 is also called a sinus rhythm. 1308 01:04:16,700 --> 01:04:18,750 And over here it's atrial-- 1309 01:04:18,750 --> 01:04:22,120 this is an example one patient who has atrial fibrillation. 1310 01:04:22,120 --> 01:04:25,200 This is another type of rhythm that's not atrial fibrillation, 1311 01:04:25,200 --> 01:04:26,590 but is abnormal. 1312 01:04:26,590 --> 01:04:29,670 And this is a noisy recording-- for example, if a patient's-- 1313 01:04:29,670 --> 01:04:32,220 doesn't really have their two fingers very well put 1314 01:04:32,220 --> 01:04:35,180 on to the two leads of the device. 1315 01:04:35,180 --> 01:04:41,040 So given one of these categories, can we predict-- 1316 01:04:41,040 --> 01:04:42,760 one of these signals, could predict 1317 01:04:42,760 --> 01:04:45,355 which category it came from? 1318 01:04:45,355 --> 01:04:47,230 So if you looked at this, you might recognize 1319 01:04:47,230 --> 01:04:48,970 that they look a bit different. 1320 01:04:48,970 --> 01:04:53,380 So could some of you guess what might 1321 01:04:53,380 --> 01:04:55,780 be predictive features that differentiate 1322 01:04:55,780 --> 01:04:59,440 one of these signals from the other? 1323 01:04:59,440 --> 01:05:00,303 In the back? 1324 01:05:00,303 --> 01:05:01,720 AUDIENCE: The presence and absence 1325 01:05:01,720 --> 01:05:07,065 of one of the peaks the QRS complex are [INAUDIBLE].. 1326 01:05:07,065 --> 01:05:08,440 DAVID SONTAG: So speak in English 1327 01:05:08,440 --> 01:05:10,722 for people who don't know what these terms mean. 1328 01:05:10,722 --> 01:05:12,680 AUDIENCE: There is one large piece, which can-- 1329 01:05:12,680 --> 01:05:16,730 probably we can consider one mV and there is another peak, 1330 01:05:16,730 --> 01:05:18,520 which is sort of like-- 1331 01:05:18,520 --> 01:05:20,630 they have reverse polarity between normal rhythm 1332 01:05:20,630 --> 01:05:21,310 and [INAUDIBLE]. 1333 01:05:21,310 --> 01:05:22,102 DAVID SONTAG: Good. 1334 01:05:22,102 --> 01:05:23,820 So are you a cardiologist? 1335 01:05:23,820 --> 01:05:24,710 AUDIENCE: No. 1336 01:05:24,710 --> 01:05:26,440 DAVID SONTAG: No, OK. 1337 01:05:26,440 --> 01:05:29,050 So what the student suggested is one 1338 01:05:29,050 --> 01:05:31,660 could look for sort of these inversions 1339 01:05:31,660 --> 01:05:34,670 to try to describe it a little bit differently. 1340 01:05:34,670 --> 01:05:41,290 So here you're suggesting the lack of those inversions 1341 01:05:41,290 --> 01:05:45,430 is predictive of an abnormal rhythm. 1342 01:05:45,430 --> 01:05:47,655 What about another feature that could be predictive? 
1343 01:05:47,655 --> 01:05:48,155 Yep? 1344 01:05:48,155 --> 01:05:49,840 AUDIENCE: The spacing between the peaks 1345 01:05:49,840 --> 01:05:52,030 is more irregular with the AF. 1346 01:05:52,030 --> 01:05:53,740 DAVID SONTAG: The spacing between beats 1347 01:05:53,740 --> 01:05:56,853 is more irregular with the AF rhythm. 1348 01:05:56,853 --> 01:05:58,270 So you're sort of looking at this. 1349 01:05:58,270 --> 01:06:00,160 You see how here this spacing is very 1350 01:06:00,160 --> 01:06:01,538 different from this spacing. 1351 01:06:01,538 --> 01:06:03,580 Whereas in the normal rhythm, sort of the spacing 1352 01:06:03,580 --> 01:06:05,690 looks pretty darn regular. 1353 01:06:05,690 --> 01:06:07,060 All right, good. 1354 01:06:07,060 --> 01:06:11,050 So if I was to show you 40 examples of these 1355 01:06:11,050 --> 01:06:12,940 and then ask you to classify some new ones, 1356 01:06:12,940 --> 01:06:15,280 how well do you think you'll be able to do? 1357 01:06:15,280 --> 01:06:15,780 Pretty well? 1358 01:06:20,970 --> 01:06:23,550 I would be surprised if you couldn't do reasonably 1359 01:06:23,550 --> 01:06:26,250 well at least distinguishing between normal rhythm and AF 1360 01:06:26,250 --> 01:06:30,510 rhythm, because there seem to be some pretty clear signals here. 1361 01:06:30,510 --> 01:06:32,580 Of course, as you get into alternatives, 1362 01:06:32,580 --> 01:06:34,848 then the story gets much more complex. 1363 01:06:34,848 --> 01:06:36,390 But let me dig in a little bit deeper 1364 01:06:36,390 --> 01:06:37,980 into what I mean by this. 1365 01:06:37,980 --> 01:06:39,600 So let's define some of these terms. 1366 01:06:39,600 --> 01:06:44,430 Well, cardiologists have studied this for a really long time, 1367 01:06:44,430 --> 01:06:46,530 and they have-- so what I'm showing 1368 01:06:46,530 --> 01:06:49,380 you here is one heart cycle. 1369 01:06:49,380 --> 01:06:53,220 And they've-- you can put names to each of the peaks that you 1370 01:06:53,220 --> 01:06:55,860 would see in a regular heart cycle-- so that-- for example, 1371 01:06:55,860 --> 01:06:59,250 that very high peak is known as the R peak. 1372 01:06:59,250 --> 01:07:03,060 And you could look at, for example, the interval-- 1373 01:07:03,060 --> 01:07:06,720 so this is one beat. 1374 01:07:06,720 --> 01:07:10,320 You could look at the interval between the R peak of one beat 1375 01:07:10,320 --> 01:07:13,050 and the R peak of another peak, and define 1376 01:07:13,050 --> 01:07:15,440 that to be the RR interval. 1377 01:07:15,440 --> 01:07:18,050 In a similar way, one could take-- 1378 01:07:18,050 --> 01:07:21,060 one could find different distinctive elements 1379 01:07:21,060 --> 01:07:22,140 of the signal-- 1380 01:07:22,140 --> 01:07:23,032 by the way, each-- 1381 01:07:25,680 --> 01:07:28,110 each time step corresponds to the heart 1382 01:07:28,110 --> 01:07:30,410 being in a different position. 1383 01:07:30,410 --> 01:07:33,860 For a healthy heart, these are relatively deterministic. 1384 01:07:33,860 --> 01:07:36,330 And so you could look at other distances and derive 1385 01:07:36,330 --> 01:07:38,010 features from those distances, as well, 1386 01:07:38,010 --> 01:07:40,160 just like we were talking about, both within a beat 1387 01:07:40,160 --> 01:07:42,220 and across beats. 1388 01:07:42,220 --> 01:07:42,895 Yep? 1389 01:07:42,895 --> 01:07:44,312 AUDIENCE: So what's the difference 1390 01:07:44,312 --> 01:07:46,090 between a segment and an interval again? 
1391 01:07:48,333 --> 01:07:50,250 DAVID SONTAG: I don't know what the difference 1392 01:07:50,250 --> 01:07:51,420 between a segment and an interval is. 1393 01:07:51,420 --> 01:07:52,070 Does anyone else know? 1394 01:07:52,070 --> 01:07:54,070 I mean, I guess the interval is between probably 1395 01:07:54,070 --> 01:07:56,490 the heads of peaks, whereas segments might refer to 1396 01:07:56,490 --> 01:07:59,193 within an interval. 1397 01:07:59,193 --> 01:07:59,860 That's my guess. 1398 01:07:59,860 --> 01:08:00,902 Does someone know better? 1399 01:08:04,190 --> 01:08:05,630 For the purpose of today's class, 1400 01:08:05,630 --> 01:08:07,366 that's a good enough understanding. 1401 01:08:10,940 --> 01:08:14,060 The point is this is well understood. 1402 01:08:14,060 --> 01:08:16,093 One could derive features from this. 1403 01:08:16,093 --> 01:08:16,776 AUDIENCE: By us. 1404 01:08:16,776 --> 01:08:17,609 DAVID SONTAG: By us. 1405 01:08:20,180 --> 01:08:23,399 So what would a traditional approach be to this problem? 1406 01:08:23,399 --> 01:08:24,020 So this is-- 1407 01:08:24,020 --> 01:08:27,050 I'm pulling this figure from a paper from 2002. 1408 01:08:27,050 --> 01:08:30,200 What it'll do is it'll take in that signal. 1409 01:08:30,200 --> 01:08:32,960 It'll do some filtering of it. 1410 01:08:32,960 --> 01:08:35,750 Then it'll run a peak detection logic, which 1411 01:08:35,750 --> 01:08:38,840 will find these peaks, and then it'll 1412 01:08:38,840 --> 01:08:43,939 measure intervals between these peaks and within a beat. 1413 01:08:43,939 --> 01:08:48,069 And it'll take those computations 1414 01:08:48,069 --> 01:08:49,760 and make some decision based on them. 1415 01:08:49,760 --> 01:08:51,590 So that's a traditional algorithm, 1416 01:08:51,590 --> 01:08:54,310 and they work pretty reasonably. 1417 01:08:54,310 --> 01:08:56,560 And so what do I mean by signal processing? 1418 01:08:56,560 --> 01:08:58,790 Well, this is an example of that. 1419 01:08:58,790 --> 01:09:01,880 I encourage any of you to go home today and try to code up 1420 01:09:01,880 --> 01:09:03,140 a peak finding algorithm. 1421 01:09:03,140 --> 01:09:06,819 It's not that hard, at least not to get an OK one. 1422 01:09:06,819 --> 01:09:11,149 You might imagine keeping a running tab 1423 01:09:11,149 --> 01:09:13,811 of what's the highest signal you've seen so far. 1424 01:09:13,811 --> 01:09:16,019 Then you look to see what is the first time it drops, 1425 01:09:16,019 --> 01:09:18,394 and the second time-- and the next time it goes up larger 1426 01:09:18,394 --> 01:09:22,064 than, let's say, the previous-- 1427 01:09:22,064 --> 01:09:22,939 suppose that one of-- 1428 01:09:22,939 --> 01:09:26,689 you want to look for when the drop is-- the maximum value-- 1429 01:09:26,689 --> 01:09:28,790 recent maximum value divided by 2. 1430 01:09:28,790 --> 01:09:31,279 And then you-- then you reset. 1431 01:09:31,279 --> 01:09:33,800 And you can imagine in this way very quickly coding up 1432 01:09:33,800 --> 01:09:37,755 a peak finding algorithm. 1433 01:09:37,755 --> 01:09:39,380 And so this is just, again, to give you 1434 01:09:39,380 --> 01:09:43,130 some intuition behind what a traditional approach would be.
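Here is a rough Python sketch of the kind of peak-finding-plus-interval pipeline being described. The half-of-the-recent-maximum rule, the refractory period, and the synthetic signal are all illustrative choices, not a validated R-peak detector.

```python
import numpy as np

def find_r_peaks(signal, fs, refractory_s=0.25):
    """Very simple R-peak detector: track a running maximum, and commit a peak
    once the signal has dropped below half of that recent maximum, then reset.
    fs is the sampling rate in Hz; refractory_s suppresses double detections."""
    peaks = []
    running_max, running_argmax = -np.inf, 0
    last_peak = -np.inf
    for i, v in enumerate(signal):
        if v > running_max:
            running_max, running_argmax = v, i
        elif running_max > 0 and v < running_max / 2:
            if running_argmax - last_peak > refractory_s * fs:
                peaks.append(running_argmax)
                last_peak = running_argmax
            running_max, running_argmax = v, i   # reset and start looking again
    return np.array(peaks)

def rr_features(peaks, fs):
    """RR intervals in seconds, plus a crude irregularity feature
    (coefficient of variation of the RR intervals)."""
    rr = np.diff(peaks) / fs
    return rr, rr.std() / rr.mean()

# Toy usage on a spiky, roughly periodic stand-in for an EKG trace.
fs = 300
t = np.arange(0, 10, 1 / fs)
ekg = np.sin(2 * np.pi * 1.2 * t) ** 15
peaks = find_r_peaks(ekg, fs)
rr, irregularity = rr_features(peaks, fs)
print(len(peaks), rr.mean(), irregularity)
```

The irregularity feature at the end is in the spirit of the RR-interval-based detectors discussed next: a simple threshold on it already separates very regular rhythms from very irregular ones.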
1435 01:09:43,130 --> 01:09:46,790 And then you can very quickly see that that-- 1436 01:09:46,790 --> 01:09:49,729 once you start to look at some intervals between peaks, 1437 01:09:49,729 --> 01:09:52,880 that alone is often good enough for predicting 1438 01:09:52,880 --> 01:09:55,050 whether a patient has atrial fibrillation. 1439 01:09:55,050 --> 01:09:58,940 So this is a figure taken from paper in 2001 1440 01:09:58,940 --> 01:10:01,310 showing a single patient's time series. 1441 01:10:01,310 --> 01:10:04,940 So the x-axis is for that single patient, 1442 01:10:04,940 --> 01:10:07,250 their heart beats across time. 1443 01:10:07,250 --> 01:10:09,830 The y-axis is just showing the RR interval 1444 01:10:09,830 --> 01:10:14,300 between the previous beat and the current beat. 1445 01:10:14,300 --> 01:10:18,080 And down here in the bottom is the ground truth 1446 01:10:18,080 --> 01:10:20,990 of whether the patient is assessed to have-- 1447 01:10:20,990 --> 01:10:27,650 to be in-- to have a normal rhythm or atrial fibrillation, 1448 01:10:27,650 --> 01:10:30,630 which is noted as this higher value here. 1449 01:10:30,630 --> 01:10:33,830 So these are AF rhythms. 1450 01:10:33,830 --> 01:10:34,710 This is normal. 1451 01:10:34,710 --> 01:10:36,800 This is AF again. 1452 01:10:36,800 --> 01:10:40,670 And what you can see is that the RR interval actually 1453 01:10:40,670 --> 01:10:41,640 gets you pretty far. 1454 01:10:41,640 --> 01:10:44,210 You notice how it's pretty high up here. 1455 01:10:44,210 --> 01:10:46,130 Suddenly it drops. 1456 01:10:46,130 --> 01:10:47,930 The RR interval drops for a while, 1457 01:10:47,930 --> 01:10:50,450 and that's when the patient has AF. 1458 01:10:50,450 --> 01:10:51,860 Then it goes up again. 1459 01:10:51,860 --> 01:10:54,780 Then it drops again, and so on. 1460 01:10:54,780 --> 01:10:56,780 And so it's not deterministic, the relationship, 1461 01:10:56,780 --> 01:10:59,143 but there's definitely a lot of signal just from that. 1462 01:10:59,143 --> 01:11:00,560 So you might say, OK, well, what's 1463 01:11:00,560 --> 01:11:02,480 the next thing we could do to try to clean up the signal 1464 01:11:02,480 --> 01:11:03,230 a little bit more? 1465 01:11:03,230 --> 01:11:11,210 So flash backwards from 2001 to 1970 here at MIT, studied by-- 1466 01:11:11,210 --> 01:11:13,760 actually, no, this is not MIT. 1467 01:11:13,760 --> 01:11:16,070 This is somewhere else, sorry. 1468 01:11:16,070 --> 01:11:21,398 But still 1970-- where they used a Markov model very 1469 01:11:21,398 --> 01:11:23,690 similar to the Markov models we were just talking about 1470 01:11:23,690 --> 01:11:30,410 in the previous example to model what a sequence of normal RR 1471 01:11:30,410 --> 01:11:34,310 intervals looks like versus what a sequence of abnormal, 1472 01:11:34,310 --> 01:11:37,370 for example, AF RR intervals looks like. 1473 01:11:37,370 --> 01:11:39,590 And in that way, one can recognize 1474 01:11:39,590 --> 01:11:42,980 that, for any one observation of an RR interval 1475 01:11:42,980 --> 01:11:45,540 might not by itself be perfectly predictive, 1476 01:11:45,540 --> 01:11:47,480 but if you look at sort of a sequence of them 1477 01:11:47,480 --> 01:11:50,480 for a patient with atrial fibrillation, 1478 01:11:50,480 --> 01:11:53,420 there is some common pattern to it. 
1479 01:11:53,420 --> 01:11:56,090 And you can-- one can detect it by just looking at likelihood 1480 01:11:56,090 --> 01:11:59,450 of that sequence under each of these two different models, 1481 01:11:59,450 --> 01:12:01,230 normal and abnormal. 1482 01:12:01,230 --> 01:12:04,070 And that did pretty well-- even better than the previous 1483 01:12:04,070 --> 01:12:05,310 approaches for-- 1484 01:12:05,310 --> 01:12:08,370 for predicting atrial fibrillation. 1485 01:12:08,370 --> 01:12:11,790 This is the paper I wanted to say from MIT. 1486 01:12:11,790 --> 01:12:15,880 Now 1991, this is also from Roger Mark's group. 1487 01:12:15,880 --> 01:12:19,480 Now this is a neural network based approach, where it says, 1488 01:12:19,480 --> 01:12:22,108 OK, we're going to take a bunch of these things. 1489 01:12:22,108 --> 01:12:24,150 We're going to derive a bunch of these intervals, 1490 01:12:24,150 --> 01:12:25,890 and then we're going to throw that through a black box 1491 01:12:25,890 --> 01:12:27,432 supervised machine learning algorithm 1492 01:12:27,432 --> 01:12:30,240 to predict whether a patient has AF or not. 1493 01:12:30,240 --> 01:12:32,220 So these are very-- 1494 01:12:32,220 --> 01:12:34,890 first of all, there are some simple approaches here 1495 01:12:34,890 --> 01:12:36,540 that work reasonably well. 1496 01:12:36,540 --> 01:12:42,280 Using neural networks in this domain is not a new thing, 1497 01:12:42,280 --> 01:12:44,140 but where are we as a field? 1498 01:12:44,140 --> 01:12:46,920 So as I mentioned, there was this competition last year, 1499 01:12:46,920 --> 01:12:48,887 and what I'm showing you here-- the citation 1500 01:12:48,887 --> 01:12:50,470 is from one of the winning approaches. 1501 01:12:50,470 --> 01:12:52,845 And this winning approach really brings the two paradigms 1502 01:12:52,845 --> 01:12:53,910 together. 1503 01:12:53,910 --> 01:12:57,600 It extracts a large number of expert derived features-- 1504 01:12:57,600 --> 01:12:59,342 so shown here. 1505 01:12:59,342 --> 01:13:01,050 And these are exactly the types of things 1506 01:13:01,050 --> 01:13:06,390 you might think, like proportion, median RR 1507 01:13:06,390 --> 01:13:11,417 interval of regular rhythms, max RR irregularity measure. 1508 01:13:11,417 --> 01:13:13,500 And there's just a whole range of different things 1509 01:13:13,500 --> 01:13:16,160 that you can imagine manually deriving from the data. 1510 01:13:16,160 --> 01:13:17,910 And you throw all of these features 1511 01:13:17,910 --> 01:13:21,840 into a machine learning algorithm, 1512 01:13:21,840 --> 01:13:25,040 maybe a random forest, maybe a neural network, doesn't matter. 1513 01:13:25,040 --> 01:13:27,180 And what you get out is a slightly better algorithm 1514 01:13:27,180 --> 01:13:28,555 than what if you had just come up 1515 01:13:28,555 --> 01:13:30,510 with a simple rule on your own. 1516 01:13:30,510 --> 01:13:33,470 That was the winning algorithm then. 1517 01:13:33,470 --> 01:13:36,970 And in the summary paper, they conjectured that, well, maybe 1518 01:13:36,970 --> 01:13:39,357 it's the case that they were-- 1519 01:13:39,357 --> 01:13:41,440 they'd expected that convolutional neural networks 1520 01:13:41,440 --> 01:13:42,443 would win. 1521 01:13:42,443 --> 01:13:44,860 And they were surprised that none of the winning solutions 1522 01:13:44,860 --> 01:13:47,070 involved convolution neural networks. 
1523 01:13:47,070 --> 01:13:50,297 And as for why, they conjectured that maybe 1524 01:13:50,297 --> 01:13:52,630 the 8,000 patients that they had [INAUDIBLE] 1525 01:13:52,630 --> 01:13:56,590 just weren't enough to give the more complex models an advantage. 1526 01:13:56,590 --> 01:14:00,370 So flip forward now to this year and the article 1527 01:14:00,370 --> 01:14:05,840 that you read in your readings, in Nature Medicine, 1528 01:14:05,840 --> 01:14:07,420 where the Stanford group now showed 1529 01:14:07,420 --> 01:14:10,540 how a convolutional neural network approach, which 1530 01:14:10,540 --> 01:14:13,960 is, in many ways, extremely naive-- all it does 1531 01:14:13,960 --> 01:14:17,870 is take the sequence data in. 1532 01:14:17,870 --> 01:14:20,710 It makes no attempt at trying to understand the underlying 1533 01:14:20,710 --> 01:14:23,800 physiology, and just predicts from that-- 1534 01:14:23,800 --> 01:14:25,647 can do really, really well. 1535 01:14:25,647 --> 01:14:27,230 And so there are a couple of differences 1536 01:14:27,230 --> 01:14:29,590 that I want to emphasize relative to the previous work. 1537 01:14:29,590 --> 01:14:31,360 First, the sensor is different. 1538 01:14:31,360 --> 01:14:35,580 Whereas the previous work used the AliveCor sensor, 1539 01:14:35,580 --> 01:14:37,420 in this paper from Stanford, they're 1540 01:14:37,420 --> 01:14:40,870 using a different sensor called the Zio patch, which 1541 01:14:40,870 --> 01:14:44,110 is attached to the human body and conceivably 1542 01:14:44,110 --> 01:14:45,580 much less noisy. 1543 01:14:45,580 --> 01:14:47,560 So that's one big difference. 1544 01:14:47,560 --> 01:14:49,810 The second big difference is that there's dramatically 1545 01:14:49,810 --> 01:14:50,770 more data. 1546 01:14:50,770 --> 01:14:52,510 Instead of 8,000 patients to train from, 1547 01:14:52,510 --> 01:14:54,790 now they have over 90,000 records 1548 01:14:54,790 --> 01:14:58,060 from 50,000 different patients to train from. 1549 01:14:58,060 --> 01:14:59,740 The third major difference is that now, 1550 01:14:59,740 --> 01:15:02,740 rather than just trying to classify into four categories-- 1551 01:15:02,740 --> 01:15:06,723 normal, abnormal, other, or noisy-- 1552 01:15:06,723 --> 01:15:08,140 now we're going to try to classify 1553 01:15:08,140 --> 01:15:09,880 into 14 different categories. 1554 01:15:09,880 --> 01:15:12,850 We're, in essence, breaking apart that other class 1555 01:15:12,850 --> 01:15:15,610 into much finer-grained detail of different types 1556 01:15:15,610 --> 01:15:17,780 of abnormal rhythms. 1557 01:15:17,780 --> 01:15:20,110 And so here are some of those other abnormal rhythms, 1558 01:15:20,110 --> 01:15:28,140 things like complete heart block, 1559 01:15:28,140 --> 01:15:31,650 and a bunch of other names I can't pronounce. 1560 01:15:31,650 --> 01:15:34,472 And from each one of these, they gathered a lot of data. 1561 01:15:34,472 --> 01:15:35,430 And they actually did that-- 1562 01:15:35,430 --> 01:15:36,870 it's not described in the paper, 1563 01:15:36,870 --> 01:15:38,160 but I've talked to the authors, and 1564 01:15:38,160 --> 01:15:40,690 they gathered this data in a very interesting way. 1565 01:15:40,690 --> 01:15:42,720 So they did their training iteratively.
1566 01:15:42,720 --> 01:15:44,460 They looked to see where their errors were, 1567 01:15:44,460 --> 01:15:46,752 and then they went and gathered more data from patients 1568 01:15:46,752 --> 01:15:48,180 with that subcategory. 1569 01:15:48,180 --> 01:15:51,930 So many of these other categories 1570 01:15:51,930 --> 01:15:54,267 might be underrepresented 1571 01:15:54,267 --> 01:15:56,100 in the general population, but they actually 1572 01:15:56,100 --> 01:15:57,810 gathered a lot of patients of that type 1573 01:15:57,810 --> 01:16:00,520 in their data set for training purposes. 1574 01:16:00,520 --> 01:16:02,700 And so I think those three things ended up 1575 01:16:02,700 --> 01:16:05,320 making a very big difference. 1576 01:16:05,320 --> 01:16:07,050 So what is their convolutional network? 1577 01:16:07,050 --> 01:16:10,180 Well, first of all, the input is a 1-D signal. 1578 01:16:10,180 --> 01:16:12,180 So it's a little bit different from the convnets 1579 01:16:12,180 --> 01:16:13,380 you typically see in computer vision, 1580 01:16:13,380 --> 01:16:15,088 and I'll show you an illustration of that 1581 01:16:15,088 --> 01:16:16,080 in the next slide. 1582 01:16:16,080 --> 01:16:17,430 It's a very deep model. 1583 01:16:17,430 --> 01:16:20,100 So it's 34 layers. 1584 01:16:20,100 --> 01:16:23,010 So the input comes in at the very top in this picture. 1585 01:16:23,010 --> 01:16:26,730 It's passed through a number of layers. 1586 01:16:26,730 --> 01:16:30,210 Each layer consists of a convolution followed 1587 01:16:30,210 --> 01:16:33,600 by rectified linear units, and there is 1588 01:16:33,600 --> 01:16:35,790 subsampling at every other layer so that you 1589 01:16:35,790 --> 01:16:38,010 go from a very wide signal-- 1590 01:16:38,010 --> 01:16:39,645 so a very long signal-- 1591 01:16:39,645 --> 01:16:40,770 I can't remember exactly how long, 1592 01:16:40,770 --> 01:16:43,830 maybe 1 second long-- summarized down 1593 01:16:43,830 --> 01:16:47,165 into a much smaller number of dimensions, 1594 01:16:47,165 --> 01:16:49,290 which then goes into a fully connected layer 1595 01:16:49,290 --> 01:16:52,770 at the bottom to make your predictions. 1596 01:16:52,770 --> 01:16:55,590 And then they also have these shortcut connections, 1597 01:16:55,590 --> 01:16:58,770 which allow you to pass information from earlier layers 1598 01:16:58,770 --> 01:17:00,630 down to the very end of the network, 1599 01:17:00,630 --> 01:17:02,255 or even into intermediate layers. 1600 01:17:02,255 --> 01:17:04,380 And for those of you who are familiar with residual 1601 01:17:04,380 --> 01:17:06,850 networks, it's the same idea. 1602 01:17:06,850 --> 01:17:08,340 So what is a 1D convolution? 1603 01:17:08,340 --> 01:17:10,270 Well, it looks a little bit like this. 1604 01:17:10,270 --> 01:17:12,960 So this is the signal. 1605 01:17:12,960 --> 01:17:15,570 I'm going to just approximate it by a bunch of 1's and 0's. 1606 01:17:15,570 --> 01:17:16,560 I'll say this is a 1. 1607 01:17:16,560 --> 01:17:17,360 This is a 0. 1608 01:17:17,360 --> 01:17:18,480 This is a 1, a 1, and so on. 1609 01:17:21,620 --> 01:17:25,280 A convolutional network has a filter associated with it. 1610 01:17:25,280 --> 01:17:28,070 In a 1D model, that filter is applied 1611 01:17:28,070 --> 01:17:29,630 by sliding it along the signal-- 1612 01:17:29,630 --> 01:17:32,240 you just take a dot product of the filter's values 1613 01:17:32,240 --> 01:17:35,150 with the values of the signal at each point in time.
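As a minimal sketch of that sliding dot product (strictly speaking a cross-correlation, which is what most deep learning libraries call convolution), here is a "valid" 1-D convolution of a single filter with a signal. The example signal below is made up; only its first three values and the filter echo the numbers worked through on the slide.

```python
import numpy as np

# Sketch of a valid-mode 1-D convolution: slide the filter along the signal and
# take a dot product at each position. No padding, stride 1, a single filter.

def conv1d(signal, filt):
    signal = np.asarray(signal, dtype=float)
    filt = np.asarray(filt, dtype=float)
    n_out = len(signal) - len(filt) + 1
    return np.array([signal[i:i + len(filt)] @ filt for i in range(n_out)])

print(conv1d([1, 0, 1, 1, 0, 0, 1], [2, 3, 1]))
# First output position: 1*2 + 0*3 + 1*1 = 3, matching the worked example below.
```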
1614 01:17:35,150 --> 01:17:38,130 So it looks a little bit like this, 1615 01:17:38,130 --> 01:17:39,450 and this is what you get out. 1616 01:17:39,450 --> 01:17:42,330 So this is the convolution of a single filter 1617 01:17:42,330 --> 01:17:44,760 with the whole signal. 1618 01:17:44,760 --> 01:17:47,140 And the computation I did there-- so for example, 1619 01:17:47,140 --> 01:17:49,860 this first number came from taking the dot product 1620 01:17:49,860 --> 01:17:51,360 of the first three numbers-- 1621 01:17:51,360 --> 01:17:53,370 1, 0, 1-- with the filter. 1622 01:17:53,370 --> 01:18:01,548 So it's 1 times 2, plus 0 times 3, plus 1 times 1, which is 3. 1623 01:18:01,548 --> 01:18:03,090 And each of the subsequent numbers 1624 01:18:03,090 --> 01:18:04,900 was computed in the same way. 1625 01:18:04,900 --> 01:18:09,060 And usually I'd have you figure out what this last one is, 1626 01:18:09,060 --> 01:18:12,440 but I'll leave that for you to do at home. 1627 01:18:12,440 --> 01:18:14,097 And that's what a 1D convolution is. 1628 01:18:14,097 --> 01:18:16,680 And so they do this for lots of different filters. 1629 01:18:16,680 --> 01:18:19,155 Each of those filters might be of varying lengths, 1630 01:18:19,155 --> 01:18:21,030 and each of those will detect different types 1631 01:18:21,030 --> 01:18:23,040 of signal patterns. 1632 01:18:23,040 --> 01:18:25,800 And in this way, after having many layers of these, 1633 01:18:25,800 --> 01:18:28,320 one can, in an automatic fashion, 1634 01:18:28,320 --> 01:18:31,080 extract many of the same types of signals used in that earlier 1635 01:18:31,080 --> 01:18:32,997 work, but also be much more flexible and detect 1636 01:18:32,997 --> 01:18:34,420 some new ones as well. 1637 01:18:34,420 --> 01:18:37,120 Hold your question, because I need to wrap up. 1638 01:18:37,120 --> 01:18:38,710 So in the paper that you read, they 1639 01:18:38,710 --> 01:18:41,902 talked about how they evaluated this. 1640 01:18:41,902 --> 01:18:44,110 And so I'm not going to go into much depth on it now. 1641 01:18:44,110 --> 01:18:46,330 I just want to point out two different metrics 1642 01:18:46,330 --> 01:18:47,320 that they used. 1643 01:18:47,320 --> 01:18:48,910 So the first metric they used was 1644 01:18:48,910 --> 01:18:52,690 what they called a sequential error metric. 1645 01:18:52,690 --> 01:18:55,990 What that looks at is, you have this very long sequence 1646 01:18:55,990 --> 01:19:00,670 for each patient, and they labeled the different one- 1647 01:19:00,670 --> 01:19:02,350 second intervals of that sequence 1648 01:19:02,350 --> 01:19:05,690 as abnormal, normal, and so on. 1649 01:19:05,690 --> 01:19:07,113 So you could ask, how good are we 1650 01:19:07,113 --> 01:19:08,780 at labeling each of the different points 1651 01:19:08,780 --> 01:19:09,600 along the sequence? 1652 01:19:09,600 --> 01:19:11,720 And that's the sequence metric. 1653 01:19:11,720 --> 01:19:14,510 The second metric is the set metric, 1654 01:19:14,510 --> 01:19:16,520 and that looks at, if the patient has 1655 01:19:16,520 --> 01:19:19,730 something that's abnormal anywhere, did you detect it? 1656 01:19:19,730 --> 01:19:22,040 So that's, in essence, taking an OR over 1657 01:19:22,040 --> 01:19:23,510 each of those one-second intervals, 1658 01:19:23,510 --> 01:19:25,310 and then looking across patients.
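Here is a small sketch of how those two views might be computed, assuming that for each record you have a list of per-interval predicted rhythm labels and a matching list of ground-truth labels. The function names and the simple micro-averaged F1 are illustrative choices, not the paper's exact definitions.

```python
def sequence_f1(true_seqs, pred_seqs, positive_class):
    """Micro-averaged F1 over every labeled interval of every record (the 'sequence' view)."""
    tp = fp = fn = 0
    for truth, pred in zip(true_seqs, pred_seqs):
        for t, p in zip(truth, pred):
            tp += (p == positive_class and t == positive_class)
            fp += (p == positive_class and t != positive_class)
            fn += (p != positive_class and t == positive_class)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def set_labels(true_seqs, pred_seqs, positive_class):
    """The 'set' view: for each record, did the rhythm appear anywhere (an OR over intervals)?"""
    return [(positive_class in truth, positive_class in pred)
            for truth, pred in zip(true_seqs, pred_seqs)]
```

The record-level (truth, prediction) pairs from the set view can then be scored with the same precision/recall machinery to get a record-level F1.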
1659 01:19:25,310 --> 01:19:27,410 And from a clinical diagnostic perspective, 1660 01:19:27,410 --> 01:19:29,510 the set metric might be most useful, but 1661 01:19:29,510 --> 01:19:31,340 when you want to introspect and understand 1662 01:19:31,340 --> 01:19:34,370 where that is happening, then the sequence metric is 1663 01:19:34,370 --> 01:19:35,600 important. 1664 01:19:35,600 --> 01:19:38,300 And the key take-home message from the paper is that, 1665 01:19:38,300 --> 01:19:41,240 if you compare the model's predictions-- this is, I think, 1666 01:19:41,240 --> 01:19:44,990 using an F1 metric-- 1667 01:19:44,990 --> 01:19:49,790 to what you would get from a panel of cardiologists, 1668 01:19:49,790 --> 01:19:53,510 these models are doing as well as, if not better than, those panels 1669 01:19:53,510 --> 01:19:54,500 of cardiologists. 1670 01:19:54,500 --> 01:19:56,930 So this is extremely exciting. 1671 01:19:56,930 --> 01:19:58,700 This is technology-- or variants of this 1672 01:19:58,700 --> 01:20:02,240 are technology that you're going to see deployed now. 1673 01:20:02,240 --> 01:20:04,760 So for those of you who have purchased these Apple Watches, 1674 01:20:04,760 --> 01:20:07,220 these Samsung watches, I don't know exactly what they're 1675 01:20:07,220 --> 01:20:08,637 using, but I wouldn't be surprised 1676 01:20:08,637 --> 01:20:10,580 if they're using techniques similar to this. 1677 01:20:10,580 --> 01:20:12,390 And you're going to see much more of that in the future. 1678 01:20:12,390 --> 01:20:14,030 So this is really the first example 1679 01:20:14,030 --> 01:20:15,447 in this course so far of something 1680 01:20:15,447 --> 01:20:18,280 that's actually been deployed. 1681 01:20:18,280 --> 01:20:20,660 And so in summary, we're very often 1682 01:20:20,660 --> 01:20:22,450 in the realm of not having enough data. 1683 01:20:22,450 --> 01:20:24,860 And in this lecture today, we gave two examples 1684 01:20:24,860 --> 01:20:26,030 of how you can deal with that. 1685 01:20:26,030 --> 01:20:31,340 First, you can try to use mechanistic and statistical 1686 01:20:31,340 --> 01:20:38,150 models to work in settings where 1687 01:20:38,150 --> 01:20:39,590 you don't have much data. 1688 01:20:39,590 --> 01:20:42,333 And at the other extreme, you do have a lot of data, 1689 01:20:42,333 --> 01:20:44,000 and you can set that mechanistic modeling aside, and just 1690 01:20:44,000 --> 01:20:45,292 use these black-box approaches. 1691 01:20:45,292 --> 01:20:46,930 That's all for today.