PETER SZOLOVITS: All right. Let's get started. Good afternoon.

So last time, I started talking about the use of natural language processing to process clinical data. And things went a little bit slowly, and so we didn't get through a lot of the material. I'm going to try to rush a bit more today. And as a result, I have a lot of stuff to cover.

So if you remember, last time, I started by saying that a lot of the NLP work involves coming up with phrases that one might be interested in to help identify the kinds of data that you want, and then just looking for those in text. So that's a very simple method, but it's one that works reasonably well. And then Kat Liao was here to talk about some of the applications of that kind of work in what she's been doing in cohort selection.

So what I want to talk about today is more sophisticated versions of that, and then move on to more contemporary approaches to natural language processing.

So this is a paper that was given to you as one of the optional readings last time. And it's work from David Sontag's lab, where they said, well, how do we make this more sophisticated? So they start the same way. They say, OK, Dr. Liao, let's say, give me terms that are very good indicators that I have the right kind of patient, if I find them in the patient's notes. So these are things with high predictive value. So you don't want to use a term like "sick," because that's going to find way too many people. But you want to find something that is very specific but that has a high predictive value, so that you are going to find the right person.

And then what they did is they built a model that tries to predict the presence of that word in the text from everything else in the medical record. So now, this is an example of a silver-standard way of training a model that says, well, I don't have the energy or the time to get doctors to look through thousands and thousands of records.
But if I select these anchors well enough, then I'm going to get a high yield of correct responses from those. And then I train a machine learning model that learns to identify those same terms, or those same records that have those terms in them. And by the way, from that, we're going to learn a whole bunch of other terms that are proxies for the ones that we started with. So this is a way of enlarging that set of terms automatically.

And so there are a bunch of technical details that you can find out about by reading the paper. They used a relatively simple representation, which is essentially a bag-of-words representation. They then masked the three words around the word that they're actually trying to predict, just to get rid of short-term syntactic correlations. And then they built an L2-regularized logistic regression model that said, what are the features that predict the occurrence of this word? And then they expanded the search vocabulary to include those features as well. And again, there are tons of details about how to discretize continuous values and things like that that you can find out about.

So you build a phenotype estimator from the anchors and the chosen predictors. They calculated a calibration score for each of these other predictors that told you how well it predicted. And then you can build a joint estimator that uses all of these. And the bottom line is that they did very well.

So in order to evaluate this, they looked at eight different phenotypes for which they had human judgment data. And so this tells you that they're getting AUCs of between 0.83 and 0.95 for these different phenotypes. So that's quite good. They, in fact, were estimating not only these eight phenotypes but 40-something; I don't remember the exact number, a much larger number. But they didn't have validated data against which to test the others.
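To make the mechanics concrete, here is a minimal sketch of the anchor idea using scikit-learn. It is not the authors' implementation: the example notes, the anchor term, and the masking details are made up for illustration.

```python
# A minimal sketch of anchor-based "silver standard" learning, assuming
# scikit-learn is available. The notes and the anchor term are made-up
# examples, not data from the paper.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

notes = [
    "patient with diabetes mellitus on insulin therapy",
    "type 2 diabetes managed with insulin and metformin",
    "no history of hyperglycemia presents with ankle sprain",
    "follow-up for hypertension blood pressure well controlled",
]
anchor = "diabetes"

def mask_anchor(text, anchor, window=3):
    """Remove the anchor and the `window` words on either side of it,
    so the model cannot just learn trivial local context."""
    tokens = text.split()
    keep, i = [], 0
    while i < len(tokens):
        if tokens[i] == anchor:
            keep = keep[:max(0, len(keep) - window)]   # drop preceding window
            i += window + 1                            # skip anchor + following window
        else:
            keep.append(tokens[i])
            i += 1
    return " ".join(keep)

y = np.array([anchor in t.split() for t in notes], dtype=int)  # silver-standard label
X_text = [mask_anchor(t, anchor) for t in notes]

vec = CountVectorizer()                                        # bag-of-words representation
X = vec.fit_transform(X_text)
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)   # L2-regularized
clf.fit(X, y)

# Words with the largest positive weights act as learned proxies for the anchor.
vocab = np.array(vec.get_feature_names_out())
print(vocab[np.argsort(clf.coef_[0])[::-1][:5]])
```

In the real system, the learned proxy terms are then calibrated and combined into a joint phenotype estimator; the sketch only shows the vocabulary-expansion step.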
But the expectation is that if it does well on these, it probably does well on the others as well. So this was a very nice idea.

And just to illustrate, if you start with something like diabetes as a phenotype, you say, well, I'm going to look for anchors that are an ICD-9 code of 250, diabetes mellitus, or I'm going to look at the medication history for diabetic therapy. So those are the silver-standard anchors that I'm looking at. And those, in fact, have a high predictive value for somebody being in the cohort. And then they identify all these other features that predict those, and therefore, in turn, predict appropriate selectors for the phenotype that they're interested in. And if you look at the paper again, what you see is that this outperforms, over time, the standard supervised baseline that they're comparing against: you're getting much higher accuracy early in a patient's visit to be able to identify them as belonging to this cohort.

I'm going to come back later to look at another, similar attempt to generalize from a core set of terms using a different set of techniques. So you should see that in about 45 minutes, I hope.

Well, context is important. So if you look at a sentence like "Mr. Huntington was treated for Huntington's disease at Huntington Hospital, located on Huntington Avenue," each of those mentions of the word Huntington is different. And for example, if you're interested in eliminating personally identifiable health information from a record like this, then certainly you want to get rid of the Mr. Huntington part. You don't want to get rid of Huntington's disease, because that's a medically relevant fact. And you probably do want to get rid of Huntington Hospital and its location on Huntington Avenue, although those are not necessarily something that you're prohibited from retaining. So for example, if you're trying to do quality studies among different hospitals, then it would make sense to retain the name of the hospital, which is not considered identifying of the individual.
So we, in fact, did a study back in the mid-2000s where we were trying to build an improved de-identifier. And here's the way we went about it. This is a kind of kitchen-sink approach that says, OK, take the text, tokenize it, look at every single token, and derive things from it: the words that make up the token, the part of speech, how it's capitalized, whether there's punctuation around it, which document section it's in. Many databases have a sort of conventional document structure. If you've looked at the MIMIC discharge summaries, for example, there's a kind of prototypical way in which the note flows from beginning to end. And you can use that structural information.

We then identified a bunch of patterns and thesaurus terms. So we looked up words and phrases in the UMLS to see if they matched some clinically meaningful term. We had patterns that identified things like phone numbers and social security numbers and addresses and so on. And then we did parsing of the text. In those days, we used something called the Link Grammar Parser, although it doesn't make a whole lot of difference which parser. You get either a constituency or a dependency parse, which gives you relationships among the words. And so it allows you to include, as features, the way in which a word that you're looking at relates to other words around it.

And so what we did is we said, OK, the lexical context includes all of the above kinds of information for all of the words that are either literally adjacent or within n words of the original word that you're focusing on, or that are linked within k links through the parse to that word. So this gives you a very large set of features.

And of course, parsing is not a solved problem. And so this is an example from that story that I showed you last time. And as you see, it comes up with 24 ambiguous parses of this sentence. So there are technical problems about how to deal with that.
Today, you could use a different parser. The Stanford Parser, for example, probably does a better job than the one we were using 14 years ago, and it gives you at least more definitive answers. And so you could use that instead.

And so if you look at what we did, we said, well, here is the text "Mr." And here are all the ways that you can look it up in the UMLS. And it turns out to be very ambiguous. So M-R stands not only for mister, but it also stands for magnetic resonance, and it stands for a whole bunch of other things. And so you get huge amounts of ambiguity. "Blind" also turns out to give you various ambiguities; it maps here to four different concept unique identifiers (CUIs). "Is" is OK. "79-year-old" is OK. And then "male," again, maps to five different concept unique identifiers. So there are all these problems of over-generation from this database. And here's some more, but I'm going to skip over that.

And then the learning model, in our case, was a support vector machine for this project, in which we just said, well, throw in all the features. You know, it's the "kill them all, and God will sort them out" kind of approach. So we just threw in all these features and said, oh, support vector machines are really good at picking out exactly which are the best features. And so we just relied on that. And sure enough, you wind up with literally millions of features, but it worked pretty well. And so Stat De-ID was our program. And you see that on real discharge summaries, we're getting precision and recall on PHI up around 98.5% and 95.25%, which was much better than the previous state of the art, which had been based on rules and dictionaries as a way of de-identifying things. So this was a successful example of that approach.

And of course, this is usable not only for de-identification, but also for entity recognition.
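Here is a schematic sketch of that token-level, kitchen-sink feature approach, assuming scikit-learn. The features, regular expressions, and toy labels are invented; the actual system also used UMLS lookups, section information, and parse-link features.

```python
# A schematic sketch of the kitchen-sink, token-level feature approach to
# de-identification / entity tagging, assuming scikit-learn. The features,
# patterns, and tiny training set are invented for illustration.
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

PHONE = re.compile(r"^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$")

def token_features(tokens, i):
    tok = tokens[i]
    feats = {
        "lower=" + tok.lower(): 1,
        "is_capitalized": int(tok[:1].isupper()),
        "is_digit": int(tok.isdigit()),
        "looks_like_phone": int(bool(PHONE.match(tok))),
    }
    # lexical context: neighboring tokens within a small window
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"ctx{offset}=" + tokens[j].lower()] = 1
    return feats

# Toy training data: 1 = PHI token, 0 = not PHI
sentences = [
    (["Mr.", "Huntington", "was", "treated", "for", "Huntington's", "disease"],
     [1, 1, 0, 0, 0, 0, 0]),
    (["Call", "617-555-1212", "for", "follow-up"],
     [0, 1, 0, 0]),
]

X_dicts, y = [], []
for toks, labels in sentences:
    for i, lab in enumerate(labels):
        X_dicts.append(token_features(toks, i))
        y.append(lab)

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)
clf = LinearSVC()          # the SVM is left to pick out the useful features
clf.fit(X, y)

print(clf.predict(vec.transform([token_features(["Dr.", "Szolovits"], 1)])))
```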
That's because instead of selecting entities that are personally identifiable health information, you could train it to select entities that are diseases, or that are medications, or that are various other things. And so this was, in the 2000s, a pretty typical way for people to approach these kinds of problems. And it's still used today. There are tools around that let you do this, and they work reasonably effectively. They're not state of the art at the moment, but they're simpler than many of today's state-of-the-art methods.

So here's another approach. This was something we published a few years ago, where we started working with some psychiatrists and said, could we predict 30-day readmission for a psychiatric patient with any degree of reliability? That's a hard prediction. Willie is currently running an experiment where we're asking psychiatrists to predict that, and it turns out they're barely better than chance at that prediction. So it's not an easy task.

And what we did is we said, well, let's use topic modeling. And so we had this cohort of close to 5,000 patients. About 10% of them were readmitted with a psych diagnosis, and almost 3,000 of them were readmitted with other diagnoses. So one thing this tells you right away is that if you're dealing with psychiatric patients, they come and go to the hospital frequently. And this is not good for the hospital's bottom line because of reimbursement policies of insurance companies and so on. So of the 4,700, only 1,240 were not readmitted within 30 days. So there's very frequent bounce-back.

So we said, well, let's try building a baseline model using a support vector machine from baseline clinical features like age, gender, and public health insurance as a proxy for socioeconomic status. So if you're on Medicaid, you're probably poor. And if you have private insurance, then you're probably an MIT employee and/or better off.
So that's a frequently used proxy. We also used a comorbidity index, which tells you roughly how sick you are from things other than your psychiatric problems.

And then we said, well, what if we add to that model common words from notes? So we said, let's do a TF-IDF calculation. This is the term frequency weighted by the log of the inverse document frequency, so it's a measure of how specific a term is for identifying a particular kind of condition. And we take the 1,000 most informative words, and there are a lot of these. So if you use the 1,000 most informative words from these nearly 5,000 patients, you wind up with something like 66,000 unique words that are informative for some patient. But if you limit yourself to the top 10, then it only uses 18,000 words. And if you limit yourself to the top one, then it uses about 3,000 words.

And then we said, well, instead of doing individual words, let's do a latent Dirichlet allocation. So, topic modeling on all of the words, as a bag of words: no sequence information, just the collection of words. And so we calculated 75 topics using LDA on all these notes. Just to remind you, the LDA process is a model that says every document consists of a certain mixture of topics, and each of those topics probabilistically generates certain words. And so you can build a model like this and then solve it using complicated techniques.

And you wind up with topics, in this study, as follows. I don't know, can you read these? They may be too small. So these are unsupervised topics. And if you look at the first one, it says patient, alcohol, withdrawal, depression, drinking, Ativan, ETOH, drinks, medications, clinic, inpatient, diagnosis, days, hospital, substance, use, treatment, program, name (that's a de-identified name placeholder), abuse, problem, number. And we had our experts look at these topics. And they said, oh, well, that topic is related to alcohol abuse, which seems reasonable. And then you see, on the bottom, psychosis, thought, features, paranoid, psychosis, paranoia, symptoms, psychiatric, et cetera. And they said, OK, that's a psychosis topic. So in retrospect, you can assign meaning to these topics. But in fact, they're generated without any a priori notion of what they ought to be. They're just a statistical summarization of the common co-occurrences of words in these documents.
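As a concrete illustration of the two note representations, here is a minimal scikit-learn sketch; the example notes are invented, and the study itself used roughly 4,700 admissions and 75 topics.

```python
# A minimal sketch of the two note-representation schemes described above,
# assuming scikit-learn: TF-IDF word features, and LDA topics over a
# bag-of-words. The example notes are invented.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

notes = [
    "patient with alcohol withdrawal started on ativan taper",
    "paranoid ideation and auditory hallucinations consistent with psychosis",
    "depression with suicidal ideation admitted for safety",
    "alcohol abuse history denies current drinking",
]

# 1. TF-IDF: score how informative each word is for each note
tfidf = TfidfVectorizer(max_features=1000)
X_tfidf = tfidf.fit_transform(notes)

# 2. LDA topics over raw counts (bag of words, no sequence information)
counts = CountVectorizer()
X_counts = counts.fit_transform(notes)
lda = LatentDirichletAllocation(n_components=3, random_state=0)  # 75 in the study
theta = lda.fit_transform(X_counts)        # per-note topic mixtures

# Show the top words in each unsupervised topic
vocab = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]
    print(f"topic {k}:", ", ".join(vocab[i] for i in top))

# theta (the topic mixture for each note) then becomes extra features
# alongside the baseline clinical variables in the readmission model.
```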
But what you find is that if you use the baseline model, which used just the demographic and clinical variables, and you ask, what's the difference in survival (in this case, in time to readmission) between one set and another in this cohort, the answer is that they're pretty similar. Whereas if you use a model that predicts based on the baseline plus the 75 topics that we identified, you get a much bigger separation. And of course, this is statistically significant. And it tells you that this technique is useful for improving the separation of a cohort that's more likely to be readmitted from a cohort that's less likely to be readmitted. It's not a terrific prediction; the AUC for this model was only on the order of 0.7. So you know, it's not like 0.99. But nevertheless, it provides useful information.

The same group of psychiatrists that we worked with also did a study with a much larger cohort but much less rich data. So they got all of the discharges from two medical centers over a period of 12 years. So they had 845,000 discharges from 458,000 unique individuals. And they were looking for suicide or other causes of death in these patients, to see if they could predict whether somebody is likely to try to harm themselves, or whether they're likely to die accidentally, which sometimes can't be distinguished from suicide. So the censoring problems that David talked about are very much present in this, because you lose track of people. It's a highly imbalanced data set.
Because out of those 845,000 discharges, only 235 patients committed suicide, which is, of course, probably a good thing from a societal point of view, but it makes the data analysis hard. On the other hand, all-cause mortality was about 18% during nine years of follow-up. So that's not so imbalanced.

And then what they did is they curated a list of 3,000 terms that correspond to what, in the psychiatric literature, is called positive valence. So these are concepts like joy and happiness and good stuff, as opposed to negative valence, like depression and sorrow and all that stuff. And they said, well, we can use these types of terms in order to help distinguish among these patients.

And what they found is that if you plot the Kaplan-Meier curve for different quartiles of risk for these patients, you see that there's a pretty big difference between the different quartiles. And you can certainly identify the people who are more likely to commit suicide from the people who are less likely to do so. This curve is for suicide or accidental death. So this is a much larger data set, and therefore the error bars are smaller. But you see the same kind of separation here. So these are all useful techniques.
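A minimal sketch of that kind of quartile comparison, assuming the lifelines package; the risk scores, follow-up times, and events below are simulated stand-ins, not the study's data.

```python
# Kaplan-Meier curves stratified by risk quartile, assuming lifelines.
# All data here is simulated purely to illustrate the plotting pattern.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
n = 2000
risk = rng.uniform(size=n)                          # a model's predicted risk
time = rng.exponential(scale=1000 * (1.5 - risk))   # higher risk -> shorter time
event = rng.uniform(size=n) < 0.2 * (0.5 + risk)    # whether the event was observed
df = pd.DataFrame({"risk": risk, "time": time, "event": event})
df["quartile"] = pd.qcut(df["risk"], 4, labels=["Q1", "Q2", "Q3", "Q4"])

ax = plt.subplot(111)
for q, grp in df.groupby("quartile"):
    kmf = KaplanMeierFitter()
    kmf.fit(grp["time"], event_observed=grp["event"], label=str(q))
    kmf.plot_survival_function(ax=ax)
plt.xlabel("days of follow-up")
plt.ylabel("event-free fraction")
plt.show()
```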
Now I'll turn to another approach. This was work by one of my students, Yuan Luo, who was working with some lymphoma pathologists at Mass General. And the approach they took was to say, well, if you read a pathology report about somebody with lymphoma, can we tell what type of lymphoma they had from the pathology report if we blank out the part of the report that says, "I, the pathologist, think this person has non-Hodgkin's lymphoma," or something like that? So from the rest of the context, can we make that prediction?

Now, Yuan took a kind of interesting, slightly odd approach to it, which is to treat this as an unsupervised learning problem rather than as a supervised learning problem. So he literally masked the real answer and said, if we just treat everything except what gives away the answer as just data, can we essentially cluster that data in some interesting way so that we rediscover the different types of lymphoma? Now, the reason this turns out to be important is that lymphoma pathologists keep arguing about how to classify lymphomas. Every few years, they revise the classification rules. And so part of his objective was to say, let's try to provide an unbiased, data-driven method that may help identify appropriate characteristics by which to classify these different lymphomas.

So his approach was a tensor factorization approach. You often see data sets like this that are, say, patient by characteristic. In this case, laboratory measurements: systolic and diastolic blood pressure, sodium, potassium, et cetera. That's a very vanilla matrix encoding of data. And then if you add a third dimension to it, like this is at the time of admission, 30 minutes later, 60 minutes later, 90 minutes later, now you have a three-dimensional tensor.

And so just as you can do matrix factorization, as in the picture above, where we say, I'm going to assume my matrix of data is generated by a product of two matrices that are smaller in dimension, and you can train this by saying, I want entries in those two matrices that minimize the reconstruction error: if I multiply these matrices together, then I get back my original matrix plus error, and I want to minimize that error, usually root mean square, or mean square error, or something like that. Well, you can play the same game for a tensor by having a so-called core tensor, which identifies the subsets of characteristics that subdivide each dimension of your data. And then what you do is the same game. You have factor matrices corresponding to each of the dimensions, and if you multiply this core tensor by each of these matrices, you reconstruct the original tensor. And you can train it, again, to minimize the reconstruction loss.
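To make the core-tensor idea concrete, here is a minimal Tucker-decomposition sketch using the tensorly library, on random data. Note that Yuan's actual method, SANTF, is a non-negative tensor factorization with additional structure; this shows only the generic reconstruction-error setup.

```python
# A minimal sketch of Tucker-style tensor factorization, assuming the
# tensorly library. The tensor here is random; in the study the modes were
# patients x words x linguistic subgraphs.
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

X = tl.tensor(np.random.rand(50, 200, 30))      # patients x words x subgraphs

# Core tensor plus one factor matrix per mode
core, factors = tucker(X, rank=[5, 8, 4])

# Multiplying the core back through the factor matrices reconstructs the data
X_hat = tl.tucker_to_tensor((core, factors))
error = tl.norm(X - X_hat) / tl.norm(X)
print(f"relative reconstruction error: {error:.3f}")

# Rows of factors[0] are low-dimensional patient representations that can be
# clustered to look for lymphoma subtypes.
```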
So there are, again, a few more tricks, because this is dealing with language. This is a typical report from one of these lymphoma pathologists. It says immunohistochemical stains show that the follicles, blah, blah, blah, blah, blah; lots and lots of details. And so he needed a representation that could be put into this tensor factorization form.

And what he did is to say, well, let's see. If we look at a statement like this, immunostains show that large atypical cells are strongly positive for CD30 and negative for these other surface expressions. So the sentence tells us relationships among procedures, types of cells, and immunologic factors. And for the feature choice, we can use words, or we can use UMLS concepts, or we can find various kinds of mappings. But he decided that in order to retain the syntactic relationships here, what he would do is use a graphical representation that came out of, again, parsing all of these sentences. And so what you get is that this creates one graph that talks about the strongly positive for CD30, large atypical cells, et cetera. And then you can factor this into subgraphs. And then you also have to identify frequently occurring subgraphs. So for example, "large atypical cells" appears here, and also appears there, and of course will appear in many other places. Yeah?

AUDIENCE: Is this parsing domain-adapted to clinical language? For example, did they incorporate some sort of medical information here or some sort of linguistic--

PETER SZOLOVITS: So in this particular study, he was using the Stanford Parser with some tricks. The Stanford Parser doesn't know a lot of the medical words, and so he basically marked these things as noun phrases. And then the Stanford Parser also doesn't do well with long lists, like the set of immune features.
And so he would recognize those as a pattern, substitute a single made-up word for them, and that made the parser work much better. So there were a whole bunch of little tricks like that in order to adapt it. But it was not a model trained specifically on this. I think it's trained on the Wall Street Journal corpus or something like that. So it's general English.

AUDIENCE: Those are things that he did manually as opposed to, say, [INAUDIBLE]?

PETER SZOLOVITS: No. He did it algorithmically, but he didn't learn which algorithms to use. He made them up by hand. But then, of course, it's a big corpus, and he ran these programs over it that did those transformations. So he calls it two-phase parsing. There's a reference to his paper on the first slide in this section if you're interested in the details. It's described there.

So what he wound up with is a tensor that has patients on one axis and the words appearing in the text on another axis. So he's still using a bag-of-words representation. But the third axis is these language concept subgraphs that we were talking about. And then he does tensor factorization on this. And what's interesting is that it works much better than I expected. So if you look at his technique, which he called SANTF, the precision and recall are about 0.72 and 0.854 macro-averaged and 0.754 micro-averaged, which is much better than the non-negative matrix factorization results, which use only patient by word or patient by subgraph, or, in fact, one where you simply take patient and concatenate the subgraphs and the words in one dimension. So that means that this is actually taking advantage of the three-way relationship.

If you read papers from about 15 or 20 years ago, people got very excited about the idea of bi-clustering, which is, in modern terms, the equivalent of matrix factorization.
It says, given two dimensions of data, I want to cluster things, but I want to cluster them in such a way that the clustering of one dimension helps the clustering of the other dimension. So this is a formal way of doing that relatively efficiently. And tensor factorization is essentially tri-clustering.

So now I'm going to turn to the last of today's big topics, which is language modeling. And this is really where the action is nowadays in natural language processing in general. I would say that natural language processing on clinical data is somewhat behind the state of the art in natural language processing overall. There are fewer corpora that are available, and there are fewer people working on it. And so we're catching up. But I'm going to lead into this somewhat gently.

So what does it mean to model a language? I mean, you could imagine saying it's coming up with a set of parsing rules that define the syntactic structure of the language. Or you could imagine saying, as we suggested last time, coming up with a corresponding set of semantic rules that say that terms in the language correspond to certain concepts, and that they are combinatorially, functionally combined as the syntax directs, in order to give us a semantic representation. We don't know how to do either of those very well. And so the current, contemporary idea about language modeling is to say, given a sequence of tokens, predict the next token. If you could do that perfectly, presumably you would have a good language model. Now obviously, you can't do it perfectly, because we don't always say the same word after some sequence of previous words when we speak. But probabilistically, you can get close to that. And there's usually some kind of Markov assumption that says that the probability of emitting a token, given the stuff that came before it, is ordinarily dependent only on the n previous words rather than on all of history, on everything you've ever said before in your life.
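Written out, the next-token objective and the Markov (n-gram) approximation just described look like this, with perplexity (which comes up next) defined from the same probability. These are the standard textbook definitions rather than formulas from a particular slide.

```latex
P(w_1,\dots,w_T) = \prod_{t=1}^{T} P(w_t \mid w_1,\dots,w_{t-1})
                 \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1},\dots,w_{t-1})

\mathrm{PP}(w_1,\dots,w_T) = P(w_1,\dots,w_T)^{-1/T}
```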
And there's a measure called perplexity, which is essentially the exponentiated entropy of the probability distribution over the predicted words. Roughly speaking, it's the number of likely ways that you could continue the text if all of the possibilities were equally likely.

So perplexity is often used, for example, in speech processing. We did a study where we were trying to build a speech system that understood a conversation between a doctor and a patient. And we ran into real problems, because we were using software that had been developed to interpret dictation by doctors. And that was very well trained. But it turned out, and we didn't know this when we started, that the language that doctors use in dictating medical notes is pretty straightforward, pretty simple. And so its perplexity is about nine, whereas conversations are much more free-flowing and cover many more topics, and so their perplexity is about 73. And the model that works well for perplexity nine doesn't work as well for perplexity 73. So what this tells you about the difficulty of accurately transcribing conversational speech is that it's much harder. And that's still not a solved problem.

Now, you probably all know about Zipf's law. If you empirically just take all the words in all the literature of, let's say, English, what you discover is that the n-th most frequent word is about one over n as probable as the first word. So there is a long-tailed distribution. One thing you should realize, of course, is that if you integrate one over n out to infinity, it diverges. And that may not be an inaccurate representation of language, because language is productive and changes; people make up new words all the time and so on. So it may actually be infinite. But roughly speaking, there is a kind of decline like this. And interestingly, in the Brown corpus, the top 10 words make up almost a quarter of the size of the corpus.
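You can check this yourself in a few lines, assuming NLTK and its copy of the Brown corpus are installed:

```python
# A quick empirical check of Zipf's law on the Brown corpus, assuming NLTK
# is installed (the corpus is fetched with nltk.download).
from collections import Counter
import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)
words = [w.lower() for w in brown.words()]
counts = Counter(words)
total = sum(counts.values())

top = counts.most_common(10)
print("top-10 share of corpus:", sum(c for _, c in top) / total)

# Zipf: the frequency of the rank-n word is roughly proportional to 1/n,
# so rank * frequency should stay in the same ballpark.
for rank, (word, c) in enumerate(counts.most_common(1000), start=1):
    if rank in (1, 10, 100, 1000):
        print(rank, word, c, rank * c)
```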
So you write a lot of the's, of's, and's, a's, to's, in's, et cetera, and much less "hematemesis," obviously.

So what about n-gram models? Well, remember, if we make this Markov assumption, then all we have to do is pay attention to the last n tokens before the one that we're interested in predicting. And so people have generated these large corpora of n-grams. For example, somebody, a couple of decades ago, took all of Shakespeare's writings. I think they were trying to decide whether he had written all his works, or whether the earl of somebody or other was actually the guy who wrote Shakespeare. You know about this controversy? Yeah. So that's why they were doing it. But anyway, they created this corpus. And Shakespeare had a vocabulary of about 30,000 words and about 300,000 distinct bigrams, out of 844 million possible bigrams. So 99.96% of the possible bigrams were never seen. So there's a certain regularity to his production of language.

Now, Google, of course, did Shakespeare one better. They said, hmm, we can take a tera-word corpus; this was in 2006. I wouldn't be surprised if it's a petabyte corpus today. And they published this; they just made it available. So there were 13.6 million unique words that occurred at least 200 times in this tera-word corpus. And there were 1.2 billion five-word sequences that occurred at least 40 times. So these are the statistics. And if you're interested, there's a URL.

And here's a very tiny part of their database. So "ceramics collectibles collectibles" (I don't know) occurred 55 times in a terabyte of text. "Ceramics collectibles fine," "ceramics collectibles by," pottery, cooking, comma, period, end of sentence, and, at, is, et cetera, with different numbers of times. "Ceramics comes from" occurred 660 times, which is a reasonably large number compared to some of its competitors here.
If you look at four-grams, you see things like "serve as the incoming" blah, blah, blah, 92 times; "serve as the index," 223 times; "serve as the initial," 5,300 times. So you've got all these statistics.

And now, given those statistics, we can build a generator. So we can say, all right, suppose I start with the token that is the beginning of a sentence, or the separator between sentences. I sample a random bigram starting with the beginning-of-sentence token and a word, according to its probability. Then I sample the next bigram starting from that word, according to its probability, and keep doing that until I hit the end-of-sentence marker. So for example, here I'm generating a sentence: it starts with "I," then followed by "want," followed by "to," followed by "get," followed by "Chinese," followed by "food," followed by the end of sentence. So I've just generated, "I want to get Chinese food," which sounds like a perfectly good sentence.

So here's what's interesting. If you look back again at the Shakespeare corpus and say, what if we generated Shakespeare from unigrams, you get stuff like, at the top, "To him swallowed confess here both. Which. Of save on trail for are ay device and rote life have." It doesn't sound terribly good. It's not very grammatical. It doesn't have that sort of Shakespearean English flavor, although you do have words like "nave" and "ay" and so on that are vaguely reminiscent. Now, if you go to bigrams, it starts to sound a little better. "What means, sir. I confess she? Then all sorts, he is trim, captain." That doesn't make any sense, but it starts to sound a little better. And with trigrams, we get, "Sweet prince, Falstaff shall die. Harry of Monmouth," et cetera. So this is beginning to sound a little Shakespearean. And if you go to quadrigrams, you get, "King Henry. What? I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv'd in," et cetera.
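Here is a minimal sketch of that sampling procedure over bigram counts; the toy corpus is made up, but the same loop run over Shakespeare or the Google counts produces text like the examples shown here.

```python
# A minimal bigram text generator: estimate bigram counts from a toy corpus,
# then repeatedly sample the next word given the current one until the
# end-of-sentence marker appears. The corpus is made up for illustration.
import random
from collections import defaultdict, Counter

corpus = [
    "<s> i want to get chinese food </s>",
    "<s> i want to eat lunch </s>",
    "<s> i like chinese food </s>",
]

bigrams = defaultdict(Counter)
for sent in corpus:
    tokens = sent.split()
    for a, b in zip(tokens, tokens[1:]):
        bigrams[a][b] += 1

def generate(max_len=20):
    word, out = "<s>", []
    for _ in range(max_len):
        nxt = bigrams[word]
        word = random.choices(list(nxt), weights=nxt.values())[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(generate())
```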
When I first saw this, like 20 years ago or something, I was stunned. This is actually generating stuff that sounds vaguely Shakespearean and vaguely English-like.

Here's an example of generating the Wall Street Journal. From unigrams, "Months the my and issue of year foreign new exchanges September were recession." It's word salad. But if you go to trigrams, "They also point to ninety nine point six billion from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil." So you could imagine that this is some Wall Street Journal writer on acid writing this text, because it has a little bit of the right kind of flavor.

So more recently, people said, well, we ought to be able to make use of this in some systematic way to help us with our language analysis tasks. To me, the first effort in this direction was Word2Vec, which was Mikolov's approach to doing this. And he developed two models. He said, let's build a continuous bag-of-words model, which says that what we're going to use is co-occurrence data on a series of tokens in the text that we're trying to model. And we're going to use a neural network model to predict the word from the words around it. And in that process, we're going to use the parameters of that neural network model as a vector, and that vector will be the representation of that word. And so what we're going to find is that words that tend to appear in the same context will have similar representations in this high-dimensional vector space. And by the way, high-dimensional: people typically use something like 300- or 500-dimensional vectors. So it's a big space, and the words are scattered throughout it. But you get this kind of cohesion, where words that are used in the same context appear close to each other.
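Here is a minimal sketch of training such embeddings with the gensim library; the toy sentences stand in for a large corpus, and the sg flag switches between the continuous bag-of-words model and the skip-gram model described next.

```python
# A minimal sketch of training Word2Vec embeddings with gensim. The toy
# sentences stand in for a large corpus of notes or articles.
from gensim.models import Word2Vec

sentences = [
    "the patient was started on insulin for diabetes".split(),
    "metformin was added for glycemic control".split(),
    "the patient reports chest pain on exertion".split(),
]

model = Word2Vec(
    sentences,
    vector_size=300,   # 300- or 500-dimensional vectors are typical
    window=5,          # context words on each side
    sg=0,              # 0 = continuous bag-of-words, 1 = skip-gram
    min_count=1,
    epochs=50,
)

vec = model.wv["insulin"]                    # the learned 300-d representation
print(model.wv.most_similar("insulin", topn=3))
```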
772 00:44:55,430 --> 00:44:58,250 And the extrapolation of that is that if words 773 00:44:58,250 --> 00:45:00,500 are used in the same context, maybe 774 00:45:00,500 --> 00:45:03,890 they share something about meaning. 775 00:45:03,890 --> 00:45:06,405 So the other model is a skip-gram model, 776 00:45:06,405 --> 00:45:07,780 where you're doing the prediction 777 00:45:07,780 --> 00:45:08,920 in the other direction. 778 00:45:08,920 --> 00:45:13,300 From a word, you're predicting the words that are around it. 779 00:45:13,300 --> 00:45:16,330 And again, you are using a neural network model 780 00:45:16,330 --> 00:45:17,950 to do that. 781 00:45:17,950 --> 00:45:20,800 And you use the parameters of that model 782 00:45:20,800 --> 00:45:27,050 in order to represent the word that you're focused on. 783 00:45:27,050 --> 00:45:31,240 So what came as a surprise to me is this claim that's 784 00:45:31,240 --> 00:45:35,830 in his original paper, which is that not only do you 785 00:45:35,830 --> 00:45:43,030 get this effect of locality as corresponding meaning 786 00:45:43,030 --> 00:45:46,630 but that you get relationships that are geometrically 787 00:45:46,630 --> 00:45:50,770 represented in the space of these embeddings. 788 00:45:50,770 --> 00:45:53,980 And so what you see is that if you 789 00:45:53,980 --> 00:45:58,510 take the encoding of the word man and the word woman 790 00:45:58,510 --> 00:46:01,450 and look at the vector difference between them, 791 00:46:01,450 --> 00:46:05,530 and then apply that same vector difference to king, 792 00:46:05,530 --> 00:46:07,570 you get close to queen. 793 00:46:07,570 --> 00:46:11,410 And if you apply it uncle, you get close to aunt. 794 00:46:11,410 --> 00:46:13,630 And so they showed a number of examples. 795 00:46:13,630 --> 00:46:15,520 And then people have studied this. 796 00:46:15,520 --> 00:46:17,500 It doesn't hold it perfectly well. 797 00:46:17,500 --> 00:46:21,010 I mean, it's not like we've solved the semantics problem. 798 00:46:21,010 --> 00:46:24,040 But it is a genuine relationship. 799 00:46:24,040 --> 00:46:25,930 The place where it doesn't work well 800 00:46:25,930 --> 00:46:30,460 is when some of these things are much more frequent than others. 801 00:46:30,460 --> 00:46:33,970 And so one of the examples that's often cited 802 00:46:33,970 --> 00:46:41,420 is if you go, London is to England as Paris is to France, 803 00:46:41,420 --> 00:46:43,040 and that one works. 804 00:46:43,040 --> 00:46:47,950 But then you say as Kuala Lumpur is to Malaysia, 805 00:46:47,950 --> 00:46:50,500 and that one doesn't work so well. 806 00:46:50,500 --> 00:46:57,310 And then you go, as Juba or something 807 00:46:57,310 --> 00:47:01,090 is to whatever country it's the capital of. 808 00:47:01,090 --> 00:47:05,140 And since we don't write about Africa in our newspapers, 809 00:47:05,140 --> 00:47:07,040 there's very little data on that. 810 00:47:07,040 --> 00:47:10,420 And so that doesn't work so well. 811 00:47:10,420 --> 00:47:13,150 So there was this other paper later 812 00:47:13,150 --> 00:47:16,960 from van der Maaten and Geoff Hinton, 813 00:47:16,960 --> 00:47:19,930 where they came up with a visualization method 814 00:47:19,930 --> 00:47:22,180 to take these high-dimensional vectors 815 00:47:22,180 --> 00:47:25,090 and visualize them in two dimensions. 
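To make that analogy arithmetic concrete, here is a minimal sketch using the gensim library. This is not Mikolov's code; the corpus loader is a hypothetical helper, the hyperparameters are placeholders, and gensim's 4.x API is assumed.

```python
from gensim.models import Word2Vec

# `sentences` is assumed to be an iterable of tokenized sentences,
# e.g. [["the", "king", "spoke"], ["the", "queen", "smiled"], ...].
sentences = load_tokenized_corpus()   # hypothetical helper, not a real library call

model = Word2Vec(
    sentences,
    vector_size=300,   # 300- or 500-dimensional vectors, as mentioned above
    window=5,          # context window on each side of the target word
    sg=1,              # 1 = skip-gram (predict context from word); 0 = CBOW
    min_count=5,
)

# The king - man + woman ~= queen arithmetic:
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```

As noted, this works best for frequent pairs like London/England and gets noisier for rarer ones like Kuala Lumpur/Malaysia. The van der Maaten and Hinton visualization method mentioned above (t-SNE) is what produces the two-dimensional maps described next.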
816 00:47:25,090 --> 00:47:28,750 And what you see is that if you take a bunch of concepts 817 00:47:28,750 --> 00:47:30,520 that are count concepts-- 818 00:47:30,520 --> 00:47:36,490 so 1/2, 30, 15, 5, 4, 2, 3, several, some, many, 819 00:47:36,490 --> 00:47:38,530 et cetera-- 820 00:47:38,530 --> 00:47:41,450 there is a geometric relationship between them. 821 00:47:41,450 --> 00:47:45,380 So they, in fact, do map to the same part of the space. 822 00:47:45,380 --> 00:47:48,970 Similarly, minister, leader, president, chairman, director, 823 00:47:48,970 --> 00:47:51,580 spokesman, chief, head, et cetera 824 00:47:51,580 --> 00:47:54,420 form a kind of cluster in the space. 825 00:47:54,420 --> 00:47:58,540 So there's definitely something to this. 826 00:47:58,540 --> 00:48:04,120 I promised you that I would get back to a different attempt 827 00:48:04,120 --> 00:48:06,880 to try to take a core of concepts 828 00:48:06,880 --> 00:48:09,640 that you want to use for term-spotting 829 00:48:09,640 --> 00:48:13,780 and develop an automated way of enlarging that set of concepts 830 00:48:13,780 --> 00:48:17,080 in order to give you a richer vocabulary by which 831 00:48:17,080 --> 00:48:20,480 to try to identify cases that you're interested in. 832 00:48:20,480 --> 00:48:23,480 So this was by some of my colleagues, 833 00:48:23,480 --> 00:48:27,310 including Kat, who you saw on Tuesday. 834 00:48:27,310 --> 00:48:32,800 And they said, well, what we'd like 835 00:48:32,800 --> 00:48:35,770 is a fully automated and robust, unsupervised feature 836 00:48:35,770 --> 00:48:38,860 selection method that leverages only publicly 837 00:48:38,860 --> 00:48:42,910 available medical knowledge sources instead of EHR data. 838 00:48:42,910 --> 00:48:46,690 So the method that David's group had developed, 839 00:48:46,690 --> 00:48:49,870 which we talked about earlier, uses data 840 00:48:49,870 --> 00:48:51,790 from electronic health records, which 841 00:48:51,790 --> 00:48:54,520 means that you move to different hospitals 842 00:48:54,520 --> 00:48:56,690 and there may be different conventions. 843 00:48:56,690 --> 00:48:58,390 And you might imagine that you have 844 00:48:58,390 --> 00:49:03,880 to retrain that sort of method, whereas here the idea is 845 00:49:03,880 --> 00:49:06,910 to derive these surrogate features from knowledge 846 00:49:06,910 --> 00:49:08,110 sources. 847 00:49:08,110 --> 00:49:13,330 So unlike that earlier model, here they built a Word2Vec 848 00:49:13,330 --> 00:49:17,620 skip-gram model from about 5 million Springer articles-- 849 00:49:17,620 --> 00:49:21,610 so these are published medical articles-- 850 00:49:21,610 --> 00:49:25,420 to yield 500 dimensional vectors for each word. 851 00:49:25,420 --> 00:49:29,800 And then what they did is they took the concept names 852 00:49:29,800 --> 00:49:33,130 that they were interested in and their definitions 853 00:49:33,130 --> 00:49:38,580 from the UMLS, and then they summed 854 00:49:38,580 --> 00:49:42,390 the word vectors for each of these words, weighted 855 00:49:42,390 --> 00:49:44,650 by inverse document frequency. 856 00:49:44,650 --> 00:49:48,485 So it's sort of a TF-IDF-like approach 857 00:49:48,485 --> 00:49:51,240 to weight different words.
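A sketch of that step: embed a phenotype by an IDF-weighted sum of the word vectors in its UMLS name and definition, then, as described next, rank candidate concepts by cosine similarity. The function and variable names and the dimensions here are illustrative, not the paper's code.

```python
import numpy as np

def idf_weighted_embedding(tokens, word_vectors, idf, dim=500):
    """Sum the vectors of the words in a concept name/definition,
    weighting each word by its inverse document frequency."""
    vec = np.zeros(dim)
    for tok in tokens:
        if tok in word_vectors:
            vec += idf.get(tok, 1.0) * word_vectors[tok]
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def top_k_related(phenotype_vec, candidate_vecs, k=10):
    """Rank candidate concepts by cosine similarity to the phenotype embedding.
    Vectors are assumed L2-normalized, so the dot product is the cosine."""
    scored = [(name, float(phenotype_vec @ v)) for name, v in candidate_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# word_vectors: dict mapping word -> 500-d numpy array from the skip-gram model
# idf: dict mapping word -> inverse document frequency
# candidate_vecs: dict mapping candidate concept name -> normalized embedding
```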
858 00:49:51,240 --> 00:49:53,700 And then they went out and they said, OK, 859 00:49:53,700 --> 00:49:56,610 for every disease that's mentioned in Wikipedia, 860 00:49:56,610 --> 00:49:59,760 Medscape, eMedicine, the Merck Manuals Professional 861 00:49:59,760 --> 00:50:03,390 Edition, the Mayo Clinic Diseases and Conditions, 862 00:50:03,390 --> 00:50:06,120 MedlinePlus Medical Encyclopedia, 863 00:50:06,120 --> 00:50:09,330 they used named entity recognition techniques 864 00:50:09,330 --> 00:50:15,550 to find all the concepts that are related to this phenotype. 865 00:50:15,550 --> 00:50:19,080 So then they said, well, there's a lot of randomness 866 00:50:19,080 --> 00:50:22,840 in these sources, and maybe in our extraction techniques. 867 00:50:22,840 --> 00:50:25,320 But if we insist that some concept appear 868 00:50:25,320 --> 00:50:28,810 in at least three of these five sources, 869 00:50:28,810 --> 00:50:32,400 then we can be pretty confident that it's a relevant concept. 870 00:50:32,400 --> 00:50:34,480 And so they said, OK, we'll do that. 871 00:50:34,480 --> 00:50:37,130 Then they chose the top k concepts 872 00:50:37,130 --> 00:50:41,190 whose embedding vectors are closest by cosine distance 873 00:50:41,190 --> 00:50:43,020 to the embedding of this phenotype 874 00:50:43,020 --> 00:50:44,850 that they've calculated. 875 00:50:44,850 --> 00:50:47,280 And they say, OK, the phenotype is 876 00:50:47,280 --> 00:50:51,970 going to be a linear combination of all these related concepts. 877 00:50:51,970 --> 00:50:55,840 So again, this is a bit similar to what we saw before. 878 00:50:55,840 --> 00:50:58,110 But here, instead of extracting the data 879 00:50:58,110 --> 00:51:01,110 from electronic medical records, they're 880 00:51:01,110 --> 00:51:04,680 extracting it from published literature and these web 881 00:51:04,680 --> 00:51:07,260 sources. 882 00:51:07,260 --> 00:51:16,230 And again, what you see is that the expert-curated features 883 00:51:16,230 --> 00:51:22,050 for these five phenotypes, which are coronary artery 884 00:51:22,050 --> 00:51:24,180 disease, rheumatoid arthritis, Crohn's 885 00:51:24,180 --> 00:51:29,070 disease, ulcerative colitis, and pediatric pulmonary arterial 886 00:51:29,070 --> 00:51:37,260 hypertension, they started with 20 to 50 curated features. 887 00:51:37,260 --> 00:51:39,150 So these were the ones that the doctors 888 00:51:39,150 --> 00:51:44,610 said, OK, these are the anchors in David's terminology. 889 00:51:44,610 --> 00:51:51,090 And then they expanded these to a larger set 890 00:51:51,090 --> 00:51:56,850 using the technique that I just described, and then selected 891 00:51:56,850 --> 00:52:04,515 down to the top n that were effective in finding 892 00:52:04,515 --> 00:52:06,360 relevant phenotypes. 893 00:52:06,360 --> 00:52:13,140 And this is a terrible graph that summarizes the results. 894 00:52:13,140 --> 00:52:19,590 But what you're seeing is that the orange lines are based 895 00:52:19,590 --> 00:52:22,830 on the expert-curated features. 896 00:52:22,830 --> 00:52:28,920 This is based on an earlier version of trying to do this. 897 00:52:28,920 --> 00:52:33,000 And SEDFE is the technique that I've just described. 898 00:52:33,000 --> 00:52:37,410 And what you see is that the automatic techniques 899 00:52:37,410 --> 00:52:42,000 for many of these phenotypes are just about as good 900 00:52:42,000 --> 00:52:44,760 as the manually curated ones. 
901 00:52:44,760 --> 00:52:47,640 And of course, they require much less manual curation. 902 00:52:47,640 --> 00:52:52,980 Because they're using this automatic learning approach. 903 00:52:52,980 --> 00:52:56,100 Another interesting example to return 904 00:52:56,100 --> 00:52:58,770 to the theme of de-identification 905 00:52:58,770 --> 00:53:02,380 is a couple of my students, a few years ago, 906 00:53:02,380 --> 00:53:06,150 built a new de-identifier that has this rather 907 00:53:06,150 --> 00:53:08,280 complicated architecture. 908 00:53:08,280 --> 00:53:13,680 So it starts with a bi-directional recursive neural 909 00:53:13,680 --> 00:53:18,330 network model that is implemented 910 00:53:18,330 --> 00:53:23,280 over the character sequences of words in the medical text. 911 00:53:23,280 --> 00:53:25,920 So why character sequences? 912 00:53:25,920 --> 00:53:27,841 Why might those be important? 913 00:53:33,140 --> 00:53:38,090 Well, consider a misspelled word, for example. 914 00:53:38,090 --> 00:53:41,120 Most of the character sequence is correct. 915 00:53:41,120 --> 00:53:44,600 There will be a bug in it at the misspelling. 916 00:53:44,600 --> 00:53:47,540 Or consider that a lot of medical terms 917 00:53:47,540 --> 00:53:50,060 are these compound terms, where they're 918 00:53:50,060 --> 00:53:53,120 made up of lots of pieces that correspond 919 00:53:53,120 --> 00:53:56,360 to Greek or Latin roots. 920 00:53:56,360 --> 00:54:00,440 So learning those can actually be very helpful. 921 00:54:00,440 --> 00:54:02,990 So you start with that model. 922 00:54:02,990 --> 00:54:06,110 You then could concatenate the results 923 00:54:06,110 --> 00:54:10,250 from both the left-running and the right-running recursive 924 00:54:10,250 --> 00:54:12,140 neural network. 925 00:54:12,140 --> 00:54:18,095 And concatenate that with the Word2Vec embedding 926 00:54:18,095 --> 00:54:20,850 of the whole word. 927 00:54:20,850 --> 00:54:26,490 And you feed that into another bi-directional RNN layer. 928 00:54:26,490 --> 00:54:33,050 And then for each word, you take the output of those RNNs, 929 00:54:33,050 --> 00:54:36,650 run them through a feed-forward neural network in order 930 00:54:36,650 --> 00:54:38,940 to estimate the prob-- 931 00:54:38,940 --> 00:54:40,310 it's like a soft max. 932 00:54:40,310 --> 00:54:44,900 And you estimate the probability of this word belonging 933 00:54:44,900 --> 00:54:49,280 to a particular category of personally identifiable health 934 00:54:49,280 --> 00:54:50,300 information. 935 00:54:50,300 --> 00:54:51,440 So is it a name? 936 00:54:51,440 --> 00:54:52,520 Is it an address? 937 00:54:52,520 --> 00:54:53,570 Is it a phone number? 938 00:54:53,570 --> 00:54:56,150 Is it or whatever? 939 00:54:56,150 --> 00:54:59,480 And then the top layer is a kind of conditional random 940 00:54:59,480 --> 00:55:04,970 field-like layer that imposes a sequential probability 941 00:55:04,970 --> 00:55:10,490 distribution that says, OK, if you've seen a name, then 942 00:55:10,490 --> 00:55:14,220 what's the next most likely thing that you're going to see? 943 00:55:14,220 --> 00:55:19,220 And so you combine that with the probability distributions 944 00:55:19,220 --> 00:55:24,920 for each word in order to identify the category of PHI 945 00:55:24,920 --> 00:55:28,860 or non-PHI for that word. 946 00:55:28,860 --> 00:55:31,400 And this did insanely well. 
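To make the data flow of that architecture concrete, here is a rough PyTorch sketch. The dimensions and names are placeholders, an LSTM stands in for the recurrent units, and the CRF-like top layer is only indicated in a comment; the students' actual implementation differs.

```python
import torch
import torch.nn as nn

class DeidTagger(nn.Module):
    """Character BiLSTM -> concatenate with word embedding -> word BiLSTM
    -> per-word feed-forward scores over PHI categories. A CRF-style layer
    (omitted here) would sit on top to model label-sequence dependencies."""

    def __init__(self, n_chars, n_words, n_labels,
                 char_dim=25, word_dim=300, hidden=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_rnn = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(n_words, word_dim)  # e.g. initialized from Word2Vec
        self.word_rnn = nn.LSTM(word_dim + 2 * hidden, hidden,
                                bidirectional=True, batch_first=True)
        self.scorer = nn.Linear(2 * hidden, n_labels)    # name, address, phone, non-PHI, ...

    def forward(self, char_ids, word_ids):
        # char_ids: (n_tokens, max_chars); word_ids: (n_tokens,) for one sentence
        _, (h, _) = self.char_rnn(self.char_emb(char_ids))
        char_feat = torch.cat([h[0], h[1]], dim=-1)       # final states, both directions
        word_feat = torch.cat([self.word_emb(word_ids), char_feat], dim=-1)
        ctx, _ = self.word_rnn(word_feat.unsqueeze(0))    # add a batch dimension
        return self.scorer(ctx.squeeze(0))                # per-token label scores
```

The numbers below are what "insanely well" means in practice.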
947 00:55:31,400 --> 00:55:41,000 So optimized by F1 score, we're up at a precision of 99.2%, 948 00:55:41,000 --> 00:55:44,270 recall of 99.3%. 949 00:55:44,270 --> 00:55:51,290 Optimized by recall, we're up at about 98%, 99% 950 00:55:51,290 --> 00:55:53,240 for each of them. 951 00:55:53,240 --> 00:55:55,370 So this is doing quite well. 952 00:55:55,370 --> 00:56:00,030 Now, there is a non-machine learning comment to make, 953 00:56:00,030 --> 00:56:02,570 which is that if you read the HIPAA law, the HIPAA 954 00:56:02,570 --> 00:56:05,660 regulations, they don't say that you 955 00:56:05,660 --> 00:56:10,400 must get rid of 99% of the personally 956 00:56:10,400 --> 00:56:13,760 identifying information in order to be able to share 957 00:56:13,760 --> 00:56:15,500 this data for research. 958 00:56:15,500 --> 00:56:18,761 It says you have to get rid of all of it. 959 00:56:18,761 --> 00:56:23,770 So no technique we know is 100% perfect. 960 00:56:23,770 --> 00:56:27,840 And so there's a kind of practical understanding 961 00:56:27,840 --> 00:56:30,240 among people who work on this stuff 962 00:56:30,240 --> 00:56:32,850 that nothing's going to be perfect. 963 00:56:32,850 --> 00:56:36,990 And therefore, that you can get away with a little bit. 964 00:56:36,990 --> 00:56:42,300 But legally, you're on thin ice. 965 00:56:42,300 --> 00:56:46,590 So I remember many years ago, my wife was in law school. 966 00:56:46,590 --> 00:56:51,600 And I asked her at one point, so what can people sue you for? 967 00:56:51,600 --> 00:56:55,640 And she said, absolutely anything. 968 00:56:55,640 --> 00:56:57,430 They may not win. 969 00:56:57,430 --> 00:57:00,180 But they can be a real pain if you have 970 00:57:00,180 --> 00:57:02,460 to go defend yourself in court. 971 00:57:02,460 --> 00:57:04,750 And so this hasn't played out yet. 972 00:57:04,750 --> 00:57:08,910 We don't know if a de-identifier that 973 00:57:08,910 --> 00:57:13,050 is 99% sensitive and 99% specific 974 00:57:13,050 --> 00:57:17,730 will pass muster with people who agree to release data sets. 975 00:57:17,730 --> 00:57:21,000 Because they're worried, too, about winding up 976 00:57:21,000 --> 00:57:23,700 in the newspaper or winding up getting sued. 977 00:57:26,910 --> 00:57:28,810 Last topic for today-- 978 00:57:28,810 --> 00:57:34,980 so if you read this interesting blog, which, by the way, 979 00:57:34,980 --> 00:57:39,870 has a very good tutorial on BERT, 980 00:57:39,870 --> 00:57:43,290 he says, "The year 2018 has been an inflection point for machine 981 00:57:43,290 --> 00:57:47,850 learning models handling text, or more accurately, NLP. 982 00:57:47,850 --> 00:57:49,680 Our conceptual understanding of how 983 00:57:49,680 --> 00:57:52,770 best to represent words and sentences in a way 984 00:57:52,770 --> 00:57:55,710 that best captures underlying meanings and relationships 985 00:57:55,710 --> 00:57:57,760 is rapidly evolving." 986 00:57:57,760 --> 00:58:00,330 And so there are a whole bunch of new ideas 987 00:58:00,330 --> 00:58:05,530 that have come about in about the last year or two years, 988 00:58:05,530 --> 00:58:10,410 including ELMo, which learns context-specific embeddings, 989 00:58:10,410 --> 00:58:13,920 the Transformer architecture, this BERT approach. 
990 00:58:13,920 --> 00:58:19,470 And then I'll end with just showing you this gigantic GPT 991 00:58:19,470 --> 00:58:24,060 model that was developed by the OpenAI people, which 992 00:58:24,060 --> 00:58:27,360 does remarkably better than the stuff I showed you 993 00:58:27,360 --> 00:58:31,690 before in generating language. 994 00:58:31,690 --> 00:58:33,160 All right. 995 00:58:33,160 --> 00:58:36,010 If you look inside Google Translate, 996 00:58:36,010 --> 00:58:40,180 at least as of not long ago, what you find 997 00:58:40,180 --> 00:58:43,260 is a model like this. 998 00:58:43,260 --> 00:58:49,470 So it's essentially an LSTM model that takes input words 999 00:58:49,470 --> 00:58:53,970 and munges them together into some representation, 1000 00:58:53,970 --> 00:58:58,980 a high-dimensional vector representation, that summarizes 1001 00:58:58,980 --> 00:59:03,270 everything that the model knows about that sentence 1002 00:59:03,270 --> 00:59:06,330 that you've just fed it. 1003 00:59:06,330 --> 00:59:08,550 Obviously, it has to be a pretty high-dimensional 1004 00:59:08,550 --> 00:59:12,120 representation, because your sentence could be about almost 1005 00:59:12,120 --> 00:59:13,690 anything. 1006 00:59:13,690 --> 00:59:17,520 And so it's important to be able to capture all 1007 00:59:17,520 --> 00:59:19,980 that in this representation. 1008 00:59:19,980 --> 00:59:22,170 But basically, at this point, you 1009 00:59:22,170 --> 00:59:24,340 start generating the output. 1010 00:59:24,340 --> 00:59:27,130 So if you're translating English to French, 1011 00:59:27,130 --> 00:59:29,310 these are English words coming in, 1012 00:59:29,310 --> 00:59:32,670 and these are French words going out, in sort of the way 1013 00:59:32,670 --> 00:59:35,190 I showed you, where we're generating Shakespeare 1014 00:59:35,190 --> 00:59:39,030 or we're generating Wall Street Journal text. 1015 00:59:41,910 --> 00:59:45,780 But the critical feature here is that in the initial version 1016 00:59:45,780 --> 00:59:48,210 of this, everything that you learned 1017 00:59:48,210 --> 00:59:51,870 about this English sentence had to be encoded in this one 1018 00:59:51,870 --> 00:59:58,150 vector that got passed from the encoder into the decoder, 1019 00:59:58,150 --> 01:00:03,720 or from the source language into the target language generator. 1020 01:00:03,720 --> 01:00:06,930 So then someone came along and said, hmm-- 1021 01:00:06,930 --> 01:00:11,470 someone, namely these guys, came along and said, 1022 01:00:11,470 --> 01:00:13,440 wouldn't it be nice if we could provide 1023 01:00:13,440 --> 01:00:17,430 some auxiliary information to the generator that said, 1024 01:00:17,430 --> 01:00:19,980 hey, which part of the input sentence 1025 01:00:19,980 --> 01:00:23,120 should you pay attention to? 1026 01:00:23,120 --> 01:00:25,790 And of course, there's no fixed answer to that. 1027 01:00:25,790 --> 01:00:29,180 I mean, if I'm translating an arbitrary English sentence 1028 01:00:29,180 --> 01:00:32,840 into an arbitrary French sentence, I can't say, 1029 01:00:32,840 --> 01:00:36,770 in general, look at the third word in the English sentence 1030 01:00:36,770 --> 01:00:39,680 when you're generating the third word in the French sentence. 1031 01:00:39,680 --> 01:00:43,040 Because that may or may not be true, depending 1032 01:00:43,040 --> 01:00:44,780 on the particular sentence. 
1033 01:00:44,780 --> 01:00:46,520 But on the other hand, the intuition 1034 01:00:46,520 --> 01:00:50,060 is that there is such a positional dependence 1035 01:00:50,060 --> 01:00:56,030 and a dependence on what the particular English word was 1036 01:00:56,030 --> 01:01:00,330 that is an important component of generating the French word. 1037 01:01:00,330 --> 01:01:04,190 And so they created this idea that in addition 1038 01:01:04,190 --> 01:01:10,340 to passing along the this vector that 1039 01:01:10,340 --> 01:01:13,490 encodes the meaning of the entire input 1040 01:01:13,490 --> 01:01:18,680 and the previous word that you had generated in the output, 1041 01:01:18,680 --> 01:01:23,730 in addition, we pass along this other information that says, 1042 01:01:23,730 --> 01:01:27,320 which of the input words should we pay attention to? 1043 01:01:27,320 --> 01:01:30,110 And how much attention should we pay to them? 1044 01:01:30,110 --> 01:01:34,520 And of course, in the style of these embeddings, 1045 01:01:34,520 --> 01:01:37,520 these are all represented by high-dimensional vectors, 1046 01:01:37,520 --> 01:01:41,540 high-dimensional real number vectors that 1047 01:01:41,540 --> 01:01:44,030 get combined with the other vectors 1048 01:01:44,030 --> 01:01:46,880 in order to produce the output. 1049 01:01:46,880 --> 01:01:53,660 Now, a classical linguist would look at this and retch. 1050 01:01:53,660 --> 01:01:57,980 Because this looks nothing like classical linguistics. 1051 01:01:57,980 --> 01:02:04,160 It's just numerology that gets trained by stochastic gradient 1052 01:02:04,160 --> 01:02:08,240 descent methods in order to optimize the output. 1053 01:02:08,240 --> 01:02:12,990 But from an engineering point of view, it works quite well. 1054 01:02:12,990 --> 01:02:16,700 So then for a while, that was the state of the art. 1055 01:02:16,700 --> 01:02:22,640 And then last year, these guys, Vaswani et al. 1056 01:02:22,640 --> 01:02:27,920 came along and said, you know, we now 1057 01:02:27,920 --> 01:02:30,020 have this complicated architecture, 1058 01:02:30,020 --> 01:02:34,490 where we are doing the old-style translation where 1059 01:02:34,490 --> 01:02:37,250 we summarize everything into one vector, 1060 01:02:37,250 --> 01:02:41,690 and then use that to generate a sequence of outputs. 1061 01:02:41,690 --> 01:02:43,850 And we have this attention mechanism 1062 01:02:43,850 --> 01:02:47,450 that tells us how much of various inputs 1063 01:02:47,450 --> 01:02:52,040 to use in generating each element of the output. 1064 01:02:52,040 --> 01:02:55,050 Is the first of those actually necessary? 1065 01:02:55,050 --> 01:02:58,040 And so they published this lovely paper saying attention 1066 01:02:58,040 --> 01:03:00,740 is all you need, that says, hey, you 1067 01:03:00,740 --> 01:03:04,280 know that thing that you guys have added to this translation 1068 01:03:04,280 --> 01:03:05,720 model. 1069 01:03:05,720 --> 01:03:07,790 Not only is it a useful addition, 1070 01:03:07,790 --> 01:03:12,770 but in fact, it can take the place of the original model. 1071 01:03:12,770 --> 01:03:16,340 And so the Transformer is an architecture that 1072 01:03:16,340 --> 01:03:19,280 is the hottest thing since sliced bread 1073 01:03:19,280 --> 01:03:23,940 at the moment, that says, OK, here's what we do. 1074 01:03:23,940 --> 01:03:25,580 We take the inputs. 1075 01:03:25,580 --> 01:03:29,400 We calculate some embedding for them. 
1076 01:03:29,400 --> 01:03:31,460 We then want to retain the position, 1077 01:03:31,460 --> 01:03:35,380 because of course, the sequence in which the words appear, 1078 01:03:35,380 --> 01:03:36,890 it matters. 1079 01:03:36,890 --> 01:03:39,590 And the positional encoding is this weird thing 1080 01:03:39,590 --> 01:03:44,230 where it encodes using sine waves so that-- 1081 01:03:44,230 --> 01:03:46,700 it's an orthogonal basis. 1082 01:03:46,700 --> 01:03:49,460 And so it has nice characteristics. 1083 01:03:49,460 --> 01:03:52,370 And then we run it into an attention model 1084 01:03:52,370 --> 01:03:54,890 that is essentially computing self-attention. 1085 01:03:54,890 --> 01:03:58,145 So it's saying what-- 1086 01:03:58,145 --> 01:04:02,870 it's like Word2Vec, except in a more sophisticated way. 1087 01:04:02,870 --> 01:04:06,260 So it's looking at all the words in the sentence 1088 01:04:06,260 --> 01:04:11,270 and saying, which words is this word most related to? 1089 01:04:13,890 --> 01:04:17,580 And then, in order to complicate it some more, 1090 01:04:17,580 --> 01:04:20,280 they say, well, we don't want just a single notion 1091 01:04:20,280 --> 01:04:21,420 of attention. 1092 01:04:21,420 --> 01:04:25,210 We want multiple notions of attention. 1093 01:04:25,210 --> 01:04:27,240 So what does that sound like? 1094 01:04:27,240 --> 01:04:30,510 Well, to me, it sounds a bit like what 1095 01:04:30,510 --> 01:04:34,230 you see in convolutional neural networks, 1096 01:04:34,230 --> 01:04:39,270 where often when you're processing an image with a CNN, 1097 01:04:39,270 --> 01:04:42,240 you're not only applying one filter to the image 1098 01:04:42,240 --> 01:04:45,540 but you're applying a whole bunch of different filters. 1099 01:04:45,540 --> 01:04:47,820 And because you initialize them randomly, 1100 01:04:47,820 --> 01:04:50,520 you hope that they will converge to things 1101 01:04:50,520 --> 01:04:55,370 that actually detect different interesting properties 1102 01:04:55,370 --> 01:04:56,920 of the image. 1103 01:04:56,920 --> 01:04:58,710 So the same idea here-- 1104 01:04:58,710 --> 01:05:00,210 that what they're doing is they're 1105 01:05:00,210 --> 01:05:06,330 starting with a bunch of these attention matrices and saying, 1106 01:05:06,330 --> 01:05:07,980 we initialize them randomly. 1107 01:05:07,980 --> 01:05:10,260 They will evolve into something that 1108 01:05:10,260 --> 01:05:14,860 is most useful for helping us deal with the overall problem. 1109 01:05:14,860 --> 01:05:17,400 So then they run this through a series 1110 01:05:17,400 --> 01:05:22,290 of, I think, in Vaswani's paper, something like six layers that 1111 01:05:22,290 --> 01:05:24,300 are just replicated. 1112 01:05:24,300 --> 01:05:30,510 And there are additional things like feeding forward the input 1113 01:05:30,510 --> 01:05:36,240 signal in order to add it to the output signal of the stage, 1114 01:05:36,240 --> 01:05:39,750 and then normalizing, and then rerunning it, 1115 01:05:39,750 --> 01:05:42,900 and then running it through a feed-forward network that 1116 01:05:42,900 --> 01:05:47,550 also has a bypass that combines the input with the output 1117 01:05:47,550 --> 01:05:49,500 of the feed-forward network. 1118 01:05:49,500 --> 01:05:52,890 And then you do this six times, or n times. 1119 01:05:52,890 --> 01:05:57,260 And that then feeds into the generator. 
1120 01:05:57,260 --> 01:06:02,390 And the generator then uses a very similar architecture 1121 01:06:02,390 --> 01:06:04,820 to calculate output probabilities, 1122 01:06:04,820 --> 01:06:09,330 And then it samples from those in order to generate the text. 1123 01:06:09,330 --> 01:06:12,230 So this is sort of the contemporary way 1124 01:06:12,230 --> 01:06:16,190 that one can do translation, using this approach. 1125 01:06:16,190 --> 01:06:19,780 Obviously, I don't have time to go into all the details of how 1126 01:06:19,780 --> 01:06:21,440 all this is done. 1127 01:06:21,440 --> 01:06:23,960 And I'd probably do it wrong anyway. 1128 01:06:23,960 --> 01:06:27,710 But you can look at the paper, which gives a good explanation. 1129 01:06:27,710 --> 01:06:30,590 And that blog that I pointed to also has 1130 01:06:30,590 --> 01:06:34,670 a pointer to another blog post by the same guy 1131 01:06:34,670 --> 01:06:39,800 that does a pretty good job of explaining the Transformer 1132 01:06:39,800 --> 01:06:41,330 architecture. 1133 01:06:41,330 --> 01:06:43,680 It's complicated. 1134 01:06:43,680 --> 01:06:48,200 So what you get out of the multi-head attention mechanism 1135 01:06:48,200 --> 01:06:49,310 is that-- 1136 01:06:49,310 --> 01:06:53,700 here is one attention machine. 1137 01:06:53,700 --> 01:06:58,190 And for example, the colors here indicate the degree 1138 01:06:58,190 --> 01:07:01,850 to which the encoding of the word "it" 1139 01:07:01,850 --> 01:07:05,300 depends on the other words in the sentence. 1140 01:07:05,300 --> 01:07:09,860 And you see that it's focused on the animal, which makes sense. 1141 01:07:09,860 --> 01:07:14,215 Because "it," in fact, is referring 1142 01:07:14,215 --> 01:07:17,210 to the animal in this sentence. 1143 01:07:17,210 --> 01:07:21,020 Here they introduce another encoding. 1144 01:07:21,020 --> 01:07:26,210 And this one focuses on "was too tired," which is also good. 1145 01:07:26,210 --> 01:07:32,490 Because "it," again, refers to the thing that was too tired. 1146 01:07:32,490 --> 01:07:34,560 And of course, by multi-headed, they 1147 01:07:34,560 --> 01:07:37,440 mean that it's doing this many times. 1148 01:07:37,440 --> 01:07:40,200 And so you're identifying all kinds 1149 01:07:40,200 --> 01:07:45,930 of different relationships in the input sentence. 1150 01:07:45,930 --> 01:07:52,380 Well, along the same lines is this encoding called ELMo. 1151 01:07:52,380 --> 01:07:56,970 People seem to like Sesame Street characters. 1152 01:07:56,970 --> 01:08:00,090 So ELMo is based on a bi-directional LSTM. 1153 01:08:00,090 --> 01:08:02,670 So it's an older technology. 1154 01:08:02,670 --> 01:08:06,200 But what it does is, unlike Word2Vec, 1155 01:08:06,200 --> 01:08:12,000 which built an embedding for each type-- 1156 01:08:12,000 --> 01:08:17,060 so every time the word "junk" appears, 1157 01:08:17,060 --> 01:08:19,229 it gets the same embedding. 1158 01:08:19,229 --> 01:08:23,510 Here what they're saying is, hey, take context seriously. 1159 01:08:23,510 --> 01:08:26,540 And we're going to calculate a different embedding 1160 01:08:26,540 --> 01:08:32,710 for each occurrence in context of a token. 1161 01:08:32,710 --> 01:08:34,899 And this turns out to be very good. 1162 01:08:34,899 --> 01:08:38,200 Because it goes part of the way to solving 1163 01:08:38,200 --> 01:08:41,439 the word-sense disambiguation problem. 1164 01:08:41,439 --> 01:08:43,580 So this is just an example. 
1165 01:08:43,580 --> 01:08:46,899 If you look at the word "play" in GloVe, which 1166 01:08:46,899 --> 01:08:49,330 is a slightly more sophisticated variant 1167 01:08:49,330 --> 01:08:53,410 of the Word2Vec approach, you get playing, game, games, 1168 01:08:53,410 --> 01:08:57,520 played, players, plays, player, play, football, multiplayer. 1169 01:08:57,520 --> 01:09:00,390 This all seems to be about games. 1170 01:09:00,390 --> 01:09:02,740 Because probably, from the literature 1171 01:09:02,740 --> 01:09:06,130 that they got this from, that's the most common usage 1172 01:09:06,130 --> 01:09:08,350 of the word "play." 1173 01:09:08,350 --> 01:09:13,090 Whereas, using this bi-directional language model, 1174 01:09:13,090 --> 01:09:16,330 they can separate out something like, 1175 01:09:16,330 --> 01:09:18,340 "Kieffer, the only junior in the group, 1176 01:09:18,340 --> 01:09:22,550 was commended for his ability to hit in the clutch, as well as 1177 01:09:22,550 --> 01:09:24,609 his all-around excellent play." 1178 01:09:24,609 --> 01:09:27,970 So this is presumably the baseball player. 1179 01:09:27,970 --> 01:09:29,620 And here is, "They were actors who 1180 01:09:29,620 --> 01:09:33,100 had been handed fat roles in a successful play." 1181 01:09:33,100 --> 01:09:35,979 So this is a different meaning of the word play. 1182 01:09:35,979 --> 01:09:40,540 And so this embedding also has made really important 1183 01:09:40,540 --> 01:09:44,109 contributions to improving the quality of natural language 1184 01:09:44,109 --> 01:09:47,140 processing by being able to deal with the fact 1185 01:09:47,140 --> 01:09:50,620 that single words have multiple meanings not only in English 1186 01:09:50,620 --> 01:09:53,710 but in other languages. 1187 01:09:53,710 --> 01:10:00,120 So after ELMo comes BERT, which is this Bidirectional Encoder 1188 01:10:00,120 --> 01:10:02,820 Representations from Transformers. 1189 01:10:02,820 --> 01:10:07,380 So rather than using the LSTM kind of model that ELMo used, 1190 01:10:07,380 --> 01:10:10,620 these guys say, well, let's hop on the bandwagon, 1191 01:10:10,620 --> 01:10:14,790 use the Transformer-based architecture. 1192 01:10:14,790 --> 01:10:18,570 And then they introduced some interesting tricks. 1193 01:10:18,570 --> 01:10:21,510 So one of the problems with Transformers 1194 01:10:21,510 --> 01:10:25,320 is if you stack them on top of each other there 1195 01:10:25,320 --> 01:10:27,930 are many paths from any of the inputs 1196 01:10:27,930 --> 01:10:31,210 to any of the intermediate nodes and the outputs. 1197 01:10:31,210 --> 01:10:33,930 And so if you're doing self-attention, 1198 01:10:33,930 --> 01:10:38,220 you're trying to figure out where the output should 1199 01:10:38,220 --> 01:10:42,210 pay attention to the input, the answer, of course, 1200 01:10:42,210 --> 01:10:45,810 is like, if you're trying to reconstruct the input, 1201 01:10:45,810 --> 01:10:50,700 if the input is present in your model, what you will learn 1202 01:10:50,700 --> 01:10:53,250 is that the corresponding word is 1203 01:10:53,250 --> 01:10:55,950 the right word for your output. 1204 01:10:55,950 --> 01:10:58,720 So they have to prevent that from happening. 1205 01:10:58,720 --> 01:11:02,610 And so the way they do it is by masking off, 1206 01:11:02,610 --> 01:11:07,590 at each level, some fraction of the words or of the inputs 1207 01:11:07,590 --> 01:11:09,460 at that level. 
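Concretely, that masking step might look something like the sketch below. The specific fractions, masking roughly 15% of tokens, usually with a [MASK] symbol and occasionally with a random word or the original left in place, are the recipe from the BERT paper and are discussed in a moment; this is an illustration, not Google's code.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "patient", "was", "given", "aspirin", "daily"]  # toy vocabulary

def mask_tokens(tokens, mask_fraction=0.15):
    """Return (corrupted_tokens, targets): the model is trained to predict
    the original token at each selected position from the surrounding context."""
    corrupted, targets = list(tokens), {}
    n_to_mask = max(1, int(round(mask_fraction * len(tokens))))
    for idx in random.sample(range(len(tokens)), n_to_mask):
        targets[idx] = tokens[idx]
        r = random.random()
        if r < 0.8:                       # usually replace with the mask symbol
            corrupted[idx] = MASK
        elif r < 0.9:                     # sometimes inject a random other word
            corrupted[idx] = random.choice(VOCAB)
        # otherwise leave the original word in place
    return corrupted, targets

print(mask_tokens("the patient was given aspirin daily".split()))
```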
1208 01:11:09,460 --> 01:11:11,880 So what this is doing is it's a little bit 1209 01:11:11,880 --> 01:11:15,810 like the skip-gram model in Word2Vec, where it's 1210 01:11:15,810 --> 01:11:19,770 trying to predict the likelihood of some word, 1211 01:11:19,770 --> 01:11:23,100 except it doesn't know what a significant fraction 1212 01:11:23,100 --> 01:11:24,940 of the words are. 1213 01:11:24,940 --> 01:11:29,910 And so it can't overfit in the way that I was just suggesting. 1214 01:11:29,910 --> 01:11:32,820 So this turned out to be a good idea. 1215 01:11:32,820 --> 01:11:34,380 It's more complicated. 1216 01:11:34,380 --> 01:11:37,440 Again, for the details, you have to read the paper. 1217 01:11:37,440 --> 01:11:41,520 I gave both the Transformer paper and the BERT paper 1218 01:11:41,520 --> 01:11:44,010 as optional readings for today. 1219 01:11:44,010 --> 01:11:46,380 I meant to give them as required readings, 1220 01:11:46,380 --> 01:11:47,970 but I didn't do it in time. 1221 01:11:47,970 --> 01:11:50,220 So they're optional. 1222 01:11:50,220 --> 01:11:52,770 But there are a whole bunch of other tricks. 1223 01:11:52,770 --> 01:11:57,240 So instead of using words, they actually used word pieces. 1224 01:11:57,240 --> 01:12:03,690 So think about syllables and don't becomes do and apostrophe 1225 01:12:03,690 --> 01:12:06,570 t, and so on. 1226 01:12:06,570 --> 01:12:11,130 And then they discovered that about 15% of the tokens 1227 01:12:11,130 --> 01:12:15,540 to be masked seems to work better than other percentages. 1228 01:12:15,540 --> 01:12:21,720 So those are the hidden tokens that prevent overfitting. 1229 01:12:21,720 --> 01:12:26,010 And then they do some other weird stuff. 1230 01:12:26,010 --> 01:12:28,860 Like, instead of masking a token, 1231 01:12:28,860 --> 01:12:32,790 they will inject random other words from the vocabulary 1232 01:12:32,790 --> 01:12:36,810 into its place, again, to prevent overfitting. 1233 01:12:36,810 --> 01:12:39,720 And then they look at different tasks like, 1234 01:12:39,720 --> 01:12:43,020 can I predict the next sentence in a corpus? 1235 01:12:43,020 --> 01:12:44,790 So I read a sentence. 1236 01:12:44,790 --> 01:12:48,330 And the translation is not into another language. 1237 01:12:48,330 --> 01:12:52,500 But it's predicting what the next sentence is going to be. 1238 01:12:52,500 --> 01:12:56,880 So they trained it on 800 million words from something 1239 01:12:56,880 --> 01:13:02,430 called the Books corpus and about 2 and 1/2 1240 01:13:02,430 --> 01:13:06,000 million-word Wikipedia corpus. 1241 01:13:06,000 --> 01:13:07,640 And what they found was that there 1242 01:13:07,640 --> 01:13:12,360 is an enormous improvement on a lot of classical tasks. 1243 01:13:12,360 --> 01:13:15,990 So this is a listing of some of the standard tasks 1244 01:13:15,990 --> 01:13:20,980 for natural language processing, mostly not in the medical world 1245 01:13:20,980 --> 01:13:24,450 but in the general NLP domain. 1246 01:13:24,450 --> 01:13:32,280 And you see that you get things like an improvement from 80%. 1247 01:13:32,280 --> 01:13:35,880 Or even the GPT model that I'll talk about 1248 01:13:35,880 --> 01:13:39,060 in a minute is at 82%. 1249 01:13:39,060 --> 01:13:42,030 They're up to about 86%. 1250 01:13:42,030 --> 01:13:47,470 So a 4% improvement in this domain is really huge. 1251 01:13:47,470 --> 01:13:50,110 I mean, very often people publish papers 1252 01:13:50,110 --> 01:13:53,110 showing a 1% improvement. 
1253 01:13:53,110 --> 01:13:54,900 And if their corpus is big enough, 1254 01:13:54,900 --> 01:13:57,190 then it's statistically significant, 1255 01:13:57,190 --> 01:13:59,020 and therefore publishable. 1256 01:13:59,020 --> 01:14:02,590 But it's not significant in the ordinary meaning of the term 1257 01:14:02,590 --> 01:14:05,890 significant, if you're doing 1% better. 1258 01:14:05,890 --> 01:14:08,590 But doing 4% better is pretty good. 1259 01:14:08,590 --> 01:14:15,370 Here we're going from like 66% to 72% 1260 01:14:15,370 --> 01:14:17,670 from the earlier state of the art-- 1261 01:14:17,670 --> 01:14:26,410 82 to 91; 93 to 94; 35 to 60 in the CoLA task corpus 1262 01:14:26,410 --> 01:14:28,540 of linguistic acceptability. 1263 01:14:28,540 --> 01:14:32,110 So this is asking, I think, Mechanical Turk 1264 01:14:32,110 --> 01:14:36,550 people, for generated sentences, is this sentence 1265 01:14:36,550 --> 01:14:39,000 a valid sentence of English? 1266 01:14:39,000 --> 01:14:42,700 And so it's an interesting benchmark. 1267 01:14:42,700 --> 01:14:47,650 So it's producing really significant improvements 1268 01:14:47,650 --> 01:14:49,240 all over the place. 1269 01:14:49,240 --> 01:14:50,860 They trained two models of it. 1270 01:14:50,860 --> 01:14:52,750 The base model is the smaller one. 1271 01:14:52,750 --> 01:14:57,470 The large model is just trained on larger data sets. 1272 01:14:57,470 --> 01:15:01,050 Enormous amount of computation in doing this training-- 1273 01:15:01,050 --> 01:15:04,610 so I've forgotten, it took them like a month 1274 01:15:04,610 --> 01:15:08,270 on some gigantic cluster of GPU machines. 1275 01:15:08,270 --> 01:15:11,780 And so it's daunting, because you can't just 1276 01:15:11,780 --> 01:15:14,000 crank this up on your laptop and expect 1277 01:15:14,000 --> 01:15:16,018 it to finish in your lifetime. 1278 01:15:20,210 --> 01:15:23,610 The last thing I want to tell you about is this GPT-2. 1279 01:15:23,610 --> 01:15:26,780 So this is from the OpenAI Institute, 1280 01:15:26,780 --> 01:15:30,320 which is one of these philanthropically funded-- 1281 01:15:30,320 --> 01:15:33,320 I think, this one, by Elon Musk-- 1282 01:15:33,320 --> 01:15:37,910 research institute to advance AI. 1283 01:15:37,910 --> 01:15:42,900 And what they said is, well, this is all cool, but-- 1284 01:15:42,900 --> 01:15:45,260 so they were not using BERT. 1285 01:15:45,260 --> 01:15:49,520 They were using the Transformer architecture 1286 01:15:49,520 --> 01:15:53,720 but without the same training style as BERT. 1287 01:15:53,720 --> 01:15:56,780 And they said, the secret is going 1288 01:15:56,780 --> 01:16:02,930 to be that we're going to apply this not only to one problem 1289 01:16:02,930 --> 01:16:05,160 but to a whole bunch of problems. 1290 01:16:05,160 --> 01:16:08,690 So it's a multi-task learning approach that says, 1291 01:16:08,690 --> 01:16:10,880 we're going to build a better model 1292 01:16:10,880 --> 01:16:16,000 by trying to solve a bunch of different tasks simultaneously. 1293 01:16:16,000 --> 01:16:19,950 And so they built enormous models. 1294 01:16:19,950 --> 01:16:24,180 By the way, the task itself is given as a sequence of tokens. 1295 01:16:24,180 --> 01:16:26,880 So for example, they might have a task 1296 01:16:26,880 --> 01:16:31,890 that says translate to French, English text, French text. 1297 01:16:31,890 --> 01:16:36,780 Or answer the question, document, question, answer. 
1298 01:16:36,780 --> 01:16:43,400 And so the system not only learns 1299 01:16:43,400 --> 01:16:45,660 how to do whatever it's supposed to do. 1300 01:16:45,660 --> 01:16:47,990 But it even learns something about the tasks 1301 01:16:47,990 --> 01:16:52,670 that it's being asked to work on by encoding these and using 1302 01:16:52,670 --> 01:16:54,890 them as part of its model. 1303 01:16:54,890 --> 01:16:58,070 So they built four different models. 1304 01:16:58,070 --> 01:17:01,790 Take a look at the bottom one. 1305 01:17:01,790 --> 01:17:09,120 1.5 billion parameters-- this is a large model. 1306 01:17:09,120 --> 01:17:10,860 This is a very large model. 1307 01:17:13,430 --> 01:17:16,610 And so it's a byte-level model. 1308 01:17:16,610 --> 01:17:20,240 So they just said forget words, because we're trying 1309 01:17:20,240 --> 01:17:21,890 to do this multilingually. 1310 01:17:21,890 --> 01:17:25,020 And so for Chinese, you want characters. 1311 01:17:25,020 --> 01:17:29,330 And for English, you might as well take characters also. 1312 01:17:29,330 --> 01:17:32,990 And the system will, in its 1.5 billion parameters, 1313 01:17:32,990 --> 01:17:37,520 learn all about the sequences of characters that make up words. 1314 01:17:37,520 --> 01:17:39,590 And it'll be cool. 1315 01:17:39,590 --> 01:17:44,540 And so then they look at a whole bunch of different challenges. 1316 01:17:44,540 --> 01:17:48,380 And what you see is that the state of the art before they 1317 01:17:48,380 --> 01:17:54,010 did this on, for example, the Lambada data set 1318 01:17:54,010 --> 01:18:00,130 was that the perplexity of its predictions was a hundred. 1319 01:18:00,130 --> 01:18:04,300 And with this large model, the perplexity of its predictions 1320 01:18:04,300 --> 01:18:06,500 is about nine. 1321 01:18:06,500 --> 01:18:10,340 So that means that it's reduced the uncertainty of what 1322 01:18:10,340 --> 01:18:13,700 to predict next ridiculously much-- 1323 01:18:13,700 --> 01:18:16,280 I mean, by more than an order of magnitude. 1324 01:18:16,280 --> 01:18:18,920 And you get similar gains, accuracy going 1325 01:18:18,920 --> 01:18:25,700 from 59% to 63% accuracy on a-- 1326 01:18:25,700 --> 01:18:29,480 this is the children's something-or-other challenge-- 1327 01:18:29,480 --> 01:18:31,640 from 85% to 93%-- 1328 01:18:31,640 --> 01:18:37,100 so dramatic improvements almost across the board, 1329 01:18:37,100 --> 01:18:40,160 except for this particular data set, 1330 01:18:40,160 --> 01:18:42,720 where they did not do well. 1331 01:18:42,720 --> 01:18:47,880 And what really blew me away is here's 1332 01:18:47,880 --> 01:18:51,660 an application of this 1.5 billion-word model 1333 01:18:51,660 --> 01:18:56,730 that they built. So they said, OK, I give you a prompt, 1334 01:18:56,730 --> 01:18:59,490 like the opening paragraph of a Wall Street Journal 1335 01:18:59,490 --> 01:19:02,010 article or a Wikipedia article. 1336 01:19:02,010 --> 01:19:07,230 And you complete the article by using that generator idea 1337 01:19:07,230 --> 01:19:10,980 that I showed you before, that just uses the language model 1338 01:19:10,980 --> 01:19:14,520 and picks the most likely word to come next 1339 01:19:14,520 --> 01:19:17,160 and emits that as the next word. 1340 01:19:17,160 --> 01:19:20,490 So here is a prompt that says, "A train carriage containing 1341 01:19:20,490 --> 01:19:24,270 controlled nuclear materials was stolen in Cincinnati today. 1342 01:19:24,270 --> 01:19:26,070 Its whereabouts are unknown." 
1343 01:19:26,070 --> 01:19:27,450 By the way, this is made up. 1344 01:19:27,450 --> 01:19:31,230 I mean, this is not a real news article. 1345 01:19:31,230 --> 01:19:34,380 And the system comes back with a completion 1346 01:19:34,380 --> 01:19:36,900 that says, "The incident occurred on the downtown train 1347 01:19:36,900 --> 01:19:40,260 line, which runs from Covington and Ashland stations. 1348 01:19:40,260 --> 01:19:42,300 In an email to Ohio news outlets, 1349 01:19:42,300 --> 01:19:44,910 the US Department of Energy said it's 1350 01:19:44,910 --> 01:19:47,250 working with the Federal Railroad Administration 1351 01:19:47,250 --> 01:19:50,400 to find the thief," et cetera. 1352 01:19:50,400 --> 01:19:53,790 This looks astoundingly good. 1353 01:19:53,790 --> 01:19:56,650 Now, the paper from which this comes-- 1354 01:19:56,650 --> 01:19:59,220 this is actually from a blog, but they've also 1355 01:19:59,220 --> 01:20:01,560 published a paper about it-- 1356 01:20:01,560 --> 01:20:04,590 claims that these examples are not even cherry-picked. 1357 01:20:04,590 --> 01:20:09,410 If you go to that page and pick sample 1, 2, 3, 4, 5, 1358 01:20:09,410 --> 01:20:12,810 6, et cetera, you get different examples 1359 01:20:12,810 --> 01:20:15,270 that they claim are not cherry-picked. 1360 01:20:15,270 --> 01:20:17,880 And every one of them is really good. 1361 01:20:17,880 --> 01:20:21,690 I mean, you could imagine this being an actual article 1362 01:20:21,690 --> 01:20:24,090 about this actual event. 1363 01:20:24,090 --> 01:20:27,520 So somehow or other, in this enormous model, 1364 01:20:27,520 --> 01:20:30,600 and with this Transformer technology, 1365 01:20:30,600 --> 01:20:34,510 and with the multi-task training that they've done, 1366 01:20:34,510 --> 01:20:37,300 they have managed to capture so much 1367 01:20:37,300 --> 01:20:40,810 of the regularity of the English language 1368 01:20:40,810 --> 01:20:43,840 that they can generate these fake news articles based 1369 01:20:43,840 --> 01:20:48,910 on a prompt and make them look unbelievably realistic. 1370 01:20:48,910 --> 01:20:51,940 Now, interestingly, they have chosen not 1371 01:20:51,940 --> 01:20:54,400 to release that trained model. 1372 01:20:54,400 --> 01:20:57,980 Because they're worried that people will, in fact, do this, 1373 01:20:57,980 --> 01:21:02,260 and that they will generate fake news articles all the time. 1374 01:21:02,260 --> 01:21:04,360 They've released a much smaller model 1375 01:21:04,360 --> 01:21:09,010 that is not nearly as good in terms of its realism. 1376 01:21:09,010 --> 01:21:12,580 So that's the state of the art in language modeling 1377 01:21:12,580 --> 01:21:13,970 at the moment. 1378 01:21:13,970 --> 01:21:18,520 And as I say, the general domain is ahead of the medical domain. 
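The smaller released checkpoint can be tried directly; here is a sketch using the Hugging Face transformers library, which is not OpenAI's own code, and whose completions are noticeably weaker than the examples above. The lecture describes picking the most likely next word; the released demos actually sample with top-k truncation, which is what the do_sample and top_k arguments do here.

```python
from transformers import pipeline

# Loads the small released GPT-2 checkpoint (~124 million parameters),
# not the full 1.5-billion-parameter model discussed above.
generator = pipeline("text-generation", model="gpt2")

prompt = ("A train carriage containing controlled nuclear materials "
          "was stolen in Cincinnati today. Its whereabouts are unknown.")

out = generator(prompt, max_length=100, do_sample=True, top_k=40)
print(out[0]["generated_text"])
```

Even so, as noted, the general-domain models are ahead of what exists for medical text.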
1379 01:21:18,520 --> 01:21:20,530 But you can bet that there are tons 1380 01:21:20,530 --> 01:21:24,040 of people who are sitting around looking at exactly 1381 01:21:24,040 --> 01:21:27,250 these results and saying, well, we 1382 01:21:27,250 --> 01:21:29,590 ought to be able to take advantage of this 1383 01:21:29,590 --> 01:21:33,310 to build much better language models for the medical domain 1384 01:21:33,310 --> 01:21:36,670 and to exploit them in order to do phenotyping, in order 1385 01:21:36,670 --> 01:21:41,200 to do entity recognition, in order to do inference, 1386 01:21:41,200 --> 01:21:43,420 in order to do question answering, 1387 01:21:43,420 --> 01:21:47,156 in order to do any of these kinds of topics. 1388 01:21:47,156 --> 01:21:51,030 And I was talking to Patrick Winston, who 1389 01:21:51,030 --> 01:21:54,660 is one of the good old-fashioned AI people, 1390 01:21:54,660 --> 01:21:56,970 as he characterizes himself. 1391 01:21:56,970 --> 01:22:00,090 And the thing that's a little troublesome about this 1392 01:22:00,090 --> 01:22:04,770 is that this technology has virtually nothing 1393 01:22:04,770 --> 01:22:07,470 to do with anything that we understand 1394 01:22:07,470 --> 01:22:11,670 about language or about inference or about question 1395 01:22:11,670 --> 01:22:15,010 answering or about anything. 1396 01:22:15,010 --> 01:22:19,140 And so one is left with this queasy feeling that, 1397 01:22:19,140 --> 01:22:22,530 here is a wonderful engineering solution to a whole set 1398 01:22:22,530 --> 01:22:24,870 of problems, but it's unclear how 1399 01:22:24,870 --> 01:22:29,110 it relates to the original goal of artificial intelligence, 1400 01:22:29,110 --> 01:22:31,830 which is to understand something about human intelligence 1401 01:22:31,830 --> 01:22:35,160 by simulating it in a computer. 1402 01:22:35,160 --> 01:22:38,410 Maybe our BCS friends will discover 1403 01:22:38,410 --> 01:22:42,780 that there are, in fact, transformer mechanisms deeply 1404 01:22:42,780 --> 01:22:44,670 buried in our brain. 1405 01:22:44,670 --> 01:22:46,830 But I would be surprised if that turned out 1406 01:22:46,830 --> 01:22:48,960 to be exactly the case. 1407 01:22:48,960 --> 01:22:52,480 But perhaps there is something like that going on. 1408 01:22:52,480 --> 01:22:54,930 And so this leaves an interesting scientific 1409 01:22:54,930 --> 01:22:57,180 conundrum of, exactly what have we 1410 01:22:57,180 --> 01:23:02,040 learned from this type of very, very successful model building? 1411 01:23:02,040 --> 01:23:02,760 OK. 1412 01:23:02,760 --> 01:23:03,540 Thank you. 1413 01:23:03,540 --> 01:23:06,590 [APPLAUSE]