Core research questions for natural language processing of clinical text

24 Oct 2013

Noémie Elhadad
noemie@dbmi.columbia.edu
NLP’s promise for medicine and health

- Increasingly large amounts of text
  - Clinical literature in PubMed
  - Patients & health consumers talking online
  - Patient notes in the Electronic Health Record (EHR)
- Natural language processing
  - Leverages the information from natural text (with all of its noise, idiosyncrasies, and ambiguities) into a format amenable to computing
  - Supports clinical discovery
  - Provides tailored access to relevant information and helps with decision making
NLP’s promise for medicine and health

- Clinical discovery
  - Discover patterns about patient populations, their diseases, and their treatments based on their records
- Clinical decision making
  - Extract information from a patient’s record as input to decision support systems
- Improved workflow for Electronic Health Record users
  - Intelligent search
  - Question answering
  - Summarization of longitudinal records
How far away are we?

- NLP in the general domain
  - Spam filtering, machine translation, sentiment analysis, Siri, Watson, …
  - Linguistically informed, data-driven approaches
  - Shared corpora for training and testing
- Clinical NLP
  - Some success stories
  - Still many challenges and open research questions

Outline

- What is in the EHR (and isn’t free text going away anyway)?
- Towards natural language understanding
- The EHR and the Truth – it’s complicated
- Information redundancy
Information captured in the EHR

- What is captured?
  - Diagnoses (billing codes)
  - Laboratory results (time series)
  - Medications (mix of free text and codes)
  - Reports (structured documents)
  - Notes from physicians, nurses, social workers, etc.
- Why is it captured?
  - Clinical purposes
  - Billing purposes
  - Legal purposes
  - Compliance purposes
Narrative vs. structured data

- Ongoing debate in health IT
- Even “simple” phenotypic information is not trivial to encode in a structured format
  - Smoking status of a patient
- Mismatch between structured data and narratives
  - Diagnoses
  - Medications
  - Labs
  - At NewYork-Presbyterian (Eclipsys data)
Outline

- What is in the EHR (and isn’t free text going away anyway)?
- Towards natural language understanding
- The EHR and the Truth – it’s complicated
- Information redundancy
Towards natural language understanding

- Information extraction
  - Locates and structures specific information in text
  - Can help answer questions like “how many patients in my institution come in with shortness of breath on any given day?”

43 yo female with history of GERD woke up w/ SOB and LUE discomfort 1 day PTA. She presented to [**Hospital2 72**] where she was ruled out for MI by enzymes. She underwent stress test the following day at [**Hospital2 72**]. She developed SOB and shoulder pain during the test.
Information extraction (1)

- Named entity recognition
  - Conditions and symptoms, diagnostic procedures and laboratory tests, therapeutic procedures, …
- How to determine that “GERD” is a condition, but “stress test” is a diagnostic procedure?
  Rely on terminologies and ontologies (UMLS)
- How to determine that “LUE discomfort” is a symptom, even though it is not in the UMLS?
  Rely on context and syntax

43 yo female with history of *GERD* woke up w/ *SOB* and *LUE discomfort* 1 day PTA. She presented to [**Hospital2 72**] where she was ruled out for *MI* by *enzymes*. She underwent *stress test* the following day at [**Hospital2 72**]. She developed *SOB* and *shoulder pain* during the *test*. (recognized entities marked with asterisks)
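Dictionary lookup against a terminology is the usual first step for this kind of named entity recognition. A minimal sketch, with a hypothetical four-entry lexicon standing in for the full UMLS:

```python
import re

# Hypothetical miniature lexicon; a real system queries the UMLS.
LEXICON = {
    "gerd": "condition",
    "sob": "symptom",
    "mi": "condition",
    "stress test": "diagnostic procedure",
}

def find_entities(text):
    """Return (surface form, entity type) pairs for lexicon terms in text."""
    hits = []
    lowered = text.lower()
    for term, etype in LEXICON.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((text[m.start():m.end()], etype))
    return hits

note = "43 yo female with history of GERD woke up w/ SOB. She underwent a stress test."
print(find_entities(note))
```

Terms like “LUE discomfort” that are absent from the lexicon are exactly what this approach misses, which is why context- and syntax-based recognizers are needed on top of it.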
Information extraction (2)

- Normalize the entities (map each to a semantic concept in an ontology)
  - “GERD” → C0017168 ‘Gastroesophageal reflux disease’; “stress test” → C0015260 ‘Exercise Stress Test’
- How to determine that PTA is not mapped to Post Traumatic Amnesia, Percutaneous Transluminal Angioplasty, or Parent Teacher Association?
- How to determine that “enzymes” maps to ‘Clinical Enzyme Test’, but not ‘Enzyme’?
  Rely on the context (e.g., “1 day” and “MI by”)
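Normalization can be sketched as scoring each candidate sense of an ambiguous string against the surrounding words. The sense inventory and cue words below are illustrative assumptions, not actual UMLS content:

```python
# Hypothetical sense inventory for "PTA"; a real system would draw
# candidates from the UMLS and use richer context models.
SENSES = {
    "PTA": {
        "prior to admission": {"day", "week", "presented", "history"},
        "percutaneous transluminal angioplasty": {"stent", "artery", "catheter"},
    }
}

def disambiguate(abbrev, context_words):
    """Pick the sense whose cue words overlap the context the most."""
    best, best_score = None, -1
    for sense, cues in SENSES[abbrev].items():
        score = len(cues & set(context_words))
        if score > best_score:
            best, best_score = sense, score
    return best

context = ["1", "day", "pta", "she", "presented"]
print(disambiguate("PTA", context))  # → prior to admission
```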
Information extraction (3)

- Identify modifiers associated with the entities
  - “GERD” has one temporal modifier, “history of”
  - “SOB” has one temporal modifier, “1 day PTA”
  - “LUE discomfort” has one temporal modifier, “1 day PTA”, and one anatomical modifier, “LUE”
  - “MI” has one negation modifier, “ruled out”

43 yo female with *history of* GERD woke up w/ SOB and *LUE* discomfort *1 day PTA*. She presented to [**Hospital2 72**] where she was *ruled out* for MI by enzymes. She underwent stress test the *following day* at [**Hospital2 72**]. She developed SOB and shoulder pain during the test. (modifier spans marked with asterisks)
Information extraction (3)

- Challenges and solutions
  - How to determine the types of modifiers of interest?
    Task dependent: some are obvious, some not always needed
  - How to recognize them in text?
    Rely on syntax and context
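The “rely on syntax and context” answer is what trigger-phrase algorithms in the spirit of NegEx implement. A much-simplified sketch (the trigger lists here are illustrative and far shorter than real ones):

```python
# Illustrative trigger lists; NegEx-style systems use many more phrases
# plus rules that terminate a trigger's scope.
NEGATION_TRIGGERS = ["ruled out for", "no evidence of", "denies"]
TEMPORAL_TRIGGERS = ["history of"]

def modifiers(entity, sentence):
    """Return (modifier type, trigger) pairs whose trigger precedes the entity."""
    prefix = sentence.lower().split(entity.lower())[0]
    found = [("negation", t) for t in NEGATION_TRIGGERS if t in prefix]
    found += [("temporal", t) for t in TEMPORAL_TRIGGERS if t in prefix]
    return found

print(modifiers("MI", "She was ruled out for MI by enzymes."))
# → [('negation', 'ruled out for')]
```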
Information extraction (4)

- Identify relations among the entities
  - “enzymes” is the test used to rule out “MI”
  - “SOB” occurred during “stress test”
  - “stress test” and “test” refer to the same entity
Information extraction (4)

- Challenges and solutions
  - How to determine the types of relations?
    Task dependent
  - How to recognize them in text?
    Use context (patterns “X during Y”, “ruled out for X by Y”)
  - How to identify them when they are not explicit in text (e.g., pharmacovigilance)?
    Use statistical models to find patterns of occurrences
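The surface patterns quoted above (“X during Y”, “ruled out for X by Y”) can be sketched as regular expressions; the relation labels are hypothetical, and real systems combine many such patterns with statistical models:

```python
import re

# Two illustrative patterns matching the slide's examples.
PATTERNS = [
    (re.compile(r"ruled out for (\w+) by (\w+)"), "tested_by"),
    (re.compile(r"developed (\w+) .* during the (\w+)"), "occurred_during"),
]

def extract_relations(text):
    """Return (arg1, relation, arg2) triples found by the patterns."""
    rels = []
    for pattern, label in PATTERNS:
        for m in pattern.finditer(text.lower()):
            rels.append((m.group(1), label, m.group(2)))
    return rels

note = ("She was ruled out for MI by enzymes. "
        "She developed SOB and shoulder pain during the test.")
print(extract_relations(note))
# → [('mi', 'tested_by', 'enzymes'), ('sob', 'occurred_during', 'test')]
```

The coreference between “stress test” and “test” is the part such patterns cannot resolve on their own; that requires linking entity mentions across the note.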
Towards robust information extraction

- Shared annotated lexical resources needed
  - Access to shared clinical texts is difficult
- Some task-specific annotated corpora
  - i2b2 challenges, computational medicine challenge, …
- Task-independent multi-layered annotations (ongoing)
  - ShARe – MIMIC II notes annotated for POS, chunks, disorder mentions, and 13 modifiers
  - SHARPn – clinical element models
  - Temporal relations following ISO TimeML
  - Overall, 1.5M tokens, several institutions, rich annotation
Outline

- What is in the EHR (and isn’t free text going away anyway)?
- Towards natural language understanding
- The EHR and the Truth – it’s complicated
- Information redundancy
Allegory of the cave: the EHR

- Uncontrolled data source
  - From the patient
  - From the clinician

Allegory of the cave: the EHR

- Sometimes erroneous
  - Mistakes
  - Rounding
- Not enough information
  - Important data is not captured by the EHR
  - Especially true for conditions whose phenotype is unknown
  - Healthy patients are not well represented
- Sparse
  - “Captive population”
  - Data missing not at random (healthy patients are not measured often)
- Too much information
  - For a given task, lots of irrelevant information
  - Information redundancy – copy and paste
Implications on methods

- Learning from large clinical corpora, especially longitudinal records
- Disentangle the biases of the EHR from clinical patterns
- (This is true for non-linguistic data in the EHR as well)
Outline

- What is in the EHR (and isn’t free text going away anyway)?
- Towards natural language understanding
- The EHR and the Truth – it’s complicated
- Information redundancy
Information redundancy in the EHR

- “Copy and paste”
  - Clinicians copy chunks of text from a previous note into the current note
  - No new information since last note
- Effect on documentation quality
  - Error propagation
  - Notes become incoherent over time
- What is the effect on text mining?
Information redundancy in the EHR

- Scenario of use: phenotyping / disease modeling
  - Look for terms that are representative of a patient cohort (collocation identification)
  - Discover clusters of terms that co-occur often in patients’ notes (topic model)
- Corpus: 1,604 patients with CKD stage 3
  - 22,564 notes (three note types only), spanning 1–10 years
  - 6.1M tokens, 140K vocabulary
  - 600K UMLS concept tokens, 7K concept vocabulary
  - High redundancy at the string level
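Collocation identification over such a corpus can be sketched with pointwise mutual information on adjacent word pairs; the toy token list below stands in for real CKD notes:

```python
import math
from collections import Counter

def pmi_collocations(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), c in bigrams.items():
        if c >= min_count:
            # PMI = log P(w1,w2) / (P(w1) * P(w2))
            scores[(w1, w2)] = math.log(
                (c / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
    return scores

tokens = ("chronic kidney disease stage three chronic kidney disease "
          "stage four patient stable patient seen").split()
print(sorted(pmi_collocations(tokens)))
# → [('chronic', 'kidney'), ('disease', 'stage'), ('kidney', 'disease')]
```

The redundancy problem shows up here directly: copy-pasted text inflates bigram counts and makes spurious pairs look like strong collocations.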
Example of topic modeling (LDA) output on a corpus of CKD patients:

      Topic 1    Topic 2          Topic 3
  1.  renal      htn              pulm
  2.  ckd        lisinopril       pulmonary
  3.  cr         hctz             ct
  4.  kidney     bp               chest
  5.  appt       lipitor          copd
  6.  lasix      asa              lung
  7.  disease    date             pfts
  8.  anemia     amlodipine       sob
  9.  pth        ldl              cough
 10.  iv         hpl              pna
 11.  mi         hl               sputum
 12.  chronic    buttock          severe
 13.  rf         trazodone        defect
 14.  gfr        nephrolithiasis  findings
 15.  missed     uncontrolled     prevacid
 16.  refer      adrenal          fev
 17.  pap        erythema         long
 18.  secondary  metoprolol       ace
 19.  pmd        drugs            pack
 20.  itching                     spiriva
LDA behavior on “traditional” corpus

- Wall Street Journal corpus
  - 400 documents
  - 600 documents
  - 1,300 documents
- Train LDA on the three subsets and compare their performance on the same withheld WSJ subset
  - The higher the log-likelihood, the more successful the model is at modeling the content of a corpus
  - Given two models on the same data, the one with the lower number of topics has better explanatory power (fewer latent variables are needed to explain the data)
LDA behavior on “traditional” corpus

- More data is better (1,300 vs. 400 documents)
- Same shape

[Figure: log-likelihood of model fit vs. number of topics (0–400) for WSJ-400, WSJ-600, and WSJ-1300]
LDA on “non-traditional” corpus

- What happens when we introduce redundancy in the Wall Street Journal?
  - WSJx2: 2,600 documents
  - WSJx3: 3,900 documents
  - WSJx5: 3,309 documents; every document is sampled randomly between 1 and 5 times (most similar to the EHR corpus)
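The WSJx5 construction above (each document sampled between 1 and 5 times) can be sketched as follows; the corpus here is just placeholder strings:

```python
import random

def inject_redundancy(docs, max_copies=5, seed=0):
    """Replicate each document a random 1..max_copies times,
    mimicking copy-and-paste redundancy in EHR notes."""
    rng = random.Random(seed)
    out = []
    for doc in docs:
        out.extend([doc] * rng.randint(1, max_copies))
    return out

docs = [f"doc-{i}" for i in range(1300)]
redundant = inject_redundancy(docs)
print(len(docs), len(redundant))  # roughly a 3x inflation on average
```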
LDA on “non-traditional” corpus

- Adding redundancy worsens the models (WSJx5 is almost as bad as WSJ-600)

[Figure: log-likelihood of model fit vs. number of topics (0–400) for WSJ-600, WSJ-1300, WSJx2, WSJx3, and WSJx5]
Information redundancy in the EHR

- Redundancy has a negative impact on topic modeling
- Similar experiments with collocation extraction
  - Redundant EHR and synthetic WSJ corpora have more collocations identified
  - But their quality is dubious
- Mitigation strategies should take into account that several notes come from the same patient
  - Don’t treat each note independently of the others
  - But don’t remove redundancy artificially, because redundant information is still meaningful
  - Fingerprinting, RedLDA

LDA, Fingerprinting, RedLDA on EHR corpus
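Fingerprinting for redundancy detection can be sketched as word-level shingling plus Jaccard overlap between note pairs; the two toy notes below are illustrative:

```python
def shingles(text, k=5):
    """Set of overlapping k-word shingles: the note's 'fingerprint'."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Overlap between two fingerprints; high values flag copy-paste."""
    return len(a & b) / len(a | b)

note1 = "patient with chronic kidney disease stage three seen today for follow up"
note2 = "patient with chronic kidney disease stage three seen today new anemia noted"
print(round(jaccard(shingles(note1), shingles(note2)), 2))  # → 0.45
```

Note pairs above a similarity threshold can then be down-weighted (or, as in RedLDA, modeled explicitly) rather than simply deleted, since redundant information is still meaningful.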
Conclusions

- The EHR, and its narrative part in particular, is a gold mine of information
- The infrastructural challenges to opening data to researchers are being addressed (slowly, but surely)
- The scientific challenges are concrete ones, beyond “get more data and train better models”
  - Better understanding of the characteristics of the EHR
  - Novel methods that do not violate the assumptions of the EHR
Thank you!

Sharon Lipsky Gorman, Karthik Natarajan, Rimma Pivovarov, Adler Perotte, David Albers, George Hripcsak @ Columbia University
Raphael Cohen, Iddo Aviram, Michael Elhadad @ Ben Gurion University

R01 LM 010027 (Elhadad) – An NLP approach to generating patient record summaries
NLM HHS (Elhadad) – Causal inference on narrative and structured temporal data to augment discovery and care
R01 GM 090187 (Chapman, Elhadad, Savova) – Annotation, development and evaluation for clinical information extraction