SPOT - HISB 2012


Uploaded 30 Oct 2013


IBM Research

© 2012 IBM Corporation


SPOT the drug!

An unsupervised pattern matching method to extract drug names from very large clinical corpora

Anni Coden, Dan Gruhl, Neal Lewis,
Michael Tanenblatt, Joe Terdiman M.D.


The challenge: identifying drugs in clinical records

[Figure: pipeline diagram with the labels "clinical records", "discovery algorithm", "complete drug dictionary (including misspellings)", "disambiguation within context", and "drug mentions in clinical record"]

Challenges


Large amount of messy input data


Discovery and disambiguation are difficult and expensive


Inherently iterative process




SPOT meets the challenge


Messy data


SPOT does not rely on Natural Language Processing


Side effect: language independent


Large amount of data


SPOT’s lookup algorithm is fast (separate paper)


Single 3GHz core, 32GB RAM, 1 TB storage


1 GB every 3 min 20 sec


24 3GHz cores, 200GB of RAM, 2 TB storage


1 GB every 8.5 seconds
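The two throughput figures above are easier to compare as rates. The short calculation below is only arithmetic on the numbers from the slide; the decimal convention of 1 GB = 1000 MB is an assumption, since the slide does not say which convention was used.

```python
# Timings taken from the slide; the rest is plain arithmetic.
MB_PER_GB = 1000  # assumption: decimal gigabytes

single_core_seconds = 3 * 60 + 20  # 1 GB every 3 min 20 sec on one 3GHz core
multi_core_seconds = 8.5           # 1 GB every 8.5 seconds on 24 cores

single_core_mb_s = MB_PER_GB / single_core_seconds
multi_core_mb_s = MB_PER_GB / multi_core_seconds
speedup = single_core_seconds / multi_core_seconds

print(f"single core: {single_core_mb_s:.1f} MB/s")
print(f"24 cores:    {multi_core_mb_s:.1f} MB/s")
print(f"speedup:     {speedup:.1f}x")
```

This works out to about 5 MB/s on a single core and about 118 MB/s on 24 cores, a speedup of roughly 23.5x, which is close to linear in the number of cores.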


Discovery and disambiguation


Uses the large amount of available data (1.7 million progress notes / 3 GB) to discover and disambiguate in a single step, with some post-filtering



What is a drug?


Presc: Prescription medications


OTC: Over-the-counter medications (regulated in
many countries for patient safety)


Supp: A substance (not within Presc or OTC)
taken with the intent of improving the health of a
patient (e.g. vitamin, herbal supplement)


Other: A substance taken without the intent of
improving the health of a patient (e.g. tobacco,
alcohol, illicit drugs, foods)



SPOT: the context matching algorithm


Clinical record text:
“The patient was on aspirin 325 mg and also on
Zocor 20 mg once a day.”


Normalize text: “The patient was on aspirin <number> <dosage> and
also on Zocor <number> <dosage> once a day.”


Use seed dictionary (RxNorm) and identify drug mentions (here: aspirin, Zocor):
“The patient was on aspirin <number> <dosage> and also on Zocor <number> <dosage> once a day.”


Identify contexts within a window (e.g. window size 3) and replace the drug
mention with a wildcard


patient was on * <number> <dosage> and


and also on * <number> <dosage> once


Apply contexts to large training corpus


Result is a large set of contexts and words replacing the wild cards
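The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SPOT implementation: the tokenizer, the list of dosage units, the two-entry seed set standing in for RxNorm, and the function names `normalize`, `extract_contexts`, `format_context` and `apply_context` are all hypothetical, and the wildcard is assumed to match exactly one token.

```python
import re

SEED = {"aspirin", "zocor"}          # toy stand-in for the RxNorm seed dictionary
DOSAGE_UNITS = {"mg", "mcg", "ml"}   # assumed unit list, not from the slides

def normalize(text):
    """Tokenize and replace numbers and dosage units with placeholders."""
    tokens = re.findall(r"[A-Za-z]+|\d+(?:\.\d+)?", text)
    out = []
    for tok in tokens:
        if tok[0].isdigit():
            out.append("<number>")
        elif tok.lower() in DOSAGE_UNITS:
            out.append("<dosage>")
        else:
            out.append(tok)
    return out

def extract_contexts(tokens, seed, window=3):
    """Return (left, right) token windows around every seed-dictionary hit."""
    contexts = []
    for i, tok in enumerate(tokens):
        if tok.lower() in seed:
            left = tuple(tokens[max(0, i - window):i])
            right = tuple(tokens[i + 1:i + 1 + window])
            contexts.append((left, right))
    return contexts

def format_context(context):
    """Render a context with '*' in place of the drug mention."""
    left, right = context
    return " ".join(left + ("*",) + right)

def apply_context(tokens, context):
    """Return every single token that fills the wildcard of the context."""
    left, right = context
    hits = []
    for i in range(len(left), len(tokens) - len(right)):
        if (tuple(tokens[i - len(left):i]) == left
                and tuple(tokens[i + 1:i + 1 + len(right)]) == right):
            hits.append(tokens[i])
    return hits

record = "The patient was on aspirin 325 mg and also on Zocor 20 mg once a day."
contexts = extract_contexts(normalize(record), SEED, window=3)
for c in contexts:
    print(format_context(c))

# Apply the first learned context to an unseen sentence.
unseen = normalize("The patient was on prestiq 50 mg and felt better.")
print(apply_context(unseen, contexts[0]))
```

On the example record this reproduces the two contexts shown on the slide, and applying the first context to the unseen sentence returns the candidate term "prestiq".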




SPOT: the context matching algorithm


Score contexts and words using the following metrics


Confidence of a context is defined as the percentage of the time that a
unique element from the seed dictionary is surrounded by the
pattern. Note that this differs from a confidence measure that
counts all occurrences.


Support of a context is defined as the number of times the context
occurs in the training corpus.


Prevalence of a candidate dictionary term is defined as the number
of unique patterns in which the term is found in the training corpus.


New dictionary terms are identified from contexts whose
confidence and support exceed user-chosen thresholds; the terms
themselves must have a prevalence above a user-chosen threshold.


Apply iteratively (with cutoff threshold)
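The three metrics and the threshold filter can be illustrated with a small Python sketch. It is one possible reading of the definitions above, not the authors' code: confidence is interpreted here as the share of unique wildcard fillers of a context that are in the seed dictionary, and the input format (a list of `(context, term)` matches), the function names `score` and `new_terms`, and the toy data are all assumptions.

```python
from collections import Counter, defaultdict

def score(matches, seed):
    """Compute confidence and support per context, prevalence per term."""
    support = Counter()               # context -> number of occurrences
    fillers = defaultdict(set)        # context -> unique terms seen in the wildcard
    patterns_of = defaultdict(set)    # term -> unique contexts it appeared in
    for context, term in matches:
        term = term.lower()
        support[context] += 1
        fillers[context].add(term)
        patterns_of[term].add(context)
    # Assumed reading: unique seed terms / unique terms, not raw occurrence counts.
    confidence = {c: len(terms & seed) / len(terms) for c, terms in fillers.items()}
    prevalence = {t: len(cs) for t, cs in patterns_of.items()}
    return confidence, support, prevalence

def new_terms(matches, seed, min_conf, min_support, min_prev):
    """Return non-seed terms from good contexts with enough prevalence."""
    confidence, support, prevalence = score(matches, seed)
    good = {c for c in support
            if confidence[c] > min_conf and support[c] > min_support}
    found = set()
    for context, term in matches:
        term = term.lower()
        if context in good and term not in seed and prevalence[term] > min_prev:
            found.add(term)
    return found

# Toy matches: (context, term that filled the wildcard)
seed = {"aspirin", "zocor"}
matches = [
    ("was on * <number> <dosage>", "aspirin"),
    ("was on * <number> <dosage>", "zocor"),
    ("was on * <number> <dosage>", "prestiq"),
    ("was on * <number> <dosage>", "aspirin"),
    ("also on * <number> <dosage>", "zocor"),
    ("also on * <number> <dosage>", "prestiq"),
    ("also on * <number> <dosage>", "aspirin"),
    ("patient has * and", "fever"),
    ("patient has * and", "aspirin"),
]
print(new_terms(matches, seed, min_conf=0.6, min_support=2, min_prev=1))
```

In the toy data, "prestiq" is accepted because it appears in two contexts that both pass the confidence and support thresholds, while "fever" is rejected because its only context has low confidence. An iterative run would add the accepted terms to the seed set and repeat until the cutoff is reached; that loop is left out here.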





SPOT: the results


Training corpus: 100K progress notes out of 1.7 million notes


Window size: 6


Seed dictionary: subset of RxNorm


Executed SPOT on 1.6 million notes:


confidence: 68%, support 3







Applied contexts -> SPOTRx dictionary



Performance results

Reference corpus annotated by domain expert


Examples


MetaMap did not identify correctly:


“Feels vicodin is not helping.”


"Last use <TIME> ago - <DATE> Hallucinogenics - xl only; <DATE> Inhalants - Denies any use. Club drugs (ecstasy, GHB, poppers, etc) - Denies any use"


SPOT identified the following misspelled drugs


“prestig, prestiq” (pristiq®),


“peg-ifn” (peginterferon),


“PEG” (polyethylene glycol),


“tgel” (Neutrogena T/Gel®).


Conclusion


SPOT can be run on an ongoing basis (e.g. monthly) in
big institutions to stay “current”


SPOT is fast, (nearly) language-independent and fairly accurate


SPOT’s accuracy could be improved with language-dependent post-processing

Hopefully everywhere soon!