An NLP Ecosystem for Development and Use of Natural Language Processing in the Clinical Domain



Wendy W. Chapman, PhD

Division of Biomedical Informatics

University of California, San Diego

Integrating Data for Analysis, Anonymization, and Sharing

Overview


The promise of natural language processing (NLP)


Challenges of developing NLP in the clinical domain


Challenges in applying NLP in the clinical domain


iDASH


Opportunities for sharing and collaboration in NLP

NLP Success


Fresh off its butt-kicking performance on Jeopardy!, IBM's supercomputer "Watson" has enrolled in medical school at Columbia University,

New York Daily News, February 18th, 2011


Clinical NLP Since 1960s


Why has clinical NLP had little impact on clinical care?

Barriers to Development


Sharing clinical data difficult


Have not had shared datasets for development and evaluation


Modules trained on general English not sufficient



Insufficient common conventions and standards for annotations


Data sets are unique to a lab


Not easily interchangeable



Limited collaboration


Clinical NLP applications are silos and black boxes


Have not had open source applications



Reproducibility is formidable


Open source release not always sufficient


Software engineering quality not always great


Mechanisms for reproducing results are sparse

Overview


The promise of natural language processing (NLP)


Challenges of developing NLP in the clinical domain


Challenges in applying NLP in the clinical domain


Developing an NLP ecosystem on iDASH

Security & Privacy Concerns


Clinical texts have many patient identifiers


18 HIPAA identifiers


Names


Addresses


Items not regulated by HIPAA


tight end for the Steelers


Unique cases


50-year-old woman who is pregnant


Sensitive information


HIV status


Institutions are reluctant to share data

Lack of user-centered development and scalability


Perceived cost of applying NLP outweighs the perceived benefit (Len D'Avolio)



iDASH


Integrating Data for Analysis, Anonymization, and Sharing


Data

Computational Resources

Software/Tools

Disincentives to Share



Scooping by faster analysts

Exposure of potential errors in data


Resources for preparing data submissions


Maintaining data


Interacting with potential users takes time


Threat of privacy breach when human subjects are involved


Do not have policies in place


Fallible de-identification, anonymization algorithms


iDASH aims to minimize these disincentives

nlp-ecosystem.ucsd.edu


Privacy preserving

Access control

De-identification

Query counts

Artificial data generators

Digital informed consent

HIPAA &/or FISMA Compliant Cloud

Customizable DUAs

Informed Consent Registry





2011 summer internship program funded by NIH U54HL108460

NLP Ecosystem

Data

MT Samples

Tools & Services

Collaborative Development Tools

Virtual Machines

Evaluation Workbench

Education

Bibliography

Tutorials

Research

Resources

Guidelines

Schemas

De-Identification

UCSD Clinical Data

TextVect

Annotation Admin & eHOST

Registry

Tools & Services

Collaborative Knowledge Authoring

Virtual Machines

Evaluation Workbench

De-Identification

TextVect

Annotation Environment

Increase access to NLP

Decrease Burden of Developing NLP

Collaborative Effort to Build Ecosystem

Registry

orbit

Increase ability to find NLP tools

Registry: orbit.nlm.nih.gov

Len D'Avolio, Dina Demner-Fushman

De-identification service

Increase access to clinical text

De-identification


Several available de-identification modules


Need to adapt to local text


Efficient


Secure


Customizable ensemble de-identification system


Build a de-identified corpus


Incorporate existing de-id modules


Launch as virtual machine


Iterative training, evaluation, and modification by user


Correct mistakes


Add regular expressions


Brett South, Stephane Meystre, Oscar Fernandez, Danielle Mowery
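
The customizable ensemble idea above can be sketched roughly as follows. This is a minimal illustration, not the iDASH service itself: the helper names, the regular expressions, and the `[REDACTED]` mask are all assumptions made for the example. Each module returns character spans, the ensemble takes their union, and the user corrects a miss on local text by adding one more regular expression.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets
Detector = Callable[[str], List[Span]]


def regex_detector(pattern: str) -> Detector:
    """Wrap a regular expression as a de-identification module (hypothetical helper)."""
    compiled = re.compile(pattern)
    return lambda text: [m.span() for m in compiled.finditer(text)]


def merge_spans(spans: List[Span]) -> List[Span]:
    """Union overlapping or touching spans so each character is masked once."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def deidentify(text: str, detectors: List[Detector], mask: str = "[REDACTED]") -> str:
    """Run every module, take the union of flagged spans, and mask them."""
    spans = merge_spans([s for d in detectors for s in d(text)])
    out, cursor = [], 0
    for start, end in spans:
        out.append(text[cursor:start])
        out.append(mask)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


# Illustrative modules only; real de-identification modules are far more thorough.
detectors = [
    regex_detector(r"\b\d{3}-\d{2}-\d{4}\b"),      # SSN-like numbers
    regex_detector(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),  # dates such as 3/14/2011
]

note = "Seen on 3/14/2011. MRN 00123456. SSN 123-45-6789."
first_pass = deidentify(note, detectors)

# The user reviews the output, spots the missed record number, and adds a rule.
detectors.append(regex_detector(r"\bMRN \d+\b"))
second_pass = deidentify(note, detectors)
```

Taking the union of spans favors recall over precision, which is usually the safer trade-off when the goal is to avoid leaking identifiers.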

TextVect

Increase access to textual features

TextVect

Select level


Sentence


Document

Select features


Lexical: N-gram


Syntactic: Part-of-speech tags


Semantic: UMLS codes

Select output


Feature vector


Train classifier

NLM: Abhishek Kumar
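
The select-level, select-features, select-output flow can be sketched for the lexical case as below. This is a hedged illustration rather than TextVect's actual code: the tokenizer, the sentence splitter, and every function name are assumptions, and only n-gram features are shown (part-of-speech tags and UMLS codes would need external tools).

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase word tokenizer (an assumption; TextVect's own tokenizer may differ)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Contiguous n-grams joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def split_units(document: str, level: str) -> List[str]:
    """Select level: one unit per sentence, or the whole document."""
    if level == "document":
        return [document]
    if level == "sentence":
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    raise ValueError(f"unknown level: {level}")


def feature_vectors(document: str, level: str = "sentence", max_n: int = 2) -> List[Dict[str, int]]:
    """Lexical n-gram counts (n = 1..max_n) for each unit at the chosen level."""
    vectors = []
    for unit in split_units(document, level):
        tokens = tokenize(unit)
        counts: Counter = Counter()
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
        vectors.append(dict(counts))
    return vectors


report = "Patient denies chest pain. Chest pain resolved."
sentence_vectors = feature_vectors(report, level="sentence")
document_vectors = feature_vectors(report, level="document")
```

Each dictionary is a sparse feature vector that could be handed to a classifier after mapping feature names to column indices.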

collaborative Knowledge Authoring Support Service (cKASS)

Decrease the Burden of Customizing an NLP Application

Customizing an IE App


User’s Concepts

Cough

Dyspnea

Infiltrate on CXR

Wheezing

Fever

Cervical Lymphadenopathy

IE Output



Map


IE Output


Dry cough

Productive cough

Cough

Hacking cough

Bloody cough



IE Output


Temp 38.0C

Low-grade temperature



IE Output


NECK: no adenopathy


Disorder: adenopathy

Negation: negated
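
The mapping problem in these slides, where varied IE output has to be tied back to a short list of user concepts, can be sketched as below. Everything here is an assumption for illustration: the small knowledge base, the simplified NegEx-style cue list, and the 38.0 C fever cutoff are not taken from the talk.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical user knowledge base: each user concept lists surface forms to match.
USER_KB: Dict[str, List[str]] = {
    "Cough": ["cough"],
    "Fever": ["fever", "low-grade temperature"],
    "Cervical Lymphadenopathy": ["adenopathy", "lymphadenopathy"],
}

# A few negation cues checked in the text before a term (NegEx-style, much simplified).
NEGATION_CUES = re.compile(r"\b(no|denies|without)\b[^.;:]*$", re.IGNORECASE)

FEVER_THRESHOLD_C = 38.0  # illustrative cutoff, an assumption of this sketch


def map_to_concept(ie_output: str) -> Optional[Tuple[str, bool]]:
    """Return (user concept, negated?) for one IE output string, or None if unmapped."""
    text = ie_output.lower()
    # Numeric rule: a recorded temperature at or above the cutoff counts as Fever.
    temp = re.search(r"temp\s*(\d+(?:\.\d+)?)\s*c\b", text)
    if temp and float(temp.group(1)) >= FEVER_THRESHOLD_C:
        return ("Fever", False)
    for concept, terms in USER_KB.items():
        for term in terms:
            position = text.find(term)
            if position >= 0:
                negated = bool(NEGATION_CUES.search(text[:position]))
                return (concept, negated)
    return None


examples = ["Dry cough", "Hacking cough", "Temp 38.0C", "Low-grade temperature", "NECK: no adenopathy"]
mapped = {e: map_to_concept(e) for e in examples}
```

Whether "low-grade temperature" or a reading of 38.0 C should count as Fever is exactly the kind of decision the user has to make, which is why those rules live in the user's knowledge base rather than in the extraction code.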


KOS-IE

Knowledge Organization Systems for Information Extraction


Compile information helpful for IE

Radiologist

User KB

NLP Tools



Physician



Radiologist



Nurse



Clinical Researcher



Knowledge Engineer


Decision Support System

Shared KB

External KB

Collaborative Knowledge Base Development: cKASS

LQ Wang, M Conway, F Fana, M Tharp, D Hillert

Knowledge Authoring

Augment user KB with lexical variants, synonyms, and related concepts



User-driven authoring


Top-down: Provide access to external knowledge sources


UMLS, Specialist Lexicon, BioPortal


Bottom-up: Annotate to derive synonyms


Recommendation-based authoring


Generate lexical variants


Mine external knowledge sources


Mine patient records
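
Recommendation-based authoring could look roughly like the sketch below, which proposes lexical variants and externally sourced synonyms for the user to accept or reject. The synonym table stands in for a real knowledge source and its entries are illustrative; the variant rules are deliberately naive, and all names are assumptions rather than cKASS internals.

```python
from typing import Dict, Set

# Stand-in for an external knowledge source such as a synonym table exported from
# a terminology; the entries here are illustrative, not drawn from a real release.
EXTERNAL_SYNONYMS: Dict[str, Set[str]] = {
    "dyspnea": {"shortness of breath", "sob"},
    "fever": {"pyrexia", "febrile"},
}


def lexical_variants(term: str) -> Set[str]:
    """Generate a few naive surface variants (case, hyphenation, simple plural)."""
    base = term.lower().strip()
    variants = {base, base.replace("-", " "), base.replace(" ", "-")}
    variants |= {v + "s" for v in list(variants) if not v.endswith("s")}
    return variants


def recommend(user_kb: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Suggest new terms per concept; the user accepts or rejects each suggestion."""
    suggestions: Dict[str, Set[str]] = {}
    for concept, terms in user_kb.items():
        candidates: Set[str] = set()
        for term in terms:
            candidates |= lexical_variants(term)
            candidates |= EXTERNAL_SYNONYMS.get(term.lower(), set())
        known = {t.lower() for t in terms}
        suggestions[concept] = candidates - known
    return suggestions


user_kb = {"Dyspnea": {"dyspnea"}, "Fever": {"fever", "low-grade temperature"}}
suggested = recommend(user_kb)
```

Naive rules produce some poor candidates, so the suggestions are meant to be reviewed by a person rather than added to the knowledge base automatically.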




Evaluation workbench

Decrease the Burden of Evaluation & Error Analysis

Evaluation Workbench


Compare the output of two NLP annotators on clinical text


NLP system vs. human annotation


View annotations


Calculate outcome measures


Drill down to all levels of annotation


Document-level


Perform error analysis


Future versions will support formal error analysis
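
Comparing two annotators and calculating outcome measures can be sketched as below. This is an assumption-laden illustration, not the workbench's implementation: annotations are reduced to tuples, matching is exact on span and classification, and the measures shown are precision, recall, and F1.

```python
from typing import Dict, Set, Tuple

# An annotation is (document id, start offset, end offset, classification).
Annotation = Tuple[str, int, int, str]


def outcome_measures(reference: Set[Annotation], system: Set[Annotation]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 of `system` against `reference`."""
    true_pos = len(reference & system)
    false_pos = len(system - reference)
    false_neg = len(reference - system)
    precision = true_pos / (true_pos + false_pos) if system else 0.0
    recall = true_pos / (true_pos + false_neg) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"tp": true_pos, "fp": false_pos, "fn": false_neg,
            "precision": precision, "recall": recall, "f1": f1}


def errors(reference: Set[Annotation], system: Set[Annotation]) -> Dict[str, Set[Annotation]]:
    """Disagreements to drill into during error analysis."""
    return {"false_positives": system - reference, "false_negatives": reference - system}


human = {("doc1", 10, 20, "Cough"), ("doc1", 35, 40, "Fever"), ("doc2", 5, 13, "Dyspnea")}
nlp = {("doc1", 10, 20, "Cough"), ("doc2", 5, 13, "Wheezing")}
scores = outcome_measures(human, nlp)
disagreements = errors(human, nlp)
```

Here the span in `doc2` is found by both annotators but classified differently, so exact matching counts it once as a false positive and once as a false negative; a more lenient overlap-based match would be a reasonable alternative.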


Levels of Annotation


Document


Report classified as Shigellosis


Group


Section classified as Past Medical History Section


Utterance


Group of text classified as Sentence


Snippet


“chest pain” classified as CUI 058273



Word


“pain” classified as noun



Token


“.” classified as EOS marker




Document & annotations

Outcome Measures for Selected Annotations

Select Classifications to View

Report List

Attributes for Selected Annotation

Relationships for Selected Annotation

VA and ONC SHARP: Christensen, Murphy, Frabetti, Rodriguez, Savova
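
The levels of annotation listed for the Evaluation Workbench (Document, Group, Utterance, Snippet, Word, Token) suggest a nested structure that supports drilling down. The sketch below is one possible representation under stated assumptions: the class, its fields, and the sample text are hypothetical and do not reflect the workbench's actual data model.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

LEVELS = ["Document", "Group", "Utterance", "Snippet", "Word", "Token"]


@dataclass
class Annotation:
    level: str            # one of LEVELS
    start: int            # character offsets into the report text
    end: int
    classification: str
    children: List["Annotation"] = field(default_factory=list)

    def add(self, child: "Annotation") -> "Annotation":
        """Attach a finer-grained annotation that lies inside this one."""
        if LEVELS.index(child.level) <= LEVELS.index(self.level):
            raise ValueError("child must be at a finer level than its parent")
        if not (self.start <= child.start and child.end <= self.end):
            raise ValueError("child span must fall inside the parent span")
        self.children.append(child)
        return child

    def drill_down(self, level: str) -> Iterator["Annotation"]:
        """Yield every descendant annotation at the requested level."""
        for child in self.children:
            if child.level == level:
                yield child
            yield from child.drill_down(level)


text = "PMH: chest pain."
document = Annotation("Document", 0, len(text), "Report")
section = document.add(Annotation("Group", 0, len(text), "Past Medical History Section"))
sentence = section.add(Annotation("Utterance", 5, 16, "Sentence"))
snippet = sentence.add(Annotation("Snippet", 5, 15, "chest pain concept"))
snippet.add(Annotation("Word", 11, 15, "noun"))
sentence.add(Annotation("Token", 15, 16, "EOS marker"))
words = [text[a.start:a.end] for a in document.drill_down("Word")]
```

Keeping character offsets at every level lets two annotators' output be aligned and compared at whichever granularity the evaluation calls for.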

Annotation Environment

Decrease the Burden of Annotation

Challenges to Annotating


Time consuming


Recruiting & training annotators for high agreement


Expensive


Domain experts especially expensive


Need for annotation by multiple people


Challenging to design annotation task


How many annotators?


How should I quantify quality of annotations?


Logistically challenging


Managing files and batches of reports


Setting up annotation tool


Reinventing the wheel


Hasn’t someone created a schema for this before?
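
On the question of how to quantify annotation quality, one common choice for two annotators labeling the same items is Cohen's kappa, which corrects raw agreement for chance. The slides do not name a measure, so treat the sketch below as one option among several; the labels and data are made up for illustration.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["present", "present", "absent", "absent", "present", "absent", "present", "absent"]
annotator_2 = ["present", "absent", "absent", "absent", "present", "absent", "present", "present"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this example the annotators agree on 6 of 8 items (0.75 observed) with 0.5 agreement expected by chance, giving a kappa of 0.5. For span-level tasks, where there is no fixed set of items to label, F1 between annotators is often used instead.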



How can we reduce the burden of annotation?


iDASH Annotation Environment

Annotation Admin

eHOST

Web application

iDASH cloud


Client app on your computer


VA, SHARP, and NIGMS : S Duvall, B South, G Savova, N Elhadad, H Hochheiser

Goal: provide an environment to decrease the burden of annotation for research and application

Annotator Registry



Enlist for annotation


Certify for annotation tasks


Personal health information


Part-of-speech tagging


UMLS mapping


Set pay rate



Searchable


Available for inclusion in new annotation task

http://idash.ucsd.edu/nlp-annotator-registry



Annotation Admin: Intended Users & Uses

Users


NLP researchers


Annotation administrators

Uses


Manage annotation projects


who annotates what


Currently done with hundreds of files on hard drive


Integrate with annotation tool (eHOST)


Download batches of raw reports to annotators


Upload and store annotated reports


Manage simple annotation projects


Facilitate distributed annotation
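
Deciding who annotates what, with each batch going to more than one annotator, can be sketched as a simple rotation. This is a hypothetical illustration of the bookkeeping, not Annotation Admin's logic; the function name, parameters, and report identifiers are all assumptions.

```python
from itertools import cycle
from typing import Dict, List


def assign_batches(report_ids: List[str], annotators: List[str],
                   batch_size: int, annotators_per_batch: int = 2) -> Dict[str, List[List[str]]]:
    """Split reports into batches and give each batch to several annotators in rotation."""
    if annotators_per_batch > len(annotators):
        raise ValueError("not enough annotators for the requested overlap")
    batches = [report_ids[i:i + batch_size] for i in range(0, len(report_ids), batch_size)]
    assignments: Dict[str, List[List[str]]] = {name: [] for name in annotators}
    rotation = cycle(annotators)
    for batch in batches:
        # Consecutive draws from the rotation are distinct annotators.
        for _ in range(annotators_per_batch):
            assignments[next(rotation)].append(batch)
    return assignments


reports = [f"report_{i:03d}" for i in range(1, 11)]  # ten made-up report ids
plan = assign_batches(reports, ["ann_a", "ann_b", "ann_c"], batch_size=5)
```

Assigning every batch to at least two people is what makes agreement measurement possible later, at the cost of doubling the annotation effort.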


1. Assign annotators to a task

Annotation Admin

2. Create a Schema

3. Assign users and set time expectations

4. Keep track of progress

Tools & Services

Collaborative Knowledge Authoring

Virtual Machines

Evaluation Workbench

De-Identification

TextVect

Annotation Environment

Increase access to NLP

Decrease Burden of Developing NLP

Collaborative Effort to Build Resources

Registry

Conclusion


More demand for EHR data


NLP has potential to extend value of narrative clinical reports


There have been many barriers


To development


To deployment


Recent developments facilitate collaboration & sharing


Common annotation conventions


Privacy algorithms


Shared datasets


Hosted environments


iDASH hopes to facilitate


Development of NLP


Application of NLP

Questions | Discussion

Division of Biomedical Informatics

University of California, San Diego

Integrating Data for Analysis, Anonymization, and Sharing

wwchapman@ucsd.edu

iDASH/ShARe Workshop on Annotation

September 29, 2012

La Jolla, CA