

AUTOMATED CLASSIFICATION OF THE NARRATIVE OF
MEDICAL REPORTS USING NATURAL LANGUAGE
PROCESSING

by

Ira Goldstein



A Dissertation
Submitted to the University at Albany, State University of New York
in Partial Fulfillment of
the Requirements for the Degree of
Doctor of Philosophy


College of Computing and Information
Department of Informatics

2011


Automated Classification of the Narrative of Medical Reports
using Natural Language Processing


by


Ira Goldstein








COPYRIGHT 2011




Abstract
In this dissertation we present three topics critical to the document level
classification of the narrative in medical reports: the use of preferred terminology in light
of the presence of synonymous terms, the less than optimal performance of classification
systems when presented with a non-uniform distribution of classes, and the problems
associated with scarcity of labeled data when presented with an imbalance of classes in
the data sets.
The literature is replete with instances of conflicting reports regarding the value of
applying preferred terminology to improve system performance when presented with
synonymous terms. Our study shows that the addition of preferred terms to the text of the
medical reports helps to improve true positives for a hand-crafted rule-based system and
that the addition did not consistently improve performance for the two machine learning
systems. We show that the differences in the data, task, and approach can account for the
variations in these results as well as the conflicting reports in the literature.
The imbalance of classes in data sets can cause suboptimal classification
performance by systems based on an exploration of statistics for representing attributes of
data. To address this problem, we developed specializing, a panel of one-versus-all
classifiers activated in a strict order, and applied it to an imbalanced data
set. We show that specializing performs significantly better than voting and stacking
panels of classifiers when used for multi-class classification on our data.
Machine learning systems need labeled data in order to be trained, which is
expensive to develop and may not always be readily available. We combine the semi-
supervised approach of co-training with specializing in order to address the issues
associated with a scarcity of labeled examples when presented with an imbalance of
classes in the data sets. We show that by combining co-training and specializing, we are
able to consistently improve recall on the less well-represented classes, even when trained
on a small number of labeled samples.


Acknowledgements
The path that led to this dissertation was one that could not have been traveled
without the support of many people. I was privileged to have an exceptional and
encouraging dissertation committee. I would like to thank Dr. Özlem Uzuner, my
committee chair, for sharing your professional world with me and for holding me to the
highest standards. You will always be my mentor and friend. I want to thank Dr. Jagdish
Gangolly and Dr. George Berg, the other members of my committee, for your valuable
suggestions which helped in the writing of this dissertation, and for our conversations
which expanded my horizons. While the department has a number of outstanding faculty
members, I would like to offer a special thank you to Drs. David and Deborah Andersen
for opening your home and your hearts to everyone in the Informatics program.
Few paths are straight and level. My dissertation experience had many peaks and
valleys. I would like to thank Fawzi and Dana Mulki, Meg Fryling, Jack Rivituso, and
Wil Doane for sharing the high points and for raising me up when I was low.
This dissertation would not have been possible were it not for my family. Thank you
to my parents, Allen and Pearl Goldstein, for giving me the confidence in my ability to do
anything that I set my mind to. Thank you to my sons, Matthew and Eric, for your love
and support. Last, but nowhere near least, to my wife Marilyn. Thank you for everything.
You cleared the path for me on this journey. I am very fortunate to be able to travel on all
of life’s journeys with you.




Table of Contents
Abstract .............................................................................................................................. iii
Acknowledgements ............................................................................................................. v
Table of Contents ............................................................................................................... vi
Figures ................................................................................................................................ xi
Tables ............................................................................................................................... xiii
Chapter 1. Introduction ................................................................................................. 1
1.1. Key Concepts ...................................................................................................... 2
1.2. Research Questions ............................................................................................. 3
1.2.1. Preferred Terminology ................................................................................ 4
1.2.2. Imbalanced Data Sets .................................................................................. 5
1.2.3. Availability of Labeled Data ....................................................................... 6
1.3. Approach, Design, and Contribution .................................................................. 7
1.4. Structure of the Dissertation ............................................................................... 8
1.5. Summary ........................................................................................................... 10
Chapter 2. Related Literature ...................................................................................... 11
2.1. Introduction ....................................................................................................... 11
2.2. Natural Language Processing ............................................................................ 11
2.2.1. Morphology ............................................................................................... 12
2.2.2. Syntactics .................................................................................................. 13
2.2.3. Semantics .................................................................................................. 15
2.3. Automated Classification .................................................................................. 24
2.3.1. Hand-Crafted Rules .................................................................................. 25


2.3.2. Supervised Machine Learning ................................................................... 27
2.3.3. Unsupervised Machine Learning .............................................................. 31
2.3.4. Semi-supervised Machine Learning .......................................................... 31
2.3.5. Combining Classifiers ............................................................................... 32
2.4. Challenges and Limitations in the Literature .................................................... 34
2.5. Summary ........................................................................................................... 37
Chapter 3. Medical Records and Evaluation Metrics ................................................. 38
3.1. Introduction ....................................................................................................... 38
3.2. Data ................................................................................................................... 38
3.2.1. CMC Data Set ........................................................................................... 39
3.2.2. i2b2 Data Sets ........................................................................................... 40
3.2.3. Institutional Review Board ....................................................................... 44
3.3. Evaluation Metrics ............................................................................................ 45
3.4. Summary ........................................................................................................... 47
Chapter 4. Examining the Role of Preferred Terminology in the Document Level
Classification of Medical Reports ..................................................................................... 48
4.1. Introduction ....................................................................................................... 48
4.2. Controlled Vocabulary ...................................................................................... 50
4.3. Data ................................................................................................................... 52
4.4. Methods ............................................................................................................. 53
4.4.1. Hand-Crafted Rule-Based System ............................................ 54
4.4.2. Lucene Classifier ....................................................................................... 58
4.4.3. BoosTexter Classifiers .............................................................................. 60


4.4.4. Noun Phrase Detection and Preferred Term Identification ....................... 62
4.4.5. Surface Processing .................................................................................... 70
4.5. Evaluation ......................................................................................................... 71
4.6. Results and Discussion ..................................................................................... 73
4.6.1. Hand-Crafted Rule-Based System ............................................................ 73
4.6.2. NPDP Post-Hoc Coordination .................................................................. 74
4.6.3. Lucene Classifier ....................................................................................... 75
4.6.4. BoosTexter Classifiers .............................................................................. 77
4.7. Conclusions ....................................................................................................... 80
4.8. Summary ........................................................................................................... 82
Chapter 5. Specializing for Predicting Obesity and its Co-morbidities ...................... 83
5.1. Introduction ....................................................................................................... 83
5.2. Related Work .................................................................................................... 87
5.2.1. Combining Classifiers ............................................................................... 87
5.2.2. Combined Classifiers ................................................................................ 88
5.3. Materials and Methods ...................................................................................... 90
5.3.1. Data ........................................................................................................... 91
5.3.2. Training and Test Sets .............................................................................. 92
5.3.3. Feature Extraction ..................................................................................... 92
5.3.4. Weka ......................................................................................... 95
5.3.5. Specializing ............................................................................................... 97
5.3.6. Evaluation Metrics and Methods ............................................................ 105
5.4. Results and Discussion ................................................................................... 113


5.4.1. Aggregate Result Analysis ...................................................................... 113
5.4.2. Disease-Level Results Analysis .............................................................. 117
5.4.3. Specializing analysis and discussion ....................................................... 119
5.5. Limitations and future work ............................................................................ 120
5.6. Accessibility of Data ....................................................................................... 122
5.7. Conclusions ..................................................................................................... 123
5.8. Acknowledgements ......................................................................................... 123
Chapter 6. Co-Specializing: Addressing the Scarcity of Labeled Data .................... 125
6.1. Introduction ..................................................................................................... 125
6.2. Co-Training ..................................................................................................... 127
6.3. Co-Specializing Classifier .............................................................................. 128
6.4. Data ................................................................................................................. 131
6.4.1. Feature Extraction ................................................................................... 132
6.4.2. Seed, Training, and Test Sets .................................................................. 134
6.4.3. Data Views .............................................................................................. 135
6.5. System Development ...................................................................................... 137
6.6. Evaluation ....................................................................................................... 138
6.7. Results and Discussion ................................................................................... 140
6.7.1. Aggregate Obesity Results ...................................................................... 140
6.7.2. Individual Obesity Disease Results ......................................................... 143
6.7.3. Smoking Status Results .......................................................................... 145
6.7.4. Co-Specializing Discussion .................................................................... 153
6.8. Conclusions ..................................................................................................... 154


6.9. Summary ......................................................................................................... 155
Chapter 7. Conclusions ............................................................................................. 156
7.1. Introduction ..................................................................................................... 156
7.2. Major Findings ................................................................................................ 156
7.2.1. Preferred Terminology ............................................................................ 156
7.2.2. Imbalanced Data Sets .............................................................................. 159
7.2.3. Availability of Labeled Data ................................................................... 161
7.3. Strengths and Limitations ............................................................................... 162
7.4. Issues for Additional Exploration ................................................................... 164
References ....................................................................................................................... 166
Appendix ......................................................................................................................... 189
A. Instructions for Annotating Semantic Correctness ................................................. 189
B. Abbreviations ......................................................................................................... 191



Figures
Figure 1 - Sample radiology report from the XML ICD-9-CM data released as part of the
2007 CMC challenge. ....................................................................................................... 40
Figure 2 - Sample discharge summary from i2b2 Shared-Task and Workshop on
Challenges in Natural Language Processing for Clinical Data: Obesity Challenge. ........ 42
Figure 3 - Hand Crafted Rule-Based Coder - Developed from an examination of the test
data set. Synonyms developed from ICD-9-CM code definitions and expanded with
recurrent terms identified in the text. Employs NegEx’s pre-UMLS uncertainty and
negation phrases. Given the task of assigning codes only for definitive diagnoses,
uncertainty phrases are treated as negation. After processing with synonymy, uncertainty,
and negation, a series of simple rules is applied. ............................................................. 57
Figure 4 - Lucene Based Coder – Training data are indexed (tf•idf) using the Apache
Lucene libraries. Test samples are treated as queries to the index. The top three ranked
indexed (training) documents are retrieved. Any codes used by a majority of the retrieved
records are assigned to the test sample. If there is no code in common among the top three
retrieved documents, the fourth highest ranked document is retrieved. Any codes in
common among the four retrieved records are assigned to the sample. If no codes are in
common, a NULL code is assigned to the test sample. .................................................... 60
Figure 5 - Boosting Based Coder – Boosting is a machine learning algorithm for
improving the performance of supervised learning systems. It performs several iterations
of breaking the data into subsamples and training “weak learners” (i.e., classifiers that
perform slightly better than chance). The weak learners are combined to create the
boosted classifier. The boosted classifier assigns codes to each sample in the test data. . 62


Figure 6 - General and Applied Processes for Selecting Specialist Classifiers ................ 99
Figure 7 - General and Applied Processes for Selecting Catch-all Classifiers. .............. 100
Figure 8 - Specializing Classifiers for a General Task and for the Disease Gallstones. . 101



Tables
Table 1 - i2b2 Obesity Data. Y=Present, N=Absent, Q=Questionable, and
U=Unmentioned. Number of samples in each class in the training and test sets. ............. 43
Table 2 - Smoking Status Training and Test Data. Number of samples per class ............ 44
Table 3 - UMLS Semantic Types and Codes .................................................................... 63
Table 4 - Number of times disease related noun phrases were detected by NPDP. .......... 64
Table 5 - Consistency of the NPDP assigned preferred terms in the CMC data set. ........ 67
Table 6 - Consistency of a sample of 4000 NPDP assigned preferred terms from the i2b2
Obesity data set. ................................................................................................................ 67
Table 7 - CMC Corrected Preferred Term Lucene Classifier runs at several levels of
manually induced inconsistency. ...................................................................................... 68
Table 8 - CMC Corrected Preferred Term Unigram BoosTexter Classifier runs at several
levels of manually induced inconsistency. ........................................................................ 68
Table 9 - CMC Corrected Preferred Term N-Gram BoosTexter Classifier runs at several
levels of manually induced inconsistency. ........................................................................ 68
Table 10 - Hand-Crafted Rule-Based System Performances. ........................................... 74
Table 11 - Number of unique noun phrases and unique preferred terms by data set. ....... 75
Table 12 - Lucene Classifier Performance. ....................................................................... 77
Table 13 - BoosTexter Classifier Performance. ................................................................ 79
Table 14 - Aggregate Results for Classifying Obesity and Fifteen of its Co-morbidities by
Combined and Complementary Classifiers ..................................................................... 109


Table 15 - Performance per Class, Aggregated Over All Diseases. ............................... 110
Table 16 - Macro-averaged F-measure by Disease. ........................................................ 112
Table 17 - Assignments by Specialists and Catch-all Classifiers by Disease. ................ 113
Table 18 - Obesity Ground Truth. ................................................................................... 135
Table 19 - Smoking Ground Truth. ................................................................................. 135
Table 20 - Base Classifier F-Measure for each disease and smoking data set on each of
the data views. ................................................................................................................. 137
Table 21 - Co-Specializing Experiments ........................................................................ 138
Table 22 - Performance per class, aggregated over six diseases. .................................... 141
Table 23 - Asthma, performance per class. ..................................................................... 146
Table 24 - Atherosclerotic CV Disease, performance per class. ..................................... 147
Table 25 - Diabetes, performance per class. ................................................................... 148
Table 26 - GERD, performance per class. ...................................................................... 149
Table 27 - Hypercholesterolemia, performance per class. .............................................. 150
Table 28 - Obesity, performance per class. ..................................................................... 151
Table 29 - Smoking, performance per class. ................................................................... 152
Table 30 - Abbreviations used. ....................................................................................... 191



Chapter 1. Introduction
The application of natural language processing (NLP) to the narrative of medical
reports can inform many applications, including those that provide syndromic
surveillance (Shapiro, 2004), develop patient problem lists (Bui, Taira, El-Saden,
Dordoni, & Aberle, 2004), aid decision support (Fiszman, Chapman, Aronsky, & Evans,
2000), and assign billing codes (Farkas & Szarvas, 2008). The objective of this
dissertation was to develop methods that improve the automatic document level
classification of the narrative of medical reports. Manual assignment of classes to
documents is laborious and is subject to the inconsistencies associated with any
labor-intensive process. In addition to reducing the labor involved, automatic assignment of
classes also has the benefit of eliminating irregularities in classification that can arise due
to human error. This chapter first describes key concepts, then develops the research
topics, examines the approach, design, and contribution, and lastly presents a roadmap of
the subsequent chapters.
This dissertation presents three topics that are critical to the document level
classification of the narrative of medical reports: the application of preferred terminology
in light of the presence of synonymous terms, the less than optimal performance of
classification systems when presented with a non-uniform distribution of classes, and the
problems associated with scarcity of labeled data when presented with an imbalance of
classes in the data sets. In order to insure a common understanding, we first define several
key concepts before presenting research topics.



1.1. Key Concepts
Synonyms are words or phrases that have the same essential sense or meaning. For
example, kidney stones and renal calculi both refer to the same medical condition. The
presence of these semantically-equivalent but lexically-disparate terms can lead to errors
in classification, question answering, and information retrieval. A controlled vocabulary
attempts to address this problem by creating a predefined list of preferred terms, each of
which describes a single concept. A preferred term gathers lexically-disparate but
semantically-equivalent terms under one term and provides a consistent means of
referring to a given concept. While the extensive use of controlled vocabularies has been
reported in the medical informatics literature, the results have been mixed. As described
in detail in section 2.2.3.3, some of the reports have shown that the application of
controlled vocabularies to clinical records can aid many tasks, while others have shown
no improvement. The conflict in the literature prompted this investigation into the
applicability of preferred terms to the classification of narrative medical records.
Real-world data is messy. The distribution of classes is typically skewed (e.g., in
medical records, the presence of a disease in patients is less frequent than the absence of
the disease), with few examples of the less well-represented classes. This non-uniform
distribution of classes and sparsity of samples can pose challenges to methods, such as
classification, that are based on an exploration of statistics for representing attributes of
data. In statistical classification, the small number of examples offered by a sparse, less
well-represented class may adversely affect the training of a classifier on that class. This
suboptimal classification performance of the less well-represented classes occurs as a
result of maximizing overall performance. While challenging to classify, the less
well-represented classes are often the more interesting classes (Kotsiantis & Pintelas, 2009)
and it is important to classify them correctly.
Training statistical machine learning classifiers to predict labels for pre-determined
classes requires labeled data. The cost associated with the labor of manually labeling a
data set is high and resources are often limited. This problem is compounded when trying
to label a sufficient number of examples given the inherently skewed distribution of most
data sets. Developing approaches that address the imbalance found in data and the limited
availability of labeled data is both important and challenging.
1.2. Research Questions
This dissertation addresses three topics that are important to the classification of the
narrative of medical reports. The first topic relates to the presence of synonymous terms
in the narrative of the medical reports and the ability to achieve a semantic understanding
of the text. Does classification performance improve when synonymous terms are
brought together under a standardized preferred term? The second topic looks at the
imbalance caused by the non-uniform distribution of classes in the data that can cause
less than optimal performance in classification systems. When presented with a non-
uniform distribution of classes in the data, can the application of a panel of one-versus-
all classifiers improve performance on the less well-represented classes? The third topic
concerns the limited amount of available labeled data for use in training classification
systems. When presented with a limited amount of labeled data for use in training
classification systems, can the application of a panel of one-versus-all classifiers improve
performance on the less well-represented classes? In order to address these topics, this


dissertation examined eight specific research questions that are detailed in the following
three sections.
1.2.1. Preferred Terminology
The first set of questions relate to issues brought about by synonymy. A concept can
be represented by semantically-equivalent but lexically-disparate terms (i.e., synonyms).
The presence of synonyms in medical reports may mask the semantic similarity between
the synonyms (Cimino, Hripcsak, & Johnson, 1994) and lead to errors in translation,
question answering, and information retrieval. Bringing synonyms together under a
standardized preferred term can obviate the lexical disparity between semantically-
equivalent terms, and can help reduce the errors caused by lexical disparity.
As described fully below in the section on Synonymy (page 21), the literature is
replete with examples of problems related to lexical disparity caused by synonyms. While
some studies have reported improved system performance when resolving lexical
disparity, cases have also been reported where resolving lexical disparity does not
improve system performance. These conflicting reports in the literature inspired the
following research questions:
• Will the addition of preferred terms for diseases and symptoms improve system
performance of a hand-crafted rule-based system in a multi-label document
classification task?
• Will the addition of preferred terms for diseases and symptoms, when enhanced
with the polarity of assertions about diseases and symptoms, improve system


performance of a hand-crafted rule-based system in a multi-label document
classification task?
• Will the addition of preferred terms for diseases and symptoms improve system
performance of machine learning systems in a multi-label document
classification task?
• Will the addition of preferred terms for diseases and symptoms, when enhanced
with the polarity of assertions about diseases and symptoms, improve system
performance of machine learning systems in a multi-label document
classification task?
1.2.2. Imbalanced Data Sets
The second set of research questions relate to imbalances found in data sets.
Chawla, Japkowicz, and Kotcz (2004) note problems when presented with a non-uniform
distribution of samples among the classes (e.g., there are very few cases of a particular
disease when compared to the total number of patients seen by a physician). They believe
that the imbalance caused by the non-uniform distribution of classes in data sets is
pervasive and ubiquitous. This imbalance may cause suboptimal classification
performance by systems based on an exploration of statistics for representing attributes of
data. Given a non-uniform distribution of classes, machine learning classifiers may
simply ignore the less well-represented classes in order to maximize overall performance
(Tang & Liu, 2005). These observations motivated the following research questions:


• Can the application of a panel of one-versus-all classifiers, when activated in a
strict order, outperform J48, Naïve Bayes, and AdaBoost.M1 classifiers on the
less well-represented classes in a multi-class, multi-label task?
• Can the application of a panel of one-versus-all classifiers, when activated in a
strict order, outperform the voting and stacking approaches to combining J48,
Naïve Bayes, and AdaBoost.M1 classifiers on the less well-represented classes
in a multi-class, multi-label task?
1.2.3. Availability of Labeled Data
The third set of research questions relate to the limited availability of labeled data.
Despite initiatives such as the Semantic Web (Berners-Lee, 1998) and the HL7 standard for
electronic health records (Dolin et al., 2006), the contents of many sources of data remain
predominantly an uncategorized and unclassified bastion of free text. Even though
unclassified data may be obtainable, creating labeled data with which to train machine
learning classification systems requires human expertise (Zhou, 2009) and can be tedious
and expensive to develop (Zhong, 2005).
Semi-supervised methods of machine learning, such as co-training (Blum &
Mitchell, 1998), utilize unlabeled data to improve the results of supervised machine
learning systems that have been trained with a small labeled training set. However, semi-
supervised methods of machine learning do not inherently address the issues associated
with imbalanced data sets. These observations prompted the following research
questions:


• Does the application of co-training to a panel of one-versus-all classifiers, when
activated in a strict order, outperform J48, Naïve Bayes, and SVM classifiers on
the less well-represented classes in a multi-class classification task when
presented with a scarcity of labeled data?
• Does the application of co-training to a panel of one-versus-all classifiers, when
activated in a strict order, outperform J48, Naïve Bayes, and SVM based
co-training classifiers on the less well-represented classes in a multi-class
classification task when presented with a scarcity of labeled data?
1.3. Approach, Design, and Contribution
The narrative of medical reports can inform many applications. However, in order
to extract meaningful information, the medical reports need to be processed. The purpose
of this dissertation was to improve the understanding and use of foundational natural
language processing tools and of machine learning approaches in classifying the narrative
of medical reports. This research focused on one domain in order to prevent an overly
complex undertaking. This dissertation extended previous work in both NLP and machine
learning.
This dissertation used multiple data sets from multiple sources. These data sets
represented two different types of documents, radiological reports and discharge
summaries. As detailed in section 3.2, the characteristics of the radiological reports differ
from those of the discharge summaries.


This dissertation used multiple approaches to classification: hand-crafted rule-based
and machine learning approaches. The machine learning approaches used individual
classification algorithms as well as panels of individual classifiers.
This dissertation used baselines and intermediate systems with which to answer the
above stated research questions. Comparing full systems to both baseline and
intermediate systems enabled analysis that determined which component, or combination
of components, provided the improvement of system performance.
1.4. Structure of the Dissertation
This dissertation is presented in several parts. Chapters 1-3 provide context and
background for the reader. Chapters 4-6 present essays that address each of the research
questions. Chapter 7 reviews the findings, discusses limitations, and discusses issues for
additional exploration.
Chapter 1 presents an overview of the research, objectives, and research questions.
This dissertation is informed by work in the domains of natural language processing and
machine learning. Chapter 2 presents relevant literature in these two domains, focusing
on their application to the narrative of medical reports, and explores the strengths and
weaknesses found in the literature. Understanding the characteristics of the data provides
context for the results. Chapter 3 describes the data sets employed in this dissertation and
the evaluation metrics used in the experiments.
Chapter 4 examines the issue of lexical disparity and the role preferred terminology
plays as a means of post-hoc coordination in automatic classification of patient medical
data. We apply natural language processing tools to the free text of the medical reports.


We build and evaluate both a hand-crafted rule-based system and two machine learning
systems: one based upon Lucene and the other based upon BoosTexter. We experiment
with medical reports from two different corpora: one containing radiology reports and the
other containing discharge summaries. We gather semantically-equivalent but lexically-
disparate medical terms used in these reports under preferred terms for multi-label
classification of the medical reports.
Chapter 5 reports on a new method for combining classifiers that helps to address
the problems associated with imbalance of classes in the data sets. In this chapter we
present specializing, a method for combining classifiers for multi-class classification.
Specializing trains one specialist classifier per class and utilizes each specialist to
distinguish that class from all others in a one-versus-all (OVA) manner. It then
supplements the specialist classifiers with a catch-all classifier that performs multi-class
classification across all classes. We refer to the resulting combined classifier as a
specializing classifier.
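
To make the arrangement concrete, the following Python sketch shows one way such a panel can be organized. It uses scikit-learn stand-ins for the Weka learners described in Chapter 5; the default base learners and the override logic here are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

class SpecializingClassifier:
    """Sketch of a specializing panel: one one-versus-all specialist per class,
    consulted in a strict order, plus a multi-class catch-all for samples that
    no specialist claims. Illustrative only; Chapter 5 selects the base
    learners and their activation order per disease."""

    def __init__(self, class_order, specialist=None, catch_all=None):
        self.class_order = class_order  # strict activation order
        self.specialists = {c: clone(specialist or DecisionTreeClassifier())
                            for c in class_order}
        self.catch_all = catch_all or MultinomialNB()

    def fit(self, X, y):
        y = np.asarray(y)
        for c in self.class_order:           # one OVA specialist per class
            self.specialists[c].fit(X, (y == c).astype(int))
        self.catch_all.fit(X, y)             # multi-class fallback
        return self

    def predict(self, X):
        # Start from the catch-all's labels, then let specialists override:
        # the first specialist in the activation order to claim a sample wins,
        # so the catch-all effectively labels only unclaimed samples.
        preds = np.array(self.catch_all.predict(X), dtype=object)
        claimed = np.zeros(len(preds), dtype=bool)
        for c in self.class_order:
            hit = (self.specialists[c].predict(X) == 1) & ~claimed
            preds[hit] = c
            claimed |= hit
        return preds
```

Because the specialists are activated in a strict order, an earlier specialist's positive decision is never revisited by a later one; the catch-all resolves only the samples that no specialist claims.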
Chapter 6 reports on the application of specializing to the problem of imbalance
given a scarcity of labeled data. In this chapter we present co-specializing, a method that
combines the approaches of co-training and specializing to address the issues associated
with a scarcity of labeled examples when presented with an imbalance of classes in the
data sets. Co-specializing employs two panels of classifiers, each panel making use of a
different view of the data.
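
As a rough illustration of the underlying co-training mechanism (not of co-specializing itself), the loop below assumes two pre-computed feature views and any pair of scikit-learn-style classifiers exposing predict_proba; the round count and growth schedule are invented for the sketch.

```python
import numpy as np

def co_train(clf_a, clf_b, Xa, Xb, y, Ua, Ub, rounds=10, per_round=4):
    """Co-training sketch (after Blum & Mitchell, 1998): two classifiers, one
    per view of the data, repeatedly add their most confident predictions on
    the unlabeled pool to the shared labeled set. Co-specializing replaces
    clf_a and clf_b with specializing panels."""
    Xa, Xb, y = list(Xa), list(Xb), list(y)
    pool = list(range(len(Ua)))                  # indices of unlabeled samples
    for _ in range(rounds):
        clf_a.fit(np.array(Xa), np.array(y))
        clf_b.fit(np.array(Xb), np.array(y))
        for clf, U in ((clf_a, Ua), (clf_b, Ub)):
            if not pool:
                return clf_a, clf_b
            proba = clf.predict_proba(np.array([U[i] for i in pool]))
            # This view's most confident unlabeled samples ...
            best = np.argsort(proba.max(axis=1))[-per_round:]
            for j in sorted((pool[b] for b in best), reverse=True):
                label = clf.predict(np.array([U[j]]))[0]
                # ... join the labeled set with BOTH views' features.
                Xa.append(Ua[j]); Xb.append(Ub[j]); y.append(label)
                pool.remove(j)
    return clf_a, clf_b
```

Note that this sketch promotes whatever samples the learner is most confident about, regardless of class; Chapter 6 describes how co-specializing handles the class imbalance that such a naive schedule would otherwise reinforce.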
This dissertation concludes with Chapter 7, which reviews the findings, discusses
limitations, and proposes future research. This is followed by a
list of references cited, the


instructions used for annotating semantic correctness, and a list of abbreviations used in the
dissertation.
1.5. Summary
This dissertation contributes to our understanding of the role preferred terminology
plays in the classification of the narrative of medical reports. This dissertation also
provides a new method for combining classifiers. This method addresses issues with
statistical machine learning classifiers when they are presented with a non-uniform
distribution of samples among the classes. Lastly, this dissertation examines the
capability of this new method when presented with a small set of labeled samples and a
large set of unlabeled samples.



Chapter 2. Related Literature
2.1. Introduction
This dissertation is informed by work in the domains of natural language processing
and machine learning. Both of these domains employ textual data. This chapter reports on
the relevant literature in these two domains, focusing on their application to the narrative
of medical reports, and explores the strengths and weaknesses in the literature.
The literature review begins with a description of natural language processing and
the morphologic, syntactic, and semantics tools employed. It continues by describing
research into issues related to ambiguity, negation, and synonymy. This chapter then
proceeds with a look at automated classification, including hand-crafted rules, supervised,
unsupervised, and semi-supervised machine learning methods, as well as approaches for
combining multiple classifiers. This section finishes with a discussion of the conflicts and
questions of generalizability found in the literature.
2.2. Natural Language Processing
Natural language processing provides morphologic, syntactic, and semantic tools
(Collier & Takeuchi, 2004; Manning & Schütze, 1999) to help transform stored text from
raw data (e.g., dictated text) to useful information (e.g., a standardized diagnostic
classification). Morphologic tools look at stored text as a sequence of linguistic units
without regard to context or meaning. Syntactic tools provide grammar-related
information about the text. Semantic tools, such as thesauri and ontologies, provide
information about the meaning and sense of words. Taken together, these NLP tools can
be used to resolve issues brought about by ambiguity, negation, and synonymy found in


stored text. NLP has been successfully applied to stored text in a wide range of
applications, including machine translation (Somers, 1999; Weaver, 1955), question
answering (Q.-l. Guo & Zhang, 2009; Katz & Lin, 2002), information retrieval (Lewis &
Spärck Jones, 1996; Sager, 1976; Tzoukermann, Klavans, & Strzalkowski, 2003),
document summarization (Luhn, 1958; Spärck Jones, 2007), and author identification
(Mosteller & Wallace, 1963; Stamatatos, 2009).
2.2.1. Morphology
The first NLP step in transforming stored text into useful information is to divide
the text into linguistic units (Mikheev, 2003). Each linguistic unit corresponds to a word,
number, or punctuation mark. These linguistic units are referred to as tokens. The process
of dividing the text into linguistic units is called tokenization.
Morphologic tools look at stored text as a sequence of tokens. Morphologic tools
can provide information about token frequency (Zipf, 1949), document length, and
vocabulary richness (Salton & McGill, 1983). To be more effective, syntactic and
semantic tools can benefit from surface processing provided by morphological string
manipulation.
Two common types of surface processing are case conversion and stemming, both
of which attempt to conflate variations of a given token. Case conversion transforms each
token so that all of the alphabetic characters are the same case. For example, when
converting to lower case, the tokens EXTRACT and Extract would both become extract.
Stemming brings together variations of a word (Lovins, 1968; Porter, 1980) by
manipulating word suffixes. For example, the tokens extraction and extracted would both


be transformed into the token extract. It should be noted that in tasks such as information
retrieval, stemming typically improves recall while reducing precision (Tzoukermann et
al., 2003), performance metrics commonly used in natural language processing tasks (see
section 3.3 for a full description of these metrics). Stemming algorithms do not always
correctly handle the inconsistencies present in natural languages. Lovins (1971) describes
under-stemming and over-stemming issues that occur in stemming algorithms.
Under-stemming occurs when stemming fails to bring together two words that should
otherwise be brought together (e.g., patience and patiently). Over-stemming occurs when
stemming brings together words that should be kept separate (e.g., experiment and
experience both stem to experi).
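
A minimal Python sketch of this surface processing, assuming NLTK's implementation of the Porter stemmer (the printed conflations are the expected behavior of that stemmer):

```python
import re
from nltk.stem import PorterStemmer  # an implementation of Porter (1980)

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

stemmer = PorterStemmer()
tokens = tokenize("EXTRACT Extract extraction extracted experiment experience")
# Case conversion, then stemming, conflates variants of each token.
print([stemmer.stem(tok.lower()) for tok in tokens])
# -> ['extract', 'extract', 'extract', 'extract', 'experi', 'experi']
#    (the last two illustrate over-stemming)
```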
2.2.2. Syntactics
In addition to morphological processing, written text can be analyzed by looking at
the rules for arranging words into phrases, sentences, and paragraphs (i.e., syntactics), as
well as by looking at the meaning and sense of the words (i.e., semantics). Syntactic tools
provide grammar-related information about the text. Syntactic tools can parse the text,
examining each token in context (i.e., in relationship to the surrounding tokens), and can
tag each token in the text with its grammatical part of speech (e.g., verb, noun, adjective)
(Brill, 1992; Church, 1988; DeRose, 1988; Janas, 1977). For example, pool is a noun in
They swam in the pool. but a verb in They decided to pool their resources. The
information provided by part of speech (POS) tools is not limited to individual tokens.
Shallow parsing divides the text into non-overlapping series of contiguous tokens
(chunks) based upon their linguistic properties (Federici, Montemagni, & Pirrelli, 1996).


Each chunk is then tagged with phrasal information. For example, the tokens urinary,
tract, and infection when appearing together can be identified as the distinct noun phrase
urinary tract infection. Approaches to tagging have been based upon both statistical (e.g.,
context-pattern rules and Hidden Markov Models) and non-statistical (e.g.,
transformation-based rules and context free grammars) methods.
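
As an illustration, the following sketch tags and shallow-parses the example above with NLTK; the chunk grammar (an optional determiner, any adjectives, and one or more nouns) is an assumption made for the sketch, not a general-purpose noun phrase grammar.

```python
import nltk
# Assumes the standard NLTK models have been downloaded, e.g.:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The patient has a urinary tract infection.")
tagged = nltk.pos_tag(tokens)   # e.g., [('The', 'DT'), ('patient', 'NN'), ...]

# Shallow parse: chunk determiner/adjective/noun runs into noun phrases.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
# Typically prints: "The patient" and "a urinary tract infection"
```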
Grammatical parsing and POS tagging can be accomplished in many different ways.
Greene & Rubin’s TAGGIT (1971) uses hand-crafted context-pattern rules to assign POS
tags. These assignments are made based upon a word’s character patterns as well as upon
the POS tags of the words to the right and to the left (i.e., its local context). This early
system could disambiguate approximately 77 percent of the words in the Brown
University million word corpus (Voutilainen, 2003). The University of Lancaster
CLAWS system (Marshall, 1983, 1987) extends TAGGIT by applying probabilities of the
occurrence of bigrams (pairs of consecutive words), derived from a subset of the Brown
corpus. This use of statistical corpus evidence improves accuracy at the expense of
needing a tagged corpus in order to determine bigram probabilities.
A second statistical approach, Hidden Markov Models (HMM), has been
successfully applied to POS tagging (Cutting, Kupiec, Pedersen, & Sibun, 1992). One
advantage of an HMM approach is that it does not require previously tagged text for
training. This approach is considered hidden since the POS categories are not observable.
A second advantage is that HMMs operate on sentences and not on a word’s local
context. While HMMs can be very accurate, since they do not build upon pre-tagged text,
the results stray at times from correct tag assignments (Abney, 1997).
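
The decoding step of an HMM tagger can be shown with a small worked example. The log-space Viterbi sketch below resolves the noun/verb ambiguity of pool over a toy tag set; all probabilities are invented for illustration (a real HMM would estimate them, without tagged text, during training), and pronouns are folded into NOUN for simplicity.

```python
import math

# Toy HMM (all probabilities are assumptions for illustration).
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1}
trans_p = {"DET":  {"DET": 0.01, "NOUN": 0.89, "VERB": 0.10},
           "NOUN": {"DET": 0.10, "NOUN": 0.30, "VERB": 0.60},
           "VERB": {"DET": 0.50, "NOUN": 0.40, "VERB": 0.10}}
emit_p = {"DET":  {"the": 1.0},
          "NOUN": {"pool": 0.3, "they": 0.4},
          "VERB": {"pool": 0.2, "swam": 0.4}}

def viterbi(words):
    """Return the most probable tag sequence for the sentence (log-space)."""
    paths = {t: (math.log(start_p[t]) + math.log(emit_p[t].get(words[0], 1e-9)), [t])
             for t in tags}
    for w in words[1:]:
        paths = {t: max((score + math.log(trans_p[prev][t])
                         + math.log(emit_p[t].get(w, 1e-9)), hist + [t])
                        for prev, (score, hist) in paths.items())
                 for t in tags}
    return max(paths.values())[1]

print(viterbi(["the", "pool"]))   # -> ['DET', 'NOUN']
print(viterbi(["they", "pool"]))  # -> ['NOUN', 'VERB']
```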


Transformation-based tagging (Brill, 1992, 1995; Hindle, 1989) is a non-statistical
syntactic NLP tool. Transformation-based tagging is an iterative process that generates an
ordered list of rules. The process begins by tagging known words with their most-
frequently occurring tag, and tagging unknown words with an arbitrary tag such as noun.
The process then compares the current tagging with a hand-tagged training text. For each
tagging error identified, a candidate new rule is generated. If the new rule corrects more
tags than it breaks, it is then added to the ordered list as an exception and the tagging is
updated based upon the new rule. The process then repeats the comparison to the hand-
tagged training set and new rule creation steps until no new rules are added to the ordered
list.
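
The loop below is a schematic, self-contained rendering of this process using a single rule template ("retag FROM as TO when the preceding tag is PREV"); Brill's tagger uses a much richer template set, and the toy data, tags, and lexicon here are hypothetical.

```python
from itertools import product

def initial_tags(sents, lexicon):
    """Start state: most frequent known tag per word, NOUN for unknown words."""
    return [[lexicon.get(w, "NOUN") for w in sent] for sent in sents]

def apply_rule(rule, tagged):
    """Apply 'retag FROM as TO when the preceding tag is PREV' in one pass."""
    frm, to, prev = rule
    out = []
    for sent in tagged:
        new = list(sent)
        for i in range(1, len(sent)):
            if sent[i] == frm and sent[i - 1] == prev:
                new[i] = to
        out.append(new)
    return out

def errors(tagged, gold):
    return sum(t != g for ts, gs in zip(tagged, gold) for t, g in zip(ts, gs))

def learn_rules(sents, gold, lexicon, tagset, max_rules=20):
    current = initial_tags(sents, lexicon)
    rules = []
    while len(rules) < max_rules:
        best, best_gain = None, 0
        for rule in product(tagset, tagset, tagset):
            if rule[0] == rule[1]:
                continue
            gain = errors(current, gold) - errors(apply_rule(rule, current), gold)
            if gain > best_gain:   # keep only rules that correct more than they break
                best, best_gain = rule, gain
        if best is None:
            break
        rules.append(best)
        current = apply_rule(best, current)
    return rules

# Hypothetical toy data: the verb "pool" after infinitival "to" starts mistagged.
sents = [["to", "pool", "the", "resources"]]
gold = [["TO", "VERB", "DET", "NOUN"]]
lexicon = {"to": "TO", "the": "DET", "pool": "NOUN", "resources": "NOUN"}
print(learn_rules(sents, gold, lexicon, ["TO", "DET", "NOUN", "VERB"]))
# -> [('NOUN', 'VERB', 'TO')]
```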
Another non-statistical approach to parsing text utilizes context free grammars. A
context free grammar looks at the syntax of a language “as having a small, possibly finite
kernel of basic sentences … along with a set of transformations that can be applied to
kernel sentences or to earlier transforms to produce new and more complicated sentences
from elementary components” (Chomsky, 1956, p. 124). Context free grammars are
useful in analyzing the syntax of languages and sub-languages and can be used to parse
free text.
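
For example, NLTK's chart parser can parse a sentence against a tiny context free grammar whose rules are assumed purely for illustration:

```python
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N | Det N N N
VP -> V NP
Det -> 'the' | 'a'
N  -> 'patient' | 'urinary' | 'tract' | 'infection'
V  -> 'has'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the patient has a urinary tract infection".split()):
    print(tree)
# (S (NP (Det the) (N patient))
#    (VP (V has) (NP (Det a) (N urinary) (N tract) (N infection))))
```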
2.2.3. Semantics
While NLP tools can resolve issues associated with morphology and syntax, the
semantic understanding (i.e., the meaning) of text remains a challenge. The nature of
meaning is still an unsettled philosophical question (Manin, 2008). In fact, there is no
universally agreed-upon meaning of the word meaning. Brodbeck (1968) describes four


distinct meanings for the term meaning: referential, significance or lawfulness,
intentional, and psychological. Even when referential, the articulated meaning of a
concept can be generalized, indirect, or otherwise vague. S.C. Levinson (2000, p. 29)
explains this by noting that, with respect to human effort, “inference is cheap, articulation
expensive.” For example, an individual who says I am going to the city has avoided
explicitly articulating their destination, and has left it up to the listener to infer the name
of the city to which they are going.
Semantic tools, such as thesauri and ontologies, provide information about the
meaning and sense of words. The semantic context determines dependencies (e.g., “part
of” or “is a” types of relationships) and the sense intended for a particular word (Miller,
1995; Miller, Beckwith, Fellbaum, Gross, & Miller, 1990), i.e., it is used to disambiguate
the word. While POS taggers can determine when a given word is used as a noun or as a
verb, they cannot be used by themselves to disambiguate the sense of a word for a given
part of speech. For example, it is necessary to disambiguate the noun bank in order to
understand if we met at the bank refers to a gathering at a financial institution or at the
sloping land adjacent to a body of water.
2.2.3.1. Ambiguity
Ambiguity is the inability to discern the sense of a polysemous word (i.e., a word
with two or more meanings) in a specific context. Word sense disambiguation (WSD)
attempts to determine the correct sense of a word (Schütze, 1998). Using a POS tagger by
itself can help to disambiguate a word based upon the part of speech (e.g., see the
noun/verb example for the word pool above in the section on Syntactics). However, the
tagger cannot disambiguate when the various senses have the same part of speech. For


example, is Monty Hall the name of a person, or the name of an auditorium? One
approach to performing WSD is to combine multiple methods. McRoy (1992) combines
POS tagging with lexical knowledge sources covering morphology, collocations, and word
associations. She assigns weights to the senses provided by each knowledge source to
determine the correct sense. She tests the system on a 25,000-word corpus from the Wall
Street Journal and is able to disambiguate 98% of all non-proper nouns and non-
abbreviated words.
Word sense clustering (Pereira, Tishby, & Lee, 1993; Purandare & Pedersen, 2004;
Schütze, 1998) has been used to perform WSD. Word sense clustering uses untagged text
and clusters instances of a word based only on the instances’ mutual contextual
similarities. Contextual similarity is based upon surface lexical features. Each word is
represented by a context vector in a high dimensional word space. Disambiguation is
accomplished by mapping the ambiguous word to the word space and then selecting the
cluster whose centroid is closest.
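
A compact sketch of this procedure, using scikit-learn's k-means over bag-of-words context vectors; the toy contexts for the ambiguous word bank and the cluster count are assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Contexts of the ambiguous word "bank" (toy corpus; senses unknown to the model).
contexts = [
    "deposited money at the bank before noon",
    "the bank approved the loan application",
    "fished from the muddy bank of the river",
    "the river bank eroded after the flood",
]

# Represent each instance by a context vector in word space.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(contexts)

# Cluster the instances; each cluster is taken as one induced sense.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Disambiguate a new occurrence by mapping it into the same word space
# and choosing the cluster with the closest centroid.
new = vectorizer.transform(["we met at the bank to sign the loan"])
print(kmeans.predict(new))   # index of the induced "financial" sense cluster
```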
WSD may be aided by two observations about word senses. Gale, Church, and
Yarowsky (1992, p. 236) observe that polysemous words that appear multiple times “in a
well-written coherent discourse” will all share one sense per discourse. Yarowsky (1993)
takes this further and observes that words have one sense per collocation (defining
collocation as the “co-occurrence of two words in some defined relationship”). Krovetz
(1998) and Martinez and Agirre (2000) have made contrary observations. They each
report that multiple senses occur more than 30% of the time in a corpus. However,
Martinez and Agirre also report that one sense per collocation does appear to hold within
a given corpus, but that the collocations vary across corpora.


2.2.3.2. Negation and Assertion
Even when presented with unambiguous word senses, the text surrounding a word
or phrase can alter the assertion being made about the word or phrase (e.g., compare the
assertions in I am hungry. and I am not hungry.). Assertions can be divided into positive
assertions, negative assertions, and uncertain assertions (Chapman, Bridewell, Hanbury,
Cooper, & Buchanan, 2001). Understanding the type of assertion made in a statement is
important in interpreting the meaning of the statement.
According to Horn and Katō (2000, p. 1), “negative utterances are a core feature of
every system of human communication.” However, negation is considered to be a
troublesome aspect of medical NLP (Friedman, Alderson, Austin, Cimino, & Johnson,
1994). The ability to decode negation is important in understanding medical narratives
(Chapman et al., 2001). In the narrative of medical reports, physicians often state that a
given condition is absent or that a disease can be ruled out (Mutalik, Deshpande, &
Nadkarni, 2001). The simple detection that a condition or disease is mentioned in a
medical report is not sufficient to state that the condition or disease is present in the
patient.
In addition to simple negation phrases, such as patient denies pneumonia or
unremarkable two views of the chest without focal pneumonia, where it is obvious that a
given diagnoses has been eliminated, the negation phrase may sometimes be at a distance
from the diagnosis which is being negated (Chapman et al., 2001). For example, in the
sentence No fevers, chills, sweats, nausea, vomiting, diarrhea, chest pain and only mild
shortness of breath. the word no applies to everything in the sentence except for mild
shortness of breath. While the prior example contained hints to determine the scope, such


hints do not appear in all instances. Therefore, the scope of a negation may not always be
evident. For example, in the short sentence No cough, fever., is fever present or absent?
Assertions made in medical reports are used to form a diagnosis, and are frequently
divided into positive assertions, negative assertions, and uncertain assertions. Positive
assertions are those that are believed to be true and, for example, indicate that a disease is
present in the patient. Negative assertions are those diagnoses that have been eliminated
as potential problems in the patient, while uncertain assertions are those that have not
been eliminated, but are not definitively present in the patient. Medical professionals will
often assert negative or uncertain diagnoses in patient records (Rao, Sandilya, Niculescu,
Germond, & Rao, 2003). These negative and uncertain assertions are used to provide
information that contrasts with the positively asserted diagnoses (Kim & Park, 2006), as
well as to catalog all of the diagnoses that have been considered in order to maintain a
record reflecting the thought process used to arrive at the final diagnosis.
Negative and uncertain assertions in text can adversely affect the performance of
NLP systems. For example, Xu and Croft (2000, p. 93), while investigating the impact of
local context on information retrieval, referred to “negative indicators of relevance” as
leading to the retrieval of incorrect concepts when a negative phrase was included in a
query. Rao et al. (2003) note that the information contained in the negative phrases of a
diagnosis is lost when using a bag-of-words approach to mine medical text, which then
results in negated items being erroneously considered as if they were present in the
patient.
Several research efforts focus on identifying and making use of uncertain and
negative assertions in text. Mutalik et al. (2001) use NegFinder (look-ahead,
left-recursive (1) grammar) to show that the Unified Medical Language System (UMLS) can
be used to reliably detect negated concepts in medical narratives. They examine full
negations and total absence of a concept, and find that negations generally occurred in
close proximity to the target. They conclude (page 598) that “[n]egation of most concepts
in medical narrative can be reliably detected by a simple strategy. The reliability of
detection depends on several factors, the most important being the accuracy of concept
matching.” Sibanda (2006) extends NegEx (Chapman et al., 2001), a series of regular
expressions that both precede and follow target words, in order to identify not just
positive, negative, and uncertain assertions, but also assertions made in reference to
someone other than the patient. Patrick et al. (2007) use an atomic parse approach,
finding that negations do not often cross sentence boundaries. Harkema, Dowling,
Thornblade, and Chapman (2009) also extend NegEx to identify assertions that are
historical (i.e., reports of prior conditions in the patient).
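
The flavor of these trigger-and-scope approaches can be seen in the simplified sketch below. It handles only triggers that precede a target, and the trigger list and five-token window are abbreviated assumptions, not NegEx's published phrase set or scope rules.

```python
import re

# NegEx-flavored sketch (after Chapman et al., 2001): a trigger phrase that
# precedes a target concept opens a short negation window.
TRIGGERS = ["no", "denies", "without", "ruled out"]
WINDOW = 5  # maximum number of tokens between a trigger and its target

def assertion(sentence, concept):
    """Classify a concept mention as negative, positive, or unmentioned."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    target = concept.lower().split()
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            window = " " + " ".join(tokens[max(0, i - WINDOW):i]) + " "
            if any(" %s " % trig in window for trig in TRIGGERS):
                return "negative"
            return "positive"
    return "unmentioned"

print(assertion("Patient denies pneumonia.", "pneumonia"))                  # negative
print(assertion("No fevers, chills, sweats, or chest pain.", "chest pain")) # negative
print(assertion("Only mild shortness of breath.", "shortness of breath"))   # positive
```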
The use of negation has relevance to real-world applications. This can be illustrated
by examining the role of negation in the assignment of standardized diagnostic codes to
medical reports. For example, the 9th Revision of the International Classification of
Diseases (ICD-9) is a standardized list of diagnostic classifications (World Health
Organization, 2009) and is used to code clinical information, billing charges (Puckett,
1986), and epidemiological records (Tsui, Wagner, Dato, & Chang, 2002). ICD-9 codes
can also assist in the development of patient problem lists (Bui et al., 2004). ICD-9-CM is
the Clinical Modification to ICD-9 and is the standard used in the United States (National
Center for Health Statistics, 2007).


Creating ICD-9-CM coding systems requires recognizing the disease(s) and
symptom(s) positively asserted in the medical narratives to be present in the patient
(Lussier, Shagina, & Friedman, 2000). Over-coding reports, i.e., assignment of
unnecessary ICD-9-CM codes to a report, is a discouraged practice (Pedersen, Pakhomov,
Patwardhan, & Chute, 2007; Pestian et al., 2007). Therefore, while physicians often assert
uncertain or negative diagnoses in patient records (Rao et al., 2003), either to provide
information that contrasts with the positive diagnoses (Kim & Park, 2006) or to keep
track of all potential diagnoses that have been considered or dismissed, these uncertain or
negative diagnoses should not be coded (Tulane University, 2006). The presence of
negative and uncertain assertions in patient records complicates automatic ICD-9-CM
coding.
2.2.3.3. Synonymy
“[T]he complexity, variability, and richness of natural language, rather ironically,
leads to ambiguity” (Cleveland & Cleveland, 2001, p. 35) and can frustrate users’ efforts
in information retrieval. In medical reports, synonymous concepts can be represented by
semantically-equivalent but lexically-disparate terms (i.e., multiple words or phrases that
have the same essential sense or meaning). For example, incontinence and enuresis both
refer to the same medical condition, and when encountered need to be treated in the same
manner. Informal terms, such as slang (e.g., positive reactor for positive reaction to PPD)
and abbreviations (e.g., UTI or U. T. I. for urinary tract infection) also need to be
considered as synonyms of the more formal terms. Determining all possible alternate
terms for a concept can be challenging.
The presence of synonymous terms in medical reports masks the semantic similarity
between these terms (Cimino et al., 1994) and can hinder applications that rely on
morphologic tools in order to make decisions about documents, e.g., can lead to errors in
classification, question answering, and information retrieval. Bringing synonyms together
under a standardized preferred term can resolve the lexical disparity between semantically-equivalent terms and help reduce the errors that disparity causes.
A common method of resolving lexical disparity in free text is the use of a
controlled vocabulary (Brennan & Aronson, 2003). A controlled vocabulary is a
predefined list of authorized terms, each of which describes a single concept (or, when used in combination, a more specific single concept), thereby reducing or, ideally, eliminating the ambiguity of those concepts (National Information Standards Organization
(U.S.) & American National Standards Institute, 2006). In a controlled vocabulary, one of
the lexically-disparate but semantically-equivalent terms is designated as the preferred
term. The preferred term gathers lexically-disparate but semantically-equivalent terms
under one term and provides a consistent means of referring to a given concept.
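As a minimal illustration of preferred-term substitution (the synonym-to-preferred-term pairs below are hypothetical examples, not entries from any particular controlled vocabulary):

    import re

    # Hypothetical mapping of lexically-disparate terms to preferred terms.
    PREFERRED_TERMS = {
        "enuresis": "incontinence",
        "uti": "urinary tract infection",
        "positive reactor": "positive reaction to PPD",
    }

    def apply_preferred_terms(text):
        """Replace each known synonym with its designated preferred term."""
        for synonym, preferred in PREFERRED_TERMS.items():
            text = re.sub(r"\b" + re.escape(synonym) + r"\b", preferred,
                          text, flags=re.IGNORECASE)
        return text

    print(apply_preferred_terms("History of enuresis and a prior UTI."))
    # -> "History of incontinence and a prior urinary tract infection."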
The literature is replete with examples of problems related to lexical disparity.
While investigating the use of free text for syndromic surveillance, Shapiro (2004)
explores issues relating to word variation in chief-complaint data. He reports challenges
caused by lexical disparity. He observes (page 95) that “a single symptom can be
described in multiple ways by using synonyms and paraphrases.” Similarly, Aronow,
Fangfang, and Croft (1999) report issues of unpredictable data quality, including
problems with inconsistent use of terminology, when attempting to classify radiological
reports.
The extensive use of controlled vocabularies has been reported in the medical
informatics literature. Cimino et al. (1994) note that representing concepts through a
controlled vocabulary is a “fundamental requirement” in medical informatics. Controlled
vocabularies can either be hand-built for a specific task or employ standardized resources.
Goldstein et al. (2007) develop their own controlled vocabulary. They gather
semantically-equivalent disease names and descriptions under hand-selected preferred
terms and substitute each instance with its preferred term. They show that this substitution improves the true
positives in ICD-9-CM coding of narratives of radiological reports by a hand-crafted rule-
based system. The UMLS Metathesaurus (Aronson, 2001; Nadkarni, Chen, & Brandt,
2001) is often used as a controlled vocabulary for medical reports. MetaMap (Aronson,
2001) uses shallow parsing to identify phrases and maps them to concepts from the
UMLS Metathesaurus.
The application of controlled vocabularies to clinical records can aid many tasks.
Delbecque et al. (2005) show promising results when mapping noun phrases to UMLS
semantic types in a question-answering task. Brennan and Aronson (2003) successfully
apply MetaMap to the task of identifying medical concepts in free text messages written
by lay (non-medical) people. They use MetaMap to map the text of patient e-mails to
UMLS concepts. They focus their investigation on the Nursing vocabularies within the
UMLS Metathesaurus. They compare the ratio of the number of concepts identified to the
number of phrases in each e-mail for each of the vocabularies. They find the best results
when using a combination of UMLS Metathesaurus vocabularies.
Cases have also been reported where controlled vocabularies do not improve system
performance. Ruch (2006) combines a pattern matcher and a vector space retrieval engine
to classify MEDLINE abstracts. He finds only a modest (+0.2%) improvement in his
system when using Medical Subject Headings (MeSH) for vocabulary control. In a review
of the state of the art of term identification, Krauthammer and Nenadic (2004, p. 524)
found that “[r]elying exclusively on existing controlled vocabularies to identify
terminology in text suffers from both low recall and low precision, as such resources are
insufficient for automatic terminology mining.” Passos and Wainer (2009) apply
WordNet (Miller et al., 1990) as a controlled vocabulary to a document clustering task.
They hypothesize that WordNet as a controlled vocabulary would mitigate issues of
synonymy. They find that measures of word similarity computed using the relationship
between words on WordNet do not improve document clustering on their data.
2.3. Automated Classification
The task of automated classification entails the training, or the creation, of a
classifier that assigns class labels to instances in the data. Automated classification forms
the basis for systems that can be used for diverse purposes including spam filtering
(Blanzieri & Bryl, 2008), authorship attribution (Diederich, Kindermann, Leopold, &
Paass, 2003), web page classification (Blum & Mitchell, 1998), and the assignment of
medical billing codes to patient records (Farkas & Szarvas, 2008). Systems have been
based on hand-crafted rules, on machine learning algorithms, and on a combination of
both approaches (Guzella & Caminhas, 2009; Takahashi, Takamura, & Okumura, 2005).
Hand-crafted rules and machine learning algorithms each have their own strengths and
weaknesses. Systems based on hand-crafted rules benefit from expert knowledge of the
domain but require considerable effort to retrain. Machine learning systems, by contrast, can usually be retrained with minimal effort but may not perform as well as hand-crafted rule-based systems (Ben-David & Frank, 2009).
Compared with automated classification, manual classification performed by humans is typically based upon Aristotle’s hierarchical theory of categories, which brings items with common characteristics together into the same grouping (Ludwig Wittgenstein, as cited by Lakoff, 1987; Taylor, 2004). This approach
has sufficed when dealing with a relatively small number of items that could be placed
into relatively broad and well-defined categories (e.g., books into U.S. Library of
Congress Subject Headings). On the one hand, manual classification can be labor
intensive, costly (Medelyan & Witten, 2008; Takahashi et al., 2005), and can suffer from
issues of consistency in class assignments (Leininger, 2000; Olson & Wolfram, 2008).
Automated classification, on the other hand, can reduce the labor and labor-related costs associated with manual classification while providing consistent class assignments.
2.3.1. Hand-Crafted Rules
The use of expert knowledge to manually develop automated classification systems
appears to be a fruitful approach to classifying medical free text. Wilcox and Hripcsak
(2003, p. 330) report that “[f]or medical text report classification, expert knowledge
acquisition is more significant to performance and more cost-effective to obtain than
knowledge discovery. Building classifiers should therefore focus more on acquiring
knowledge from experts than trying to learn this knowledge inductively.” In the same
article, Wilcox and Hripcsak (2003, p. 336) report having also considered machine
learning rule-based (decision trees and rule induction), instance-based (nearest neighbor),
and probabilistic (naïve Bayes) algorithms; they conclude that “no inductive learning
performed as well as physicians or expert-written rules.”
Goldstein et al. (2007) compare three systems for automatically predicting the
ICD-9-CM codes of radiological reports (which is expanded upon in Chapter 4) and
report that their hand-crafted rule-based system outperforms the algorithmically more
complex BoosTexter, a machine learning algorithm. They also report that the rule-based
system outperforms a bag-of-words term frequency - inverse document frequency vector
space model based upon Apache Lucene. A rule-based approach can also enhance
machine learning approaches; Mendonca et al. (2005) successfully add rules to the
MedLee system to find information about neonatal pneumonia.
The primary deficiency of a hand-crafted rule-based system is its static nature
(Hughes, Gose, & Roseman, 1990). Retraining a hand-crafted rule-based system is a
manual, and potentially costly, process. As new information arises that requires an update
to a natural language processing system, extensive human involvement (e.g., analysis and
computer programming) is necessary in order to accommodate the new information (Ben-
David & Frank, 2009). Even when the effort is made to maintain such systems, it is not
without long-term problems. Clancey (1983) reports on the MYCIN program and finds
that updating the rule set is a difficult task for anyone other than the original rule authors.
Studer et al. (1998), in a review of the field of Knowledge Engineering, observe that rule-
based systems become hard to maintain over long periods of time.
2.3.2. Supervised Machine Learning
Machine learning (ML) approaches to NLP are often preferred to hand-crafted rules
for several reasons. First, while hand-crafted rules may be based on intuition and
experience (Paek & Pieraccini, 2008), ML approaches benefit from being able to detect
subtle interactions between features that are important to classifying data, but may not be
discernable to humans who build hand-crafted rules. These patterns are useful in
understanding classification only if the selected attributes and patterns are meaningful to
humans (Michalski & Stepp, 1983). The literature also reports that ML approaches can
excel when presented with a high-dimensional feature space in the data (Apté, Weiss, &
Grout, 1994). Lastly, due to the reasons noted above, the cost of retraining a ML classifier
will typically be lower than the costs associated with updating hand-crafted rules (Ben-
David & Frank, 2009).
Approaches to machine learning include k-nearest neighbor (k-NN), vector space,
support vector machine, and Bayesian algorithms. ML approaches can be used
individually or as an ensemble. Depending upon the algorithm, machine learning
classifiers are trained using labeled data, non-labeled data, or a mix of labeled and
non-labeled data.
2.3.2.1. K-Nearest Neighbor
K-NN (Cover & Hart, 1967; Fix & Hodges, 1952) is a lazy machine learning
algorithm in which attributes of the training data are mapped to an n-dimensional space.
The algorithm is considered lazy since, for each sample in the training set, it simply stores
a feature vector that represents that sample, and delays all computation until presented
with a test sample to classify. To assign a class label to an unclassified test sample, the
attributes of the test sample are mapped to the same n-dimensional space as used by the
training data. The k closest samples (nearest neighbors) from the training set are
retrieved. The unclassified test sample is then assigned a class label based upon a vote of
the k retrieved training samples. The number of neighbors (k) is a positive integer, and is
usually an odd number to avoid ties in the voting. Retraining simply entails mapping
additional annotated training data to the n-dimensional space.
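A minimal sketch of k-NN classification in Python, using Euclidean distance and unweighted voting (the feature vectors and labels are hypothetical toy data):

    import math
    from collections import Counter

    def knn_classify(train, test_point, k=3):
        """Assign a label to `test_point` by a majority vote of its k
        nearest training samples; `train` is a list of (vector, label)."""
        nearest = sorted(train, key=lambda s: math.dist(s[0], test_point))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    train = [((0.0, 0.0), "normal"), ((0.1, 0.2), "normal"),
             ((1.0, 1.1), "fracture"), ((0.9, 1.0), "fracture")]
    print(knn_classify(train, (1.0, 1.0)))  # -> "fracture"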
An information retrieval bag-of-words approach can be considered a type of k-NN,
using words as attributes (i.e., neither syntax nor semantics are considered). This
approach is similar to k-NN in that it employs a vector space model (Salton & Buckley,
1988). However, it differs from k-NN in that each word (attribute) is weighted by term
frequency - inverse document frequency (tf•idf), which looks at the similarity between
search terms and documents. The term frequency (tf) is measured by the frequency of a
term in a given document, normalized for document length by dividing it by the total
number of words in the document. Document frequency (df) is the count of those
documents in a corpus that contain any occurrences of the search term. Inverse document
frequency (idf) is, in its simplest form, 1/df (in practice it is often log-scaled). When combined, tf•idf expresses in a single measure how
well the search term describes each of the documents (tf), and how common the search
term is in the corpus (df). tf•idf also provides a means for ranking a set of retrieved
documents. The highest ranked retrieved documents (i.e., those with the highest tf•idf)
can be used to inform the coding of the target document (Tzoukermann et al., 2003).
Retraining simply requires adding newly categorized documents to the stored corpus, and
removing changed or superseded documents from the stored corpus.
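A minimal sketch of the tf•idf weighting just described, using the simple 1/df form of idf given above (the toy corpus is hypothetical):

    def tf_idf(term, document, corpus):
        """tf: term count normalized by document length;
        idf: 1/df, where df counts the corpus documents containing the term."""
        words = document.lower().split()
        tf = words.count(term) / len(words)
        df = sum(1 for doc in corpus if term in doc.lower().split())
        return tf / df if df else 0.0

    corpus = ["chest pain ruled out",
              "no acute chest findings",
              "fracture of the left ankle"]
    print(tf_idf("chest", corpus[0], corpus))  # 0.25 * (1/2) = 0.125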
2.3.2.2. Support Vector Machines
Support Vector Machines (SVM) (Burges, 1998; Vapnik, 1995) are supervised machine learning algorithms that allow linear classification techniques to be applied to data that are not linearly separable. This is done by mapping the data
into a multi-dimensional space such that a hyper-plane is able to separate one class of
data from another. Targets are then classified based upon the side of the hyper-plane upon
which they fall in the multi-dimensional space. The support vectors refer to those training
samples that are used to define the hyper-plane. Interestingly, an SVM trained only on the
support vector training samples would yield the same hyper-plane as an SVM that is
trained on all of the training samples. Retraining simply requires a remapping and
transformation of all of the training data into a (potentially) new multi-dimensional space
in order to find the new separating hyper-plane.
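As a brief illustration, an off-the-shelf library such as scikit-learn (an assumed toolkit; the dissertation does not prescribe one) can train an SVM in a few lines, with the kernel performing the implicit mapping into a higher-dimensional space. The data are hypothetical:

    from sklearn.svm import SVC

    # Toy training data: two numeric features per sample.
    X_train = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
    y_train = ["negative", "negative", "positive", "positive"]

    # The RBF kernel implicitly maps samples into a higher-dimensional
    # space in which a separating hyper-plane can be found.
    clf = SVC(kernel="rbf").fit(X_train, y_train)
    print(clf.predict([[0.95, 1.0]]))  # -> ['positive']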
2.3.2.3. Decision Trees
Decision trees (Quinlan, 1986, 1993) construct directed acyclic graphs (i.e., there is
only one path between any two nodes) to predict the label of an unclassified test sample.
The internal nodes of the tree test an attribute, and the branches leaving a node represent that attribute’s possible values.
The leaves represent the class labels. The path between two nodes follows a one-way
direction from the parent node to the child node. In machine learning, a tree is constructed
(i.e., a classification rule is generated) by recursively selecting an attribute and splitting
the training data into subsets based upon the values of that attribute. The recursion ends
when all of the members of the subset have the same class label or when the subset can
no longer be divided. The structure of the tree will be determined by the algorithm used to
split the training data.
Decision tree construction varies based upon the algorithm used to determine the
best split for each subgroup of the training data. Approaches such as Quinlan’s (1986;
1993) ID3 and C4.5 use information gain to determine the best split. Information gain
measures the change in information entropy (Shannon & Weaver, 1949) for a given split.
Classification and Regression Trees (CART) (Breiman, Friedman, Olshen, & Stone,
1984) uses the Gini index to determine the best split. The Gini index is calculated by
taking the area between a Lorenz curve and the diagonal representing the line of perfect
equality, and dividing it by the half-square area in which the Lorenz curve lies (Weiner &
Solbrig, 1984).
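A minimal sketch of the information-gain computation used by ID3 and C4.5 to choose a split (the labels and the candidate split are hypothetical):

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of class labels, in bits."""
        total = len(labels)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(labels).values())

    def information_gain(labels, split):
        """Entropy reduction from partitioning `labels` into the
        subsets produced by a candidate split."""
        total = len(labels)
        remainder = sum(len(s) / total * entropy(s) for s in split)
        return entropy(labels) - remainder

    labels = ["pos", "pos", "neg", "neg"]
    # A hypothetical binary attribute that separates the classes
    # perfectly yields the maximum possible gain of 1 bit.
    print(information_gain(labels, [["pos", "pos"], ["neg", "neg"]]))  # 1.0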
2.3.2.4. Naïve Bayes
Bayes Theorem (Bayes 1763, as cited by Pearl, 1988):

\[ P(A \mid C) = \frac{P(C \mid A)\,P(A)}{P(C)} \]

states that the conditional (posterior) probability of A, given knowledge of C, relates to
the conditional probability of C given A, and the marginal probabilities of A and C. This
results in a belief function: once we have observed C, we update our belief about observing the outcome of A. Bayesian classifiers (Duda & Hart, 1973) use
the conditional probability of each attribute given the class, and the marginal probabilities
of the attribute and the class, in order to predict the class label given a set of attributes for
an unclassified test sample. Naïve Bayes classifiers make the assumption that the state of
each attribute is independent of the state of all of the other attributes. Even though this
assumption is often not true in real-world data, Naïve Bayes’ results tend to be
competitive with other machine learning approaches (John & Langley, 1995). Since
Bayesian classifiers use correlation, and not causation, their results are independent of the
order in which training samples are provided.
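A minimal sketch of a Naïve Bayes text classifier (the toy documents are hypothetical; Laplace smoothing is added, a standard practice not discussed above, to avoid zero probabilities):

    import math
    from collections import Counter, defaultdict

    def train_nb(docs):
        """`docs` is a list of (word_list, label) pairs."""
        priors = Counter(label for _, label in docs)
        likelihoods = defaultdict(Counter)
        vocab = set()
        for words, label in docs:
            likelihoods[label].update(words)
            vocab.update(words)
        return priors, likelihoods, vocab, len(docs)

    def classify_nb(words, model):
        priors, likelihoods, vocab, n = model
        scores = {}
        for label, count in priors.items():
            total = sum(likelihoods[label].values())
            score = math.log(count / n)  # log prior
            for w in words:
                # Laplace-smoothed conditional probability of each word,
                # assumed independent of all other words given the class.
                score += math.log((likelihoods[label][w] + 1)
                                  / (total + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

    model = train_nb([(["no", "fracture"], "negative"),
                      (["fracture", "present"], "positive")])
    print(classify_nb(["fracture", "present"], model))  # -> "positive"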
2.3.3. Unsupervised Machine Learning
The above ML approaches all use labeled training data, and are referred to as
supervised learning methods. The drawback of labeled data is that it can be
tedious and expensive to develop (Zhong, 2005) and may not always be readily available
(Zhou, 2009). Unsupervised methods of machine learning, such as SenseClusters
(Purandare & Pedersen, 2004), do not use labeled data, but develop categories from the
content of the training set. As such, the resultant categories will not necessarily align with
pre-defined classification schemes (e.g., ICD-9, Library of Congress Subject Headings).
However, these categories may lead to the discovery of previously unknown relationships
in the data. Unsupervised approaches (e.g., context free grammars) can be useful in
creating the scaffolding for building automated computerized classification systems.
2.3.4. Semi-supervised Machine Learning
Semi-supervised methods of machine learning (Blum & Mitchell, 1998; Mitchell,
1999; Nigam, McCallum, Thrun, & Mitchell, 2000) utilize unlabeled data to improve the
results from supervised machine learning that has been trained with a small labeled
training set. Semi-supervised machine learning predicts class labels based upon the set of
classes found in the labeled data. Nigam et al. (2000) examine the effect of unlabeled data
on supervised classification and find that the addition of unlabeled data can reduce
classification errors by up to 30%. Co-training (Blum & Mitchell, 1998) is a semi-
supervised approach that uses two conditionally independent views of the data.
Co-training iteratively trains classifiers, alternating between each of the two views of the
data. At each iteration, the classifier adds to the set of labeled data the newly predicted sample about whose label it is most confident. The samples labeled from one view are then used in an attempt to improve the learned classifier for the other view.
Both Nigam and Ghani (2000) and J. Chan et al. (2004) explore the requirements of
redundant sufficiency and conditional independence. They both observe that co-training
appears to benefit from redundancy within a feature set. They both examine co-training
using a random split of features, rather than natural split of features. J. Chan et al. (2004,
p. 589) find that “co-training using a random split of all the features was just as
competitive as, and often outperformed, co-training with the natural feature sets.”
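A minimal sketch of the co-training loop described above, assuming scikit-learn's GaussianNB as the base learner (any classifier exposing fit/predict_proba would serve); the one-sample-per-view addition and the fixed round count are simplifications of the published algorithm:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def co_train(X1, X2, y, U1, U2, rounds=5):
        """X1, X2: the two views of the labeled samples; y: their labels;
        U1, U2: the two views of the same pool of unlabeled samples."""
        X1, X2, y = list(X1), list(X2), list(y)
        pool = list(range(len(U1)))  # indices of still-unlabeled samples
        for _ in range(rounds):
            for view_X, view_U in ((X1, U1), (X2, U2)):
                if not pool:
                    break
                clf = GaussianNB().fit(view_X, y)
                probs = clf.predict_proba([view_U[i] for i in pool])
                best = int(np.argmax(probs.max(axis=1)))  # most confident
                idx = pool.pop(best)
                label = clf.classes_[int(np.argmax(probs[best]))]
                # A label predicted from one view augments the shared
                # labeled set, improving the other view's classifier.
                X1.append(U1[idx]); X2.append(U2[idx]); y.append(label)
        return GaussianNB().fit(X1, y), GaussianNB().fit(X2, y)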
2.3.5. Combining Classifiers
Systems that combine classifiers typically outperform individual classifiers
(Daskalakis et al., 2008; Duin & Tax, 2000; Eom, Kim, & Zhang, 2008). The selection of
classifiers and the method for combining classifiers is dependent upon the characteristics
of the data to be classified. Duin and Tax (2000, p. 28) report that “there is no overall
winning combining rule” for classifiers. Classifiers can be combined as ensembles of
different classification algorithms working with the same training set, or as multiple
iterations of a given algorithm, each of which works on a different subset of the training
data (Kittler, Hatef, Duin, & Matas, 1998). Voting is an example of the former, boosting
an example of the latter, and stacking blends both approaches.
Voting considers the predictions from several individual classifiers and assigns a
class label based upon a combination rule. Majority rule is a consensus approach; the
class label with the most votes is assigned to the test sample. Given a probability
distribution of each of the individual classifiers, each classifier’s prediction can be
weighted before voting. When the label to be assigned is continuous, the weighted votes may be combined using a median rule, which avoids errors caused by outliers (Kittler et al., 1998).
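A minimal sketch of majority and weighted voting over the predictions of several individual classifiers (the predictions and weights are hypothetical):

    from collections import defaultdict

    def weighted_vote(predictions, weights=None):
        """Combine class labels predicted by several classifiers.
        With no weights this reduces to simple majority rule."""
        weights = weights or [1.0] * len(predictions)
        tally = defaultdict(float)
        for label, weight in zip(predictions, weights):
            tally[label] += weight
        return max(tally, key=tally.get)

    preds = ["asthma", "asthma", "pneumonia"]
    print(weighted_vote(preds))                   # majority -> asthma
    print(weighted_vote(preds, [0.2, 0.2, 0.9]))  # weighted -> pneumonia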
Boosting (Freund, 1995; Freund & Schapire, 1996; Schapire, 1990; Schapire & Singer, 2000) is a supervised machine learning technique for improving the performance of weak learning systems. A generic boosting algorithm performs multiple iterations of
the following steps: 1) it divides the data into subsamples and 2) it trains a weak learner,
i.e., a classifier that performs marginally better than chance, for a set of subsamples. At
the end of each iteration the algorithm assigns more weight to the samples that were
misclassified in the previous iterations, increasing the probability that those samples will
be trained on by the next weak learner. At the end of a predetermined number of
iterations, which is usually developed empirically, the weak learners are combined into
the final boosted classifier, which usually performs better than any of the individual weak
learners. Unclassified test samples are then classified by applying the set of rules
developed by the classifier to the test samples. Retraining requires that the new training
data are processed by the boosting algorithm, which then creates a new boosted classifier.
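A minimal sketch of the generic reweighting loop just described, in the style of AdaBoost with decision stumps as the weak learner (the use of scikit-learn and of +1/-1 binary labels are illustrative assumptions):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def boost(X, y, rounds=10):
        """Train `rounds` decision stumps, increasing the weight of
        misclassified samples after each round (binary labels of +1/-1)."""
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        w = np.full(len(y), 1.0 / len(y))  # uniform initial weights
        learners = []
        for _ in range(rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = max(w[pred != y].sum(), 1e-10)
            if err >= 0.5:  # no longer better than chance; stop
                break
            alpha = 0.5 * np.log((1 - err) / err)
            w *= np.exp(-alpha * y * pred)  # up-weight the mistakes
            w /= w.sum()
            learners.append((alpha, stump))
        return learners

    def boosted_predict(learners, X):
        """The final classifier is a weighted vote of the weak learners."""
        return np.sign(sum(a * s.predict(np.asarray(X, dtype=float))
                           for a, s in learners))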
Adaboost (Freund & Schapire, 1996) is an early implementation of the boosting
algorithm that is able to boost numerous base learners. BoosTexter (Schapire & Singer,
2000) boosts decision stumps (a single level decision tree) and can perform multi-label
classification. MultiBoosting (Webb, 2000) extends Adaboost by using a C4.5 decision
tree as the base learner and applies wagging to improve performance. Wagging assigns
random weights to randomly drawn bootstrap replicas of the training set.
Stacking (Wolpert, 1992) trains a meta-classifier that accepts as its input the
predictions of the individual classifiers. The meta-classifier attempts to minimize the
classification errors by working out the biases of the individual classifiers on subsets of
the training set. Stacking trains the individual classifiers on k-fold cross-validation (rotation) training sets, and uses their predictions on the held-out validation sets to train the meta-classifier. Stacking
can be applied to multiple classification algorithms or to a single classification algorithm.
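A minimal sketch of stacking with scikit-learn-style estimators (the base learners, the logistic-regression meta-classifier, and the assumption of numeric class labels are illustrative choices):

    import numpy as np
    from sklearn.model_selection import cross_val_predict
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression

    def stack_fit(X, y, base_learners, k=5):
        """Train a meta-classifier on the out-of-fold predictions of the
        base learners, then refit the base learners on all the data.
        Assumes numeric class labels."""
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        # Out-of-fold predictions keep the meta-classifier from seeing
        # labels its inputs were trained on (working out the biases).
        meta_X = np.column_stack(
            [cross_val_predict(clf, X, y, cv=k) for clf in base_learners])
        meta = LogisticRegression().fit(meta_X, y)
        for clf in base_learners:
            clf.fit(X, y)
        return base_learners, meta

    def stack_predict(base_learners, meta, X):
        X = np.asarray(X, dtype=float)
        meta_X = np.column_stack([clf.predict(X) for clf in base_learners])
        return meta.predict(meta_X)

    # Example wiring (data omitted):
    # models, meta = stack_fit(X, y, [GaussianNB(), DecisionTreeClassifier()])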
Li and Zhou (2007) extend the co-training approach by applying Breiman’s (2001)
random forest method of combining classifiers. They examine three medical diagnosis
data sets in detail. They find that their approach is able to improve the performance of
their system using a small number of diagnosed samples enhanced with undiagnosed
samples. Hai-Tao, Xiao-Nan, Fei-Teng, ChunHui, & Jian-Min (2009) also extend
co-training and employ Breiman’s (1996) bagging approach to combining classifiers.
They apply their approach to a computer network traffic classification task. They find that
their combined classifier approach achieves higher accuracy than single classifier
approaches.
2.4. Challenges and Limitations in the Literature
Conflicts and questions of generalizability appear in the literature. There are
conflicting reports in the literature regarding the value of applying a controlled
vocabulary as a means to improve system performance in the presence of lexical disparity
(i.e., synonyms). Passos and Wainer (2009) apply WordNet as a controlled vocabulary to
a document clustering task. They assess the usefulness of WordNet to aid document
clustering. They find that measures of word similarity computed using the relationship
between words on WordNet do not improve document clustering on their data. Hersh,
Price, and Donohoe (2000) apply the UMLS Metathesaurus as their controlled vocabulary
to expand queries for an information retrieval task. They query MEDLINE using the
SMART retrieval system. They find that the use of the UMLS Metathesaurus for query
expansion generally causes a decline in retrieval performance.
Compare the above reports with de Buenaga Rodríguez et al. (2000) and Brennan
and Aronson (2003). de Buenaga Rodríguez et al. (2000) apply WordNet to a text
categorization task. They explore the “semantic closeness” between terms, synonyms of a
term, and categories, and then develop weights for the terms in the training set. They find
improved results when the training data were enhanced with WordNet, compared with approaches based on training data alone. Brennan and Aronson (2003) apply MetaMap to the task of
identifying medical concepts in free text messages written by lay people. They use
MetaMap to map the text of patient e-mails to UMLS concepts. They find that improvement occurs when using a combination of UMLS Metathesaurus vocabularies.
While radiological reports are a common source of medical narratives investigated
by NLP, the reports in the literature often show that the investigations drew upon a
limited number of categories. Friedman et al. (1994) evaluate four diseases found in
radiological reports. They parse the reports, normalize terms, resolve synonymy, and lastly map
terms to a controlled vocabulary. Hripcsak and Friedman (1995) and Wilcox and
Hripcsak (2003) both appear to use the same dataset of six clinical conditions found in
200 admission chest radiograph reports. Thomas et al. (2005) develop their system with
ankle radiological reports (“normal,” “fracture,” or “neither normal nor fracture”) and
then test on spine and extremities reports. Aronow et al. (1999) look at classifying
mammograms using POS tagging and noun phrases. They perform negation detection
using NegExpander, and also examine uncertain assertions. Taira et al. (2001) focus their
research on thoracic radiological reports.
Most studies of medical narratives either focus upon one type (i.e., genre) of
medical narrative (e.g., discharge summaries or radiological reports) or look at assigning
a limited set of clinical conditions (i.e., categories) to free texts. To date, reports have not
shown generalizability across multiple genres and categories. Additionally, neither
machine learning approaches nor hand-crafted rules have definitively been shown to be
the best solution to assigning categories to medical reports. In fact, some studies (Uzuner,
Goldstein, Luo, & Kohane, 2008) have shown that rule-based and machine learning systems (e.g., SVMs) perform similarly on specific classification tasks, implying that the skill of those
wielding the tools for the creation of a computerized classification system may be more
important than the actual approach taken for implementation of those systems.
As previously described, labeled data are expensive to develop (Zhong, 2005), and
may not always be readily available (Zhou, 2009). In addition, Chawla, Japkowicz, and
Kotcz (2004) note issues when presented with a non-uniform distribution of samples
among the classes (e.g., the occurrence of an intrusion on a network when compared to
the large number of valid transactions on the network). They believe that the imbalance
caused by the non-uniform distribution of classes is pervasive and may cause suboptimal classification performance. Machine learning approaches to building
classifiers may simply ignore the less-well represented classes in order to maximize
overall performance. Proposed solutions to this imbalance problem include data
re-sampling (Chawla, Bowyer, Hall, & Kegelmeyer, 2002) and one-class machine
learning approaches (Raskutti & Kowalczyk, 2004). Re-sampling can suffer from the
removal of examples if the data are under-sampled, and from over-fitting if the data are over-sampled.
While the one-class machine learning approaches work well for binary classification, they
do not address multi-labeled, multi-class classification tasks.
2.5. Summary
This dissertation addresses some of the gaps and issues noted in the literature.
Specifically, given the unsettled question regarding preferred terminology, this
dissertation investigates the role of preferred terminology in the classification of two
different types of medical reports. Given the issues related to imbalance of classes found
in data sets, this dissertation also reports on a new method of combining classifiers when
presented with a non-uniform distribution of samples among the classes in patient
discharge summaries. Lastly, we apply the results of these investigations to co-training in
order to examine the performance of the combined classifier when presented with a small
set of labeled samples and a large set of unlabeled samples.
Chapter 3. Medical Records and Evaluation Metrics
3.1. Introduction
This chapter first presents the three data sets employed and then describes the
evaluation metrics used in the experiments. The use of multiple data sets provides the
ability to compare results based upon varying characteristics of those data sets, and
improves the potential to generalize results. We employ Precision, Recall, and F-measure,
metrics commonly used to evaluate NLP systems.
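For reference, in terms of true positives (TP), false positives (FP), and false negatives (FN), the standard definitions are (F-measure is shown here in its balanced form, which weights Precision and Recall equally):

\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F\text{-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]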
3.2. Data
The data for this dissertation were provided by the Cincinnati Computational
Medicine Center (CMC) and the Informatics for Integrating Biology and the Bedside
(i2b2). In 2007, the CMC organized a challenge (Pestian et al., 2007) and released data
consisting of de-identified and classified radiological reports. The i2b2 released two data
sets consisting of de-identified and classified discharge summaries. The first i2b2 data set
came from the 2006 Shared-Task and Workshop on Challenges in Natural Language
Processing for Clinical Data: Smoking Challenge (Uzuner et al., 2008). The second i2b2
data set came from the 2008 Shared-Task and Workshop on Challenges in Natural
Language Processing for Clinical Data: Obesity Challenge (Informatics for Integrating
Biology and the Bedside, 2008; Uzuner, 2008).
While the three data sets exhibit some differing characteristics, they do contain
similarities. The records of the CMC data set (see Figure 1) can be described as being
well formatted and written concisely and efficiently (i.e., they contain very little extraneous
information). The records of the i2b2 data sets (see Figure 2) can be described as being
semi-structured and verbose. Each of the data sets came from a relatively small
community, and from a limited geographic area. The CMC data were collected from the
Cincinnati Children’s Hospital Medical Center’s Department of Radiology. The i2b2 data
were collected from Partners Healthcare, a system of hospitals in eastern Massachusetts.
In each of these data sets, variation in vocabulary that might arise from the use of regional
expressions would be limited. This would be especially true for the CMC data set since it