
Research Results Report


Automatic Classification of Pathology Reports into SNOMED Codes


June 2008


By
Weihang Zhang

Supervisors: Prof. Jon Patrick
             Dr. Irena Koprinska

ABSTRACT

Automatic classification of pathology reports into SNOMED codes is an
important task for unlocking the wealth of health care information in
patient reports, which contain a great deal of formal terminology, much
of it used in non-standardised forms.

In this thesis, we introduce an automatic medical coding system that
draws on text categorization technology, the SNOMED terminology, and the
TTSCT concept conversion system. In order to find a better text feature
extraction method for this specific data source from SWAPS, we propose a
new method, called Section Manipulation, to deal with the
semi-structured report text.

The main goal is to convert medical notes into predefined medical codes,
which makes future retrieval of medical information more accurate and
supports clinical research practice and decision making.

Keywords: text categorization, SNOMED, TTSCT, health information,
natural language processing



ACKNOWLEDGEMENTS

I would like to express my appreciation to my parents for their solid
support, which gave me the patience and perseverance to tackle the
difficulties and problems encountered in my study and experiments.

The greatest thanks go to my supervisor, Prof. Jon Patrick, who is
always kind to people and dedicated to his work. During this project he
arranged almost every study material and piece of equipment for me, not
to mention the many times he pointed out directions for the study and
corrected my writing mistakes. His encouragement helped me a great deal.

I am very grateful to Dr. James Curran for his critical suggestions on
my experimental methods and presentation technique, which gave my work a
new life.

Thanks also to Dr. Irena Koprinska, who was always supportive and
patient when I needed new clues for experiment designs.

Mr. Yefeng Wang, the greatest of friends, was always available when I
needed to discuss technical details.

Last but not least, our NLP group is indeed a family.


CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1. INTRODUCTION
  1.1 Motivation
  1.2 Contribution
  1.3 Thesis Structure
CHAPTER 2. BACKGROUND
  2.1 SNOMED Terminology
    2.1.1. Introduction
    2.1.2. Reference Terminology (RT)
    2.1.3. The Fundamental Features
    2.1.4. SNOMED CT
  2.2 TTSCT System
CHAPTER 3. PREVIOUS RELATED WORK
  3.1 Previous Efforts on Automating Medical Text Categorization
    3.1.1. Early Systems
    3.1.2. The Focus on Domain Knowledge
  3.2 Medical Coding Implementations and Analyses
    3.2.1. Concept Indication Engagement
    3.2.2. ICD
    3.2.3. SNOMED
    3.2.4. UMLS
CHAPTER 4. THE DATA
  4.1 The SWAPS Medical Records
  4.2 Data Inspection
    4.2.1. Database Structure
    4.2.2. Text Sample Description
    4.2.3. Histograms for the Data
CHAPTER 5. EXPERIMENT FRAMEWORK
  5.1 The Overview of Coding Work Flow
  5.2 Text-Vector Representation (Indexing)
    5.2.1. Vector Space Document Representation
    5.2.2. Indexing Schemes
  5.3 Feature Selection - Dimensionality Reduction
    5.3.1. Document Frequency Thresholding (DFT)
    5.3.2. Information Gain (InfoGain)
  5.4 Classifiers Construction
    5.4.1. Decision Tree (DT)
    5.4.2. Maximum Entropy (MaxEnt)
    5.4.3. Support Vector Machine (SVM)
  5.5 Evaluation Method
CHAPTER 6. THE FEATURE EXTRACTION
  6.1 Word Stemming
  6.2 N-Gram Tokenization
  6.3 Stopwords Exclusion
  6.4 Negation and Concept Detection
  6.5 Section Manipulations
    6.5.1. Section Exclusion
    6.5.2. Section Chunking
    6.5.3. Section Merging
CHAPTER 7. EXPERIMENT RESULT AND DISCUSSION
  7.1 System Component Comparison
    7.1.1. Machine Learners
    7.1.2. Text Representation Methods
    7.1.3. Stemming Strategy
    7.1.4. Dimension Reduction
  7.2 Feature Manipulation
    7.2.1. N-Gram
    7.2.2. TTSCT Concept ID Integration (Negation Included)
    7.2.3. Text Section Exclusion
CHAPTER 8. FUTURE WORK AND CONCLUSIONS
  8.1 Future Experiments and Utilizations
    8.1.1. Text Section Chunking and Merging
    8.1.2. Inversed Training Data Collecting
    8.1.3. Web-Based Interface
  8.2 Conclusion
CHAPTER 9. REFERENCE LIST


LIST OF FIGURES

Figure 2-1 TTSCT Conversion Example
Figure 4-1. The Database Structure
Figure 4-2. Table "swaps_hosrep.exam_table"
Figure 4-3. Table "swaps_hosrep.resultdetails_table"
Figure 4-4. Table "swaps_hosrep.resulttext_table"
Figure 4-5. Table "swaps_hosrep.snomed_codes_table"
Figure 4-6. The histogram of codes assigned to the 10K reports
Figure 4-7 The histogram of Mark-ups
Figure 4-8 The sections within the 10K reports
Figure 5-1 The overview of SNOMED Coding System
Figure 5-2 Decision Tree
Figure 5-3 Support Vector Machine class boundary estimation method
Figure 8-1 System Design: Classifier Generation Module
Figure 8-2 System Design: Classification Production Module


LIST OF TABLES

Table 4-1. A sample report from SWAPS database ("Request-ID" = "1")
Table 7-1 The SNOMED 3 codes histogram and distribution of the sample 10,000 reports
Table 7-2 An experiment result table example
Table 7-3. Comparison of SVM, MaxEnt and DT classifiers on the anatomical pathology 10K subset
Table 7-4 Comparison of Attribute Representation Methods
Table 7-5 The Comparison between Stemming Strategies for the 10K Anatomical Pathology corpus
Table 7-6. Frequency Thresholds Comparison
Table 7-7 Comparison of N-Grams
Table 7-8 SNOMED Concept Impact
Table 7-9 Exclusion Trial for the CLINICAL HISTORY Section
Table 7-10 Exclusion Trial for the CLINICAL HISTORY Section
Table 7-11 Ideal Argument List
Table 7-12 Optimal Argument List
Table 7-13 Optimized System Performance


CHAPTER 1. INTRODUCTION

1.1 Motivation

The medical notes and patient reports prepared by clinicians are usually
presented in the form of natural language text (free text). They contain
a great amount of information, such as descriptions of clinical history,
microscopic and macroscopic pathology observations, and diagnostic
conclusions. A great deal of formal terminology is contained in such
reports, but a substantial amount of it is used in non-standardised
forms.

In principle, clinical notes could be correctly recorded in a coded form
with a medical terminology system. In practice the notes are written and
stored as free text, so text categorization technology is needed to
convert these medical notes into predefined medical codes, which makes
future retrieval of medical information more accurate and helps clinical
research practice and decision making.

The Australian government has adopted the SNOMED CT system to encode
clinical descriptions and patient health records. One of the main
pre-processing tasks of this project has therefore emerged as the
conversion of the narrative in clinical records to a set of SNOMED
codes. The task itself is to predict the SNOMED 3 codes for whole
anatomical pathology reports.

1.2 Contribution

The source material for this study consisted of 10,000 anatomical
pathology reports coded with SNOMED 3 codes for morphology and topology.
The aim of the experiments was to build a supervised classifier that
could correctly predict these two types of class labels.

The key contribution of this thesis lies in a domain-driven method for
feature selection, in which domain knowledge and natural language
processing techniques have been integrated to achieve higher performance
in the classification task.

During development, several machine learners were evaluated on their
classification capability. Both stratified and non-stratified resampling
strategies were adopted for 10-fold cross validation to determine which
training method would produce better classifiers for this multi-label
classification task.
Previous work and the initial experiment results showed that simple
methods would not give satisfying results if such a categorization
system were implemented merely with classification algorithms from
machine learners and the statistical methods of classical categorization
implementations.

In order to achieve higher classification accuracy, we employed two key
techniques, domain knowledge adoption and natural language processing,
in the task. The subsequent experiment results showed that applying both
domain knowledge and natural language processing to the information
retrieval steps gave a better result than the traditional statistical
text categorization framework.

A further variation in identifying suitable features for the classifier
is to consider the segmentation of the text created by the section
headings written into most reports. Features could then be formed for
each section rather than treating all features as coming from a single
report. As the report is segregated by section titles, a novel way of
dealing with the sections is to detect all medical concepts in each
section while distinguishing the different perspectives of literally
identical concepts.
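One way to realize such section-wise features is to tag every token with
the heading of the section it appears under, so that the same word in two
sections yields two distinct features. The sketch below assumes a
hypothetical set of headings for illustration; real SWAPS reports define
their own section titles.

```python
import re

# Hypothetical section headings, for illustration only.
HEADINGS = ("CLINICAL HISTORY", "MACROSCOPIC", "MICROSCOPIC", "CONCLUSION")

def section_features(report):
    """Split a report on its section headings and prefix each token with
    the section it came from, so 'carcinoma' in CLINICAL HISTORY and
    'carcinoma' in CONCLUSION become distinct features."""
    pattern = "(" + "|".join(HEADINGS) + "):"
    parts = re.split(pattern, report)
    features = []
    # With a capturing group, re.split yields
    # [preamble, heading, body, heading, body, ...]
    for heading, body in zip(parts[1::2], parts[2::2]):
        tag = heading.replace(" ", "_")
        features += [f"{tag}:{tok.lower()}" for tok in body.split()]
    return features

feats = section_features(
    "CLINICAL HISTORY: query carcinoma CONCLUSION: no carcinoma seen")
```

Here "carcinoma" produces the two separate features
CLINICAL_HISTORY:carcinoma and CONCLUSION:carcinoma, which is exactly the
distinction between literally identical concepts that section segregation
is meant to capture.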

1.3 Thesis Structure

This thesis is composed of seven chapters excluding the introduction.
Chapter 2 describes the background of the two main components this
project has adopted, the SNOMED terminology and the TTSCT system. In
Chapter 3 we describe previous related work on medical code
classification tasks and compare some popular terminology systems such
as ICD, SNOMED, and UMLS. We describe the data we are trying to classify
in Chapter 4, and how we built the experiment framework for this project
in Chapter 5. As the text features extracted from the data source
heavily impact the text categorization results, we describe several
applied methods and techniques of feature extraction in Chapter 6. In
Chapter 7 we discuss the experiment results, and the conclusions are
given in Chapter 8.


CHAPTER 2. BACKGROUND

2.1 SNOMED Terminology

2.1.1. Introduction

The use of terminologies in the health care domain has gained growing
attention in recent years. Different needs arise in many scenarios:
researchers need comorbidity information in existing claims data; health
care organizations need more detailed data on how to take care of
patients; medical information system vendors need a terminology for
representing a problem list in a computerized medical record product.

As awareness of the reuse of terminologies has grown, a much more
sophisticated understanding of them has also developed. Terminologies
can be divided into several kinds by usage: one type may be used to
implement a user-friendly structured data entry interface; another may
be adopted to optimize natural language processing; a third may be used
to enable the storage, retrieval and analysis of clinical data. The last
kind gave birth to the concept of a "reference terminology" for clinical
data.

Having been under development for over 40 years, the Systematized
Nomenclature of Human and Veterinary Medicine (a.k.a. SNOMED
International) has gathered a comprehensive set of over 150,000 records
in twelve specified axes. Its concepts cover chemicals, drugs, normal
and abnormal functions, anatomy (topography), morphology (pathologic
structure), enzymes and other body proteins, symptoms and signs of
disease, living organisms, physical agents, spatial relationships,
occupations, social contexts, diagnoses and procedures. Thus SNOMED
forms a suitable starting point for reference terminology development
[1].

2.1.2. Reference Terminology (RT)

A reference terminology for clinical data is a cluster of concepts and
relationships that provides a common reference point for the comparison
and aggregation of data related to the health care process. The purpose
of a reference terminology for clinical data is to enable retrieval and
analysis of data covering disease causes, patient treatment, and the
outcomes of the overall health care process. Different reference
terminologies may be optimized for other health care information
applications, for example, primary care.

2.1.3. The Fundamental Features

SNOMED RT developed several enhanced features over prior editions of
SNOMED [2]:

- Hierarchies in SNOMED RT maintain strict supertype-subtype
  relationships. Following this structure, a child concept is always
  covered by its parent concept. For instance, it would be correct to
  describe "Blood-pressure-education" as a kind of "Nursing-procedure"
  in a structure where "Blood-pressure-education" is represented as a
  child of "Nursing-procedure".

- Concepts are defined by their placement in hierarchies and by
  additional constraints called "Relationship Types" or "Roles", whose
  target values are also SNOMED concepts. For instance, Appendectomy
  (P1-57450) is restricted by a relationship named
  "ASSOCIATED-TOPOGRAPHY", whose value is Appendix (T-59200).

- SNOMED RT collected textual definitions, which are especially useful
  when the underlying description logic fails to fully define a
  procedure.

- A fully-specified name is provided for each concept.
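The supertype-subtype structure is what makes subsumption queries cheap:
deciding whether one concept is a kind of another is just a walk up the
parent chain. The fragment below is a toy sketch of that idea; the parent
table is illustrative and not real SNOMED RT content.

```python
# Toy is-a fragment; illustrative entries, not the real terminology.
PARENT = {
    "Blood-pressure-education": "Nursing-procedure",
    "Nursing-procedure": "Procedure",
    "Appendectomy": "Procedure",
}

def is_a(concept, ancestor):
    """Walk the supertype chain: a child concept is always covered by
    every concept above it in the hierarchy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False

covered = is_a("Blood-pressure-education", "Nursing-procedure")
```

So the check confirms "Blood-pressure-education" is covered by
"Nursing-procedure", while "Appendectomy" is not, exactly as the
hierarchy above prescribes.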

2.1.4. SNOMED CT

The College of American Pathologists, the owners of SNOMED, and the
UK's Ministry of Health integrated the UK's Clinical Terms Version 3
(formerly known as the READ Codes) into SNOMED RT, creating the SNOMED
Clinical Terms (SNOMED CT).

SNOMED CT was first released in January 2002. Its main strength is the
combination of the power of SNOMED RT, which covers the basic sciences,
laboratory and specialty medicine including pathology, with the richness
of the UK's work in primary care. The resulting product became an
accurate and comprehensive clinical reference terminology that provides
unexcelled clinical expressivity and understandability for clinical
recording and reporting.

2.2 TTSCT System

The TTSCT system [3] provides an interface for users to detect medical
concepts in a free text string and match them to SNOMED CT codes in real
time.

The system utilizes NLP techniques to enhance lexical concept
terminology mapping; it also implements two recognisers, a qualifier
recogniser and a negation recogniser, to recognize negated concepts and
composite terms, with the goal of more effective information extraction
and retrieval.

Figure 2-1 TTSCT Conversion Example

Figure 2-1 shows the result of a medical note text processed by TTSCT:
the underlined words are matched to concept ids, which are shown as
numbers on the right, completed with the concept descriptions.

With the support of the TTSCT system, the key concepts in clinical notes
can be automatically mapped to SNOMED concept ids, by which means the
clinical information locked in clinical notes and patient reports can be
extracted in real time.
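The core of such a lookup can be sketched as a longest-match dictionary
scan with a simple negation cue check. The term dictionary and cue list
below are toy stand-ins for illustration; TTSCT actually maps against the
full SNOMED CT release and uses richer recognisers.

```python
# Toy term-to-concept-id dictionary; illustrative entries only.
TERMS = {
    "chest pain": "29857009",
    "pneumonia": "233604007",
}
NEGATION_CUES = ("no", "denies", "without")

def detect_concepts(text):
    """Greedy longest-first matching of dictionary terms, flagging a
    concept as negated when a cue word appears just before it."""
    tokens = text.lower().split()
    found = []
    i = 0
    while i < len(tokens):
        for n in (2, 1):  # try two-word terms before single words
            phrase = " ".join(tokens[i:i + n])
            if phrase in TERMS:
                negated = i > 0 and tokens[i - 1] in NEGATION_CUES
                found.append((phrase, TERMS[phrase], negated))
                i += n
                break
        else:
            i += 1
    return found

hits = detect_concepts("Patient denies chest pain but has pneumonia")
```

On this sentence the matcher returns "chest pain" as a negated concept and
"pneumonia" as an asserted one, mirroring the negation recogniser's role
of keeping negated findings from being indexed as positive observations.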



CHAPTER 3. LITERATURE REVIEW

3.1 Previous Efforts on Automating Medical Text Categorization

It is self-evident that manual classification of diagnoses is a
labor-intensive process that consumes significant resources. Hence it is
worthwhile for researchers to develop systems for automating such
medical text classification tasks.

3.1.1. Early Systems

Building on expert knowledge concepts, Yang et al. [4] produced a system
named ExpNet which used category-ranking methods for automatically
coding the diagnosis reports at the Mayo Clinic. The ExpNet technique
extended and enhanced previous techniques (Linear Least Squares Fit and
Latent Semantic Indexing) and reached a level where the average
precision was 83% and recall was 81%. One weakness of this system was
that its automatic coding method only worked well with short phrases
(fewer than six words) containing merely a single diagnostic rubric.

To evaluate the expert knowledge approach, Chapman et al. [5] compared
the outcomes of expert-crafted rules, based on Bayesian networks and
decision trees, against a collection of chest X-ray reports supporting
acute bacterial pneumonia. They randomly selected 292 reports encoded by
a Natural Language Processing (NLP) system, and mistakes that occurred
in the reports were manually corrected. In their implementation three
expert systems were employed to determine whether the encoded
observations supported pneumonia. The output of each expert system was
compared with the two other systems to vote on a result, and the result
was further judged by four physicians. The conclusion showed that all
three expert systems performed comparably to physicians.

3.1.2. The Focus on Domain Knowledge

Such studies typically focused only on standard components of inductive
learning, for instance the chosen algorithm or the amount of training
data. NLP systems have commonly been used in the preparation phase to
structure free text clinical data by extracting observations and
descriptive modifiers. Alternatively, to prevent substantial variation
in data preparation, expert knowledge can be used to determine the
subset of attributes or features for the classification work.

Wilcox and Hripcsak [6] suggested using domain knowledge for feature
selection to enhance the performance of machine learning algorithms.
Later they delivered an analysis of the effect of expert knowledge on
the inductive learning process in creating classifiers for medical text
reports [7]. They randomly selected 200 reports from a set of chest
radiograph data, which had been classified by physicians with six
clinical conditions. Using NLP, they converted the medical text reports
to a structured form, and then created classifiers based on various
degrees and types of expert knowledge and different inductive learning
algorithms. In producing their results, they measured the performance of
the different classifiers, the cost of inducing classifiers, and
training-set size efficiency. Their results showed that for medical text
report categorization performance, expert knowledge acquisition is more
significant and more effective than knowledge discovery. Therefore
building classifiers should focus more on knowledge acquisition from
experts than on trying to learn this knowledge inductively.

3.2 Medical Coding Implementations and Analyses

3.2.1. Concept Indication Engagement

Building on the groundwork laid by Yang, Pakhomov et al. [8] implemented
an automatic diagnosis coding system which made it possible to use
specially trained medical coders to categorize diagnoses for billing and
research purposes. Their system uses the certainty concept indicated by
example-based classification to direct subsequent processing, and then
assigns classification codes from a pre-defined classification scheme to
the natural language diagnoses generated by the MI-indexed EMR database
at the Mayo Clinic. They assumed that diagnostic statements were highly
repetitive, and that new diagnosis reports could be accurately and
automatically coded simply by looking them up in the database of
previously classified entries. Codes are therefore generated only by
matching the diagnostic text to frequent examples in the database of 22
million manually coded entries. Manual review is needed only when the
codes are generated with a lower certainty level. Their highest result
achieved macro-averaged 98.0% precision, 98.3% recall and an F-score of
98.2%. Over two thirds of all diagnoses were coded automatically with
high accuracy.

3.2.2. ICD

The Development of ICD

The International Statistical Classification of Diseases and Related
Health Problems (ICD) provides medical codes to classify diseases and a
wide variety of signs, symptoms, abnormal findings, complaints, social
circumstances and external causes of injury or disease. The goal of the
ICD system is to unify each health condition and to group disease
categories.

ICD-9 was delivered by the WHO in 1977. Development of ICD-10 began in
1983 and finished in 1992, and the first draft of ICD-11 is expected in
2008.

Application - "Automatic Code Assignment System"

Crammer et al. [9] integrated three coding systems and presented a
system for assigning ICD-9-CM medical codes to unstructured radiology
reports. In their implementation, three automated systems were first
developed, along with a learning system which adopted natural language
features. Then a rule-based system was designed to match codes to the
medical texts and ICD code descriptions. The rule-based system requires
no training but uses a description of the ICD-9-CM codes and their
types. For a given report, the system parses both the clinical history
and the impression into sentences, and then checks each sentence against
the code descriptions, setting a flag when all of the description words
occur. If a matched code is a disease and a negation word appears in the
sentence, the flag is removed. In the final stage, a specialized system
judged the most common codes as the assigned values. Their system was
evaluated on the Computational Medicine Center's challenge, with labeled
training data of 978 radiology reports. It performed outstandingly on
the test data of 976 documents, compared with both human annotators and
other automated systems. This combined system was an improvement over
each individual system.
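The sentence-level rule just described can be sketched as follows. The
code descriptions below are a toy pair invented for illustration, not the
real ICD-9-CM set, and the negation word list is likewise an assumption.

```python
# Toy ICD-9-CM-style code descriptions: (description, is_disease).
CODES = {
    "486": ("pneumonia organism unspecified", True),
    "786.2": ("cough", True),
}
NEGATION_WORDS = {"no", "not", "without", "denies"}

def assign_codes(report):
    """Flag a code when every word of its description appears in a
    sentence; drop the flag again when the sentence contains a negation
    word and the code denotes a disease, mirroring the rule system
    described above."""
    flagged = set()
    for sentence in report.lower().split("."):
        words = set(sentence.split())
        for code, (description, is_disease) in CODES.items():
            if set(description.split()) <= words:
                if is_disease and words & NEGATION_WORDS:
                    continue
                flagged.add(code)
    return flagged

codes = assign_codes(
    "Persistent cough for two weeks. "
    "No evidence of pneumonia organism unspecified.")
```

Here the cough code is flagged from the first sentence, while the
pneumonia code is suppressed because its matching sentence is negated,
which is the behavior the rule-based component relies on.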

Application - "Shared-task Involving Multi-label Classification System"

Pestian et al. [10] reported their system on the same task as above.
They presented a shared-task involving a multi-label classification
system. First, ambiguous medical jargon, abbreviations and acronyms were
filtered out; secondly, for patient privacy and for the machine-learning
methods, all human first names were replaced with "Jane" or "John"
depending on gender, and surnames with "Johnson". Manual inspection was
carried out before medical code assignments were made: all data were
manually reviewed, data potentially violating privacy regulations were
deleted, and geographical words were changed. The data were then
annotated by the coding staff and two independent coding companies.
Finally, they conducted a majority annotation process to produce
agreement statistics and to create the final codes. By the
macro-averaged F-measure over 1167 label assignments and a
cost-sensitive measure, this system performed even better than
Crammer's [9].

3.2.3. SNOMED

The Systematized Nomenclature of Medicine (SNOMED) is another classification system possessing its own multi-axial and hierarchical structure. SNOMED exists in a number of versions, the most recent being SNOMED Clinical Terms (SNOMED CT).

Comparison with ICD

Helen Moore [11] made a comparison between SNOMED-CT and ICD-10-AM. She extracted medical terms from 160 paper-based medical records, and coded the terms in SNOMED CT and ICD-10-AM. A rating process compared the two systems on two features: whether a match existed, and whether the coded terms specifically related to clinical concepts. The outcome of her work indicated that ICD-10-AM exactly matched 2.7% of the source terms while SNOMED-CT achieved 48.6%; but in most cases (72.2%) the source term was more specific than the SNOMED-CT concept. She suggested SNOMED-CT would be suitable for consideration for adoption as the clinical terminology system.

Application

Melton et al. [12] applied SNOMED CT to build their patient-based similarity metrics, which were presented as an important case-based reasoning tool and an aid to patient care applications. All patient cases (1989-2003) from the Columbia University Medical Centre data repository were converted to SNOMED CT concepts using automated tools. The demographic and ICD9-CM codes were converted to SNOMED CT concepts using MRCONSO from the UMLS. Metrics were computed overall and along each of the 18 SNOMED CT axes, and four of the five metrics used the SNOMED CT defining relationships. This application showed that, for distance metrics, the defined relationships of the terminology and the principles of information content provided valuable information, and that the SNOMED CT axes were helpful in narrowing in on the features used in expert determination of similarity.

Evaluations

In 2003 Wasserman et al. [13] evaluated SNOMED CT for the terminology and concept coverage needed for the comprehensive encoding of a medical diagnosis in a real-world setting. They used a computerised physician order entry (CPOE) system to check all submitted requests for clinical terms which were not represented in SNOMED CT. The results showed that SNOMED CT covered 88.4% of their prepared diagnosis and problem list terms, and achieved a concept coverage of 98.5%. Such scores indicated that SNOMED CT is "a relatively complete standardized terminology on which to base a vocabulary for the clinical problem list".

Richesson et al. [14] made a similar estimation of SNOMED CT. They evaluated the coverage provided by SNOMED CT for clinical research concepts, and further the semantic nature of those concepts, using 17 case-report forms (CRFs) from which a set of 616 items were identified and coded by the presence and nature of SNOMED CT coverage. A basic frequency analysis showed that more than 88% of the core clinical concepts from these data items were covered by SNOMED CT. It was reported that although less suited to representing the whole of the information recorded on CRFs, SNOMED CT represented clinical concepts well.

3.2.4. UMLS

ICD-9, Read Codes, MedDRA, CPT, and others represent various "coding schemes", with which researchers are beginning to relate the disparate biomedical ontologies [15].

The Unified Medical Language System (UMLS) is a compendium of many controlled vocabularies in the biomedical field. It provides a mapping structure between these vocabularies and therefore supports translation of terms between the various terminology systems. It may also be considered a comprehensive thesaurus and ontology of biomedical concepts.

Synonymy Description

Michael Schopen [16] reported that the integrated vocabularies are the Medical Subject Headings (MeSH) in eight languages, ICPC-93 in 14 languages, the WHO Adverse Drug Reaction Terminology in 5 languages, SNOMED-2, SNOMED-3, and the UK Clinical Terms (formerly the Read Codes). The WHO version of ICD-10 is available in two languages: English (plus an Americanized version) and German. Furthermore, the Australian modification ICD-10-AM has been integrated (also with an additional Americanized version). ICD-9 is only available in its US clinical modification.

Focusing on English as an example, the UMLS is built on one view of synonymy, but its structure also contains all the individual views of synonymy from its source vocabularies. Although development and vocabulary maintenance became a knowledge-based automatic process, manual correction is still heavily used in determining synonymy. Aiming to investigate the results of human judgment of synonymy, Fung et al. [17] wanted to point out a way of integrating SNOMED CT into the UMLS, considering the alignment of two different views of synonymy, given that the two vocabulary systems have different design purposes and editing principles. 60 pairs of potentially controversial SNOMED CT synonyms were reviewed by 6 UMLS editors and 5 non-editors. They were scored for accuracy depending on the degree of synonymy. In order to evaluate accuracy, the synonymy scores of each subject were compared to the overall averaged score of all subjects. The difference in score between UMLS editors and non-editors was assessed through their mean synonymy scores. The results showed that average accuracy was 71% for UMLS editors and 75% for non-editors, indicating that their judgments of synonymy were comparable.

Application

Based on natural language processing, Friedman et al. [18] reported a method to automatically map an entire clinical document to codes with modifiers, and quantitatively evaluated the method. The studies used an 818,000-sentence corpus of discharge summaries of de-identified patients admitted to New York Presbyterian Hospital during 2000 to obtain two randomly selected 150-sentence test sets. The MedLEE NLP system was employed to encode clinical documents. An encoding table was created to select terms which were complementary to UMLS terms. The known types of errors produced by MedLEE were automatically removed during the table generation. All the remaining entries remained in the coding table and were subsequently used to parse and encode sentences. The parsed sentences of the corpus were used for mapping medical text to codes. One test set reached a UMLS code recall of .77 (95% CI .72-.81) based on MedLEE processing, compared with .83 (.79-.87) by seven experts' manual processing. The second set was measured by precision: the automatic system achieved .89 (.87-.91), while the experts ranged from .61 to .91. This method, which combined information extraction, UMLS coding and NLP, appeared to be comparable to or better than six experts. The advantage of the method is that it maps text to codes along with other related information, rendering the coded output suitable for effective retrieval.



CHAPTER 4. THE DATA

4.1 The SWAPS Medical Records

The data for this project are presented as a collection of approximately 400,000 pathology reports from the SWAPS Anatomical Pathology Database. All of these medical notes are prepared by clinicians in the form of natural language text.

4.1.1. Internal Structure of Pathology Reports

Each report has a nominal internal structure, although not all reports adhere to it. The sections of a report are indicated by typographical dividers. Such dividers are wrapped in <Title> mark-ups. Within this semi-structure, the sections of a report demarcate information such as clinical history descriptions, microscopic observations, macroscopic observations, and diagnosis conclusions.

4.2 Data Inspection

Automatic processing of the text will only be successful if it exploits aspects of its organisation, which we must discover by inspection.

4.2.1. Database Structure

As all the data provided by SWAPS are stored in a relational database, it is necessary to present an overview of the database structure in which the tables are arranged.


Figure 4-1. The Database Structure

Figure 4-1 shows the four tables which make up the database; the details of the tables are elaborated in the following paragraphs.

Table "swaps_hosrep.exam_table" stores the examination descriptions. Figure 4-2 shows this table, with the first eight records displayed.

Figure 4-2. Table "swaps_hosrep.exam_table"


The second table is named "swaps_hosrep.resultdetails_table" (Figure 4-3). In this table, two columns are important for this study: "RequestID" and "SnomedCode". The first column is the primary key from another table which maintains an index of the entire set of pathology reports, and the "SnomedCode" column stores the SNOMED codes which have been assigned to the report. Here we can see clearly that one report can have several codes. The automatic coding system for this project has the objective of assigning a set of codes to an unseen report through multi-label classification.

Figure 4-3. Table "swaps_hosrep.resultdetails_table"

Another table, named "swaps_hosrep.resulttext_table", stores the pathology reports (Figure 4-4). All reports are uniquely indexed by "RequestID". The number of reports is about 400,000; the first ten reports are shown.

Figure 4-4. Table "swaps_hosrep.resulttext_table"

The last table gives the description of the meaning of each SNOMED code (Figure 4-5). Some entries are left blank because, before the information was stored in a database application, it was saved in a simple file system. However, the previous data transformation work has guaranteed that all SNOMED codes are attached to descriptions as long as they are registered. Figure 4-5 shows the first 12 entries of this table.

Figure 4-5. Table "swaps_hosrep.snomed_codes_table"

This database provides sufficient information for the supervised training part of the classification process. As we are concentrating on which codes are assigned to a report, we have used only the two tables "swaps_hosrep.resultdetails_table" and "swaps_hosrep.resulttext_table".

A relatively small dataset of 10,000 reports was selected from this database for the development of the experimental coding system. This decision was made because the entire collection of reports would require an unacceptably large amount of time to process completely, and because the present hardware environment would not readily support processing of this substantial corpus.

4.2.2. Text Sample Description

A sample report from the SWAPS database is presented in Table 4-1.

<Title>CLINICAL HISTORY</Title>

Biopsy of discoid erythematosus like lesion from right cheek ? DLE.

<Title>MACROSCOPIC</Title>

LABELLED `RIGHT CHEEK LESION'. An ellipse 12 x 3mm with subcutis to 3mm. A poorly defined pale nodular lesion 3 x 3mm. It appears to abut the surgical margin.

Representative sections embeded, A tips face on, B lesion and surgical margin. (MR 17/4)

<DOT>TA</DOT>

<Title>MICROSCOPIC</Title>

Section shows hyperkeratosis with occasional follicular plugging, epidermal atrophy and severe sundamage to dermal collagen. A dense chronic inflammatory cell infiltrate, both superficial and deep is present, mainly in a perivascular and periadnexal distribution. No liquefaction degeneration of the basal layer, no dermal oedema and no interface dermatitis are seen. PAS stain reveals no thickening of the epidermal basement membrane and only an occasional fungal spore on the skin surface.

Immunofluorescence for immunoglobulins and complement fractions are negative.

The differential diagnosis rests between chronic discoid erythematosus, lymphocytic infiltration of skin of Jessner and the plaque type of polymorphous light eruption. The presence of marked solar damage to collagen, the absence of basal liquefaction degeneration and the negative immunofluorescence favours polymorphous light eruption. A reaction to drugs or an insect bite is also a possibility. No evidence of malignancy.

Reported 24/4/98


Table 4-1. A sample report from the SWAPS database ("Request-ID" = 1)


The sample report shows some features of the pathology reports:

1. Each section of the report is given a title, which is wrapped in the <Title> mark-up tag;

2. Other mark-up tags are used to frame non-title material. One example here is the <DOT> tag, which was used to decorate the text it wrapped;

3. All section headings carry a specific medical meaning, which distinguishes them from formatting-purpose headings such as "header" or "footer";

4. While the "CLINICAL HISTORY" and other sections of the report are helpful as references for clinicians, it is uncertain whether they can be useful for automatic processing to calculate the text categories.

4.2.3. Sample Selection

You need to write a section on how you selected your sample.

4.2.4. Histograms for the Data

We carried out several statistical studies in order to identify the descriptive characteristics of the data set. They are described in the following sections.

SNOMED Code Distribution

As only 10K pathology reports were selected for the development of this project, we are merely concerned with how many codes and which codes are assigned to these reports. The histogram of the codes is presented in Figure 4-6.


Figure 4-6. The histogram of codes assigned to the 10K reports.

The whole set of 10K reports is assigned 29,961 codes, comprising 885 types of codes including the "null" value (which denotes the event that a report had been examined by a clinician but was not assigned a code). In this set of codes, 684 types have a frequency of occurrence of less than 10, of which 129 types have a frequency of 2, and 308 types occur only once. However, in order to train this system well, we need a significant amount of training information. The top 10 codes are used in this study.

Distribution of Mark-up Tags

Figure 4-7. The histogram of mark-ups.

As we saw in the previous description, many mark-up tags are contained inside the reports. In order to find the meaning of each mark-up tag, a histogram of their frequencies was produced (Figure 4-7). The whole corpus (400K reports) was searched. The top 5 mark-up tags are the only ones of concern, since they account for the significant majority of occurrences. The <Title> tag divides a report into its sections, and the other 4 tags are format decorations.

Sections Histogram

Since the report is composed of sections, two questions raised our curiosity: how do clinicians regard these sections, and how can automatic processing exploit them?

Figure 4-8. The sections within the 10K reports.

Figure 4-8 shows that almost every report has the "CLINICAL HISTORY" and "MICROSCOPIC" sections, and 76% contain "MACROSCOPIC". This result opens the possibility that these sections can be exploited for restricting the data needed for Natural Language Processing, and for removing false positives from classifications by using only selected sections of each report for classification.

CHAPTER 5. EXPERIMENT FRAMEWORK

Initially the experimentation needs to be designed within an appropriate framework. The aim of this framework is to simulate the whole text categorization workflow; in addition, an evaluation mechanism is necessary to assess the performance of this coding system. Descriptions of this framework, as well as the evaluation methods, are presented below.

5.1 The Overview of Coding Work Flow

Figure 5-1. The overview of the SNOMED Coding System.

Figure 5-1 presents an overview of the SNOMED Coding System used in this project. The work flow of this system consists of four steps: pre-processing, text-vector representation (indexing), feature selection (filtering), and classification.


Pre-processing

The feature extraction task aims to extract the most useful information from the reports for text categorization, and it is performed during the pre-processing stage. This stage is described in detail in Chapter 6, because it adopts many techniques, such as tokenizing and stemming.

5.2 Text-vector Representation (Indexing)

The information in the pathology reports, which is presented as literal strings, is not understandable by classifiers. In order to make it recognisable by classifiers, the strings have to be converted into frequency vectors.

5.2.1. Vector Space Document Representation

Since its first use in the SMART Information Retrieval System [19], the Vector Space Model has become an algebraic model for representing literal documents as vectors of terms. In the vector, each dimension corresponds to a specific term. This strategy has the side effect that a literal document is represented as a set of words without regard to word order [20]. Based on these two issues, we convert a set of documents (i.e. reports) into the form of a term-by-document matrix A, where each entry stands for the weight of a term in a document. Using the convention that a_ik is the weight of term i in document k, we can express this matrix as:

    A = [a_ik],  i = 1, ..., M;  k = 1, ..., N

In this matrix the number of rows M is the number of unique terms, while the number of columns N corresponds to the number of documents. This matrix is usually sparse, because not every term occurs in every document, and the value of M can be so large that the high dimensionality of this space model usually brings significant computational complexity.
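As a small illustration of this representation, the following sketch builds a raw-frequency term-by-document matrix (the toy documents are hypothetical, not the SWAPS data):

```python
from collections import Counter

def term_document_matrix(documents):
    """Build a term-by-document matrix A where A[i][k] is the
    raw frequency of term i in document k."""
    counts = [Counter(doc.split()) for doc in documents]
    terms = sorted(set(t for c in counts for t in c))  # the M unique terms
    # M rows (terms) by N columns (documents); most entries are zero (sparse)
    A = [[c[t] for c in counts] for t in terms]
    return terms, A

# Toy example: three tiny "reports"
docs = ["no evidence of malignancy", "evidence of lesion", "lesion lesion margin"]
terms, A = term_document_matrix(docs)
```

Each row of `A` corresponds to one term and each column to one document, matching the matrix A defined above.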

5.2.2. Representation Schemes

There exist several ways to specify the value of a_ik, the weight of term i in document k; however, most implementations follow two empirical observations [21]:

- The more times a word occurs in a document, the more relevant it is to the topic of the document;

- The more times the word occurs throughout all documents in the collection, the more poorly it discriminates between documents.

Additionally, we define f_ik to be the frequency of term i in document k, and n_i the total number of occurrences of term i. Following [21], three commonly used weighting schemes are described below:

Boolean Weighting:

The simplest method is to treat the weight as an indicator of whether a term appears in a document or not. This scheme can be expressed as:

    a_ik = 1 if f_ik > 0;  a_ik = 0 otherwise

Word Frequency Weighting:

The frequency of the term is taken as the weight itself:

    a_ik = f_ik

Entropy Weighting:

This is credited as the most effective and sophisticated weighting scheme, and is based on information-theoretic ideas [22]. The entropy weighting scheme scans through the corpus and counts the number of documents N, the total number of times n_i that word i occurs in the corpus, and the frequency f_ik of word i in document k. Under this scheme, the weight a_ik for term i in document k is defined as:

    a_ik = log(f_ik + 1) * (1 + E_i),  where  E_i = (1 / log N) * sum_k (f_ik / n_i) * log(f_ik / n_i)

where E_i is called the average uncertainty (i.e. entropy) of term i; it takes the value -1 if term i occurs equally in every document, and 0 if only one document has the term.

Based on these representation schemes, a document is transformed from its full-text version into a document-vector form, by which the contents of the document are described in the computer as a specified collection of term weights. As a result, computers can now simply calculate with these weight figures instead of the complicated text strings.
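The three schemes above can be sketched in Python as follows (a simplified reading of the formulas, where `docs` is a hypothetical list of per-document term-frequency dictionaries):

```python
import math

def boolean_weight(f_ik):
    # 1 if the term appears in the document, 0 otherwise
    return 1 if f_ik > 0 else 0

def frequency_weight(f_ik):
    # the raw term frequency is used directly as the weight
    return f_ik

def entropy_weight(term, doc_index, docs):
    """Entropy weighting: a_ik = log(f_ik + 1) * (1 + E_i),
    where E_i is the average uncertainty of the term over the corpus."""
    N = len(docs)
    freqs = [d.get(term, 0) for d in docs]
    n_i = sum(freqs)  # total occurrences of the term in the corpus
    entropy = sum((f / n_i) * math.log(f / n_i) for f in freqs if f > 0) / math.log(N)
    return math.log(freqs[doc_index] + 1) * (1 + entropy)
```

Note the limiting behaviour described above: a term spread evenly over all documents has entropy -1 (and therefore weight 0), while a term confined to a single document has entropy 0.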

5.3 Feature Selection: Dimensionality Reduction

As mentioned in the previous section, most of the complexity is caused by the very high dimensionality of the feature matrix A. Hence, before the inference of a classifier's structure can begin, a process of reducing the dimensionality is commonly needed to resize the term-by-document matrix A into a far smaller one.

From another perspective, dimensionality reduction (DR) is also helpful in reducing the overfitting problem, that is, a classifier being tuned to the contingent rather than merely the constitutive characteristics of the training data [23]. In such cases the classifier will be very good at re-classifying the training data while being worse at recognising unseen data. Some earlier experimentation has shown that this problem can be eased if the size of the training data set is approximately proportional to the number of features used. Fuhr and Buckley [24] show that 50-100 training instances per feature would be adequate in text categorization tasks. Thus, if DR is performed, overfitting may be avoided even though fewer training instances are used, provided this ratio is maintained. However, we have to be aware that some potentially useful information, embedded in some rare terms, is at risk of being removed.

In [25] five feature selection methods are evaluated: Document Frequency Thresholding (DFT), Information Gain (InfoGain), the X2-statistic, Mutual Information, and Term Strength. Experiments by the authors found the first three to be the more effective. In this project we decided to use two methods, DFT and InfoGain, because the Information Gain and X2-statistic methods both arise from information-theoretic term selection functions.

5.3.1. Document Frequency Thresholding (DFT)

The document frequency of a term can be simply explained as the number of documents in which the term occurs. The terms in the training corpus will be removed if their document frequencies are less than a predetermined threshold. By this method, only the terms which have the highest numbers of occurrences are retained, and rare terms are ignored, since they are either non-informative for class prediction or not influential in global performance.
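Document frequency thresholding can be applied with a few lines of Python (a sketch; the threshold value and corpus are illustrative):

```python
def df_threshold(docs, threshold):
    """Keep only terms whose document frequency (the number of
    documents containing the term) reaches the threshold."""
    df = {}
    for doc in docs:
        for term in set(doc):          # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {t for t, n in df.items() if n >= threshold}

# Toy corpus of tokenized "reports"
corpus = [["lesion", "margin"], ["lesion", "biopsy"], ["lesion"]]
kept = df_threshold(corpus, threshold=2)   # terms occurring in >= 2 documents
```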

5.3.2. Information Gain (InfoGain)

Based on entropy theory, Information Gain [26] measures the number of bits of information obtained for category prediction by the occurrence of a word in a document. It is defined to be:

    IG(w) = - sum_j P(c_j) log P(c_j)
            + P(w) sum_j P(c_j | w) log P(c_j | w)
            + P(~w) sum_j P(c_j | ~w) log P(c_j | ~w)

where c_j denotes a possible category, P(c_j) denotes the fraction of documents in the total collection that belong to class c_j, and P(w) can be estimated from the fraction of documents in which the term w occurs. P(c_j | w) is the fraction of the documents containing term w that belong to category c_j, and P(c_j | ~w) can be computed as the fraction of the documents not containing term w that belong to category c_j.
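The definition can be computed directly from document counts. Below is a sketch (a simplified, assumed interface: `labels[k]` is the category of document k, and `has_term[k]` indicates whether term w occurs in it):

```python
import math

def info_gain(has_term, labels):
    """Information gain of a term for category prediction:
    IG(w) = -sum_j P(c_j) log P(c_j)
            + P(w)  sum_j P(c_j|w)  log P(c_j|w)
            + P(~w) sum_j P(c_j|~w) log P(c_j|~w)."""
    N = len(labels)
    cats = set(labels)

    def sum_p_log_p(subset):
        # sum_j P(c_j) log P(c_j) within the given subset of documents
        if not subset:
            return 0.0
        total = 0.0
        for c in cats:
            p = sum(1 for l in subset if l == c) / len(subset)
            if p > 0:
                total += p * math.log(p)
        return total

    p_w = sum(has_term) / N
    with_w = [l for h, l in zip(has_term, labels) if h]
    without_w = [l for h, l in zip(has_term, labels) if not h]
    return (-sum_p_log_p(labels)
            + p_w * sum_p_log_p(with_w)
            + (1 - p_w) * sum_p_log_p(without_w))
```

A term that perfectly predicts a two-class split gains the full class entropy, log 2; a term independent of the classes gains nothing.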

5.4 Classifier Construction

A pathology report from the database is assigned several codes; thus, the automatic coding system is supposed to do the same job: classify the report with a set of SNOMED codes. This multi-label task can be broken into disjoint binary classification problems, which classify the report in a code-by-code manner. To classify a new report, the system needs to apply all the binary classifiers and combine their predictions, finally producing the set of predictions as the system result.

There are several machine learners we can use to develop these classifiers, such as Decision Tree, Maximum Entropy, Support Vector Machine, Naive Bayes, and K-Nearest Neighbour. In what follows we describe the first three machine learners, which were used for this project.
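The code-by-code combination step can be sketched as follows (a toy illustration only; the keyword-based "classifiers" and the code values are hypothetical stand-ins for trained binary classifiers):

```python
def predict_codes(report, classifiers):
    """Apply every binary classifier and collect the codes whose
    classifier answers positively: the multi-label prediction."""
    return {code for code, clf in classifiers.items() if clf(report)}

# Hypothetical keyword-based binary "classifiers", one per code
classifiers = {
    "T-01000": lambda r: "skin" in r,
    "M-40000": lambda r: "inflammation" in r,
}
codes = predict_codes("chronic inflammation of skin", classifiers)
```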


5.4.1. Decision Tree (DT)

Figure 5-2. Decision Tree
http://en.wikipedia.org/wiki/Decision_tree_learning

In a Decision Tree classifier there are three components: internal nodes, branches and leaves. The internal nodes are labelled by terms; the branches depart from internal nodes and are labelled by the weights of terms; the leaves stand for categories. Typically, the weight of a node is measured by the value of its information gain. A DT classifies a document d by recursively evaluating the weights that the terms labelling the internal nodes have in the vector of d, until a leaf node is reached. The label of this leaf node is then assigned to d.

5.4.2. Maximum Entropy (MaxEnt)

Another machine learner used in this project is Maximum Entropy [27]. Its principle is a method for analysing the available qualitative information in order to arrive at a unique epistemic probability distribution. Based on this, there are two main tasks this technique needs to complete: the first is to determine a set of statistics that captures the behaviour of a random process; the second is then to iterate to an accurate model of the process. A prediction for a future output of the random process can then be safely generated by the model.

5.4.3. Support Vector Machine (SVM)

SVM has shown successful performance on a wide range of classification problems, as well as on text categorization [28]. A special property of SVM is that it minimizes the empirical classification error and maximizes the geometric margin in the same process.

Figure 5-3. Support Vector Machine class boundary estimation method.
http://en.wikipedia.org/wiki/Image:Svm_max_sep_hyperplane_with_margin.png

SVM is directly applicable only to binary classification tasks. As shown in Figure 5-3, given n data points, each of which belongs to one of two classes, the goal is to classify which category a new data point belongs to. In the case of support vector machines, we are interested in finding whether we can achieve the maximum separation (margin) between the two classes. We define the class c_i of a point x_i from the two-class set {1, -1}, and w stands for a normal vector of the margin. After we choose w and a constant b to maximize this margin (i.e. minimize ||w||), the class c_i of a point x_i is constrained by the formula:

    c_i (w . x_i + b) >= 1,  for all 1 <= i <= n



5.5 Evaluation Method

An important issue in the design of this experimental framework is how to measure the system's classification performance.

Since a common approach to multi-label categorization is to break the task into several disjoint binary categorization subtasks, for each category and each text, four quantities are employed to evaluate the performance of the classifiers:

Tp - True Positive: the number of documents correctly assigned to a class;

Fp - False Positive: the number of documents incorrectly assigned to a class;

Tn - True Negative: the number of documents correctly rejected from a class;

Fn - False Negative: the number of documents incorrectly rejected from a class.

From the definitions above, we define the following performance measures:

    Precision = Tp / (Tp + Fp)        Recall = Tp / (Tp + Fn)

Another commonly used evaluation criterion is the F-measure, which combines the two measures:

    F_beta = (beta^2 + 1) * Precision * Recall / (beta^2 * Precision + Recall)

where beta is a parameter allowing different weighting of recall and precision; 0.5, 1 and 2 are the candidate values, of which 1 is the most commonly used.

However, the F-measure is a metric for a single category. In order to measure the system performance, we have to average the performances of all the classifiers. There are two conventional methods, micro-averaging and macro-averaging; thus we have two kinds of F-measures, namely the micro-F and the macro-F.

Micro-F

To calculate the micro-F value, we count the sums of Tp, Fp and Fn over all classifiers, and use the sums to obtain the micro-averaged recall and precision. After that, the micro-F value can be calculated by the same formula as the F-measure. Micro-F treats every individual classification decision as equally important, so it is dominated by the majority classes.

Macro-F

To treat every category as equally important, we can use the macro-F, which gives rare classes the same influence as frequent ones. The macro-F is the average of the F-scores of all classes:

    Macro-F = (1 / N) * (F_1 + F_2 + ... + F_N)

where N stands for the total number of classes.


CHAPTER 6. THE FEATURE EXTRACTION

In this chapter we describe the set of methods used in the feature extraction process. Because any text categorization result is heavily affected by the text feature set the system uses, it is worthwhile to discuss what features are expected, as well as how the features are extracted.

6.1 Word Stemming

In English, one word can have many morphological forms. For example, a verb has several varieties due to tense or a third-person subject: the word "appear" can be varied as "appears" or "appeared". Even though different in grammatical form, these carry the same abstract notion of the word. We can easily tell that the word "appeared" is the word "appear" in past-tense form, because it is signalled by the segment "-ed", which is called a morphological suffix.

For example, the stem of the verb "stem" is "stem": it is the part that is common to all its inflected variants.

1. stem (infinitive)
2. stem (imperative)
3. stems (present, 3rd person, singular)
4. stem (present, other persons and/or plural)
5. stemmed (simple past)
6. stemmed (past participle)
7. stemming (progressive)

Obtaining the abstract notion of the word helps to reduce the number of attributes in the vector space; the process of stripping off the suffix is called stemming. The stem need not be identical to the morphological root of the word. It is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. However, stemming carries the risk that the grammatical information of a word will be lost.


The two most frequently used stemmers (i.e. stemming algorithms) are the Porter Stemmer and the Lancaster Stemmer. Many types of stemming algorithms have been created; they differ in performance and accuracy, and in the manner in which they overcome stemming obstacles.
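A crude suffix-stripping stemmer can illustrate the idea (a toy sketch only, far simpler than the Porter or Lancaster algorithms; the suffix list is illustrative):

```python
def simple_stem(word):
    """Strip a few common morphological suffixes.
    Not a real Porter/Lancaster implementation: a toy sketch only."""
    for suffix in ("ming", "med", "ing", "ed", "s"):
        # require at least 3 characters of stem to avoid over-stripping
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:len(word) - len(suffix)]
    return word

# All inflected variants of "stem" map to the same stem
variants = ["stem", "stems", "stemmed", "stemming"]
stems = [simple_stem(w) for w in variants]
```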


6.2

N
-
gram Tokenization

The full text is not the basic unit

used.

I
nstead, the text
is broken
into basic
units, and

which are sent

to

the

text categorization system. Such a basic unit
is called token. In another word
s
, a

token is a categorized block of text.

Th
e

process of segmen
ting
text

into
tokens

is known as tokenization
, which

is
a prelude to
almost
everything else needed in
text processing
.

The algorithm implementing a tokenizing strategy is called a tokenizer. There are many tokenizers available for use, such as the sentence-tokenizer, the whitespace-tokenizer, and the word-tokenizer. Taking the word-tokenizer as an example, it conveniently extracts each word from a text and wraps the word as a token. For instance, the text:

“This is a sentence (maybe a paragraph).”

will be tokenized as the tokens “This”, “is”, “a”, “sentence”, “maybe”, “a” and “paragraph”.
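A minimal word-tokenizer of this kind can be sketched with a regular expression that keeps runs of letters and drops punctuation (a simplification of what real tokenizers do):

```python
import re

def word_tokenize(text):
    """Extract each word from the text as a token; punctuation is dropped."""
    return re.findall(r"[A-Za-z']+", text)

tokens = word_tokenize("This is a sentence (maybe a paragraph).")
print(tokens)
# -> ['This', 'is', 'a', 'sentence', 'maybe', 'a', 'paragraph']
```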


This tokenizer simply extracts each single word into the basic unit, and the result provides the words present in the text. Usually the frequencies of each word are counted as well.

However, two kinds of information are lost: the relationship between two words is ignored, and some multi-word expressions, like entity names, which are composed of several words, are broken down. For example, the entity name “New South Wales”, which stands for a state in Australia, is tokenized as three units “New”, “South” and “Wales”, the meaning of the entity being lost.

If we consider tokenizing the words in pairs, or bigrams, the two-word-length phrases and entity names will be kept, such as “get off” and “heart attack”. This kind of tokenization strategy is known as Bigrams, while the strategy we discussed above is called Unigram.
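Generating bigrams from a unigram token list is a one-liner: pair each token with its successor.

```python
def bigrams(tokens):
    """Pair each token with its successor, keeping two-word phrases intact."""
    return list(zip(tokens, tokens[1:]))

print(bigrams(["New", "South", "Wales"]))
# -> [('New', 'South'), ('South', 'Wales')]
```

Note that a three-word entity like “New South Wales” is still split across two bigrams; keeping it whole would require trigrams or named-entity recognition.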

6.3 Stopwords Exclusion

By Zipf's Law [29], the words with the highest frequency are mostly function words, such as “the”, “of”, and “and”, which usually are conjunctions, prepositions, determiners, or pronouns. These words can be found in almost any genre of documents of reasonable length.

These high-frequency words are generally not related to the content of a document, and contribute little value in text categorization. By convention, these words are called stopwords and are excluded from any analysis. We need to remove these stopwords before computing the text vector representation.

A traditional way to find stopwords is to rank the words in descending frequency order, and set a threshold to gather the words with the highest frequency from the list. The threshold needs to be set accurately: if it is too high, very few words will make it into the stopwords list; if too low, an excessive number of words will be added to it.
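The frequency-ranking approach just described can be sketched with Python's Counter (the threshold value here is purely illustrative):

```python
from collections import Counter

def top_frequency_words(tokens, threshold):
    """Rank words by descending frequency and return those at or above the
    threshold as candidate stopwords."""
    counts = Counter(tokens)
    return [word for word, freq in counts.most_common() if freq >= threshold]

tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "on"]
print(top_frequency_words(tokens, 2))
# -> ['the', 'on']
```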

An NLP toolkit we adopted in this project provides a 570-word collection of stopwords. Below is a partial list of the English stopwords:

a a's able about
above according accordingly across actually after
afterwards again against ain't all allow allows almost alone along already
also although always am among…

…would wouldn't x y yes yet you you'd
you'll you're you've your yours yourself yourselves z zero
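Once such a list is in hand, excluding stopwords from a token stream is a simple set-membership filter (the small set below is a stand-in for the full 570-word list):

```python
# Stand-in for the full stopword list used in the project.
STOPWORDS = {"a", "about", "above", "is", "the", "of", "and"}

def remove_stopwords(tokens):
    """Drop any token that appears in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["This", "is", "a", "sentence"]))
# -> ['This', 'sentence']
```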

6.4 Negation and Concept Detection

Crammer et al. [9] have shown that by paying attention to negations during the text categorization process, the system gains a higher performance. Hence, TTSCT was used to identify the negation descriptions of symptoms; in addition, the SNOMED concepts in the report were detected at the same time.
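As a rough illustration only (TTSCT itself is far more sophisticated), a minimal negation-phrase detector could be sketched as a pattern match for a negation cue followed by a term; the cue words chosen here are illustrative assumptions:

```python
import re

# Toy list of negation cues; real systems use curated trigger lexicons.
NEGATION_PATTERN = re.compile(
    r"\b(no|denies|without|negative for)\s+(\w+)", re.IGNORECASE
)

def find_negated_terms(text):
    """Return the terms asserted as absent by a preceding negation phrase."""
    return [term for _, term in NEGATION_PATTERN.findall(text)]

print(find_negated_terms("Patient denies fever. No headache reported."))
# -> ['fever', 'headache']
```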

In this process, two kinds of negation concepts are identified: pre-coordinated concepts, and concepts which have been asserted explicitly as negative by negation phrases. In the first case, a description like “no headache” will be detected because “no headache” is a term which has been collected in the Clinical Finding Absent category of SNOMED CT, which holds the terms indicating the absence of findings and diseases. In the other case, the SNOMED concept id is assigned to each medical term at