

Biomedical Text Information Extraction across Domains

Hao Wang, Eduard Hovy, Tommy Ingulfsen, and Gully Burns
USC/ISI


1. Introduction

1.1. Background

Despite much promise, the field of biomedical text mining has yet to deliver truly practical tools that help scientists with the process of making sense of the mountain of textual information they must now process and synthesize (Rebholz-Schuhmann, 2005; Cohen, 2008). A substantial issue that must be confronted is that, in general, any Natural Language Processing (NLP) application is highly context-dependent: typically, applying NLP techniques and tools in a new domain requires that a substantial amount of preparation be invested before the technology is effective in that domain. With Information Extraction (IE) technology, this preparation might take the form of developing new hand-crafted linguistic patterns or preparing new sets of training data from the new domain [REF].



Our research focuses on the development of natural language processing (NLP) software for information extraction (IE) from biomedical research articles. Given a particular domain, such as tract tracing of neuronal pathways in the rat brain, our software ingests research articles from various journals and other sources. Guided by a domain specialist, the software learns which kinds of information in the articles are desired (for example, in the tract-tracing domain, the brain regions into which tracer chemicals are introduced and the particular chemicals employed), and then proceeds to extract and format this information from all available texts, storing the results in a database. The researcher can then access the database and use its contents to locate desired papers, to discover long-term trends over a series of years, to search for lacunae within past research, and so on.
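This overall workflow can be pictured as a small pipeline. The sketch below is illustrative only (the class names, the sqlite schema, and the extract_records stub are our own inventions, not the actual system): it ingests article text, runs an extractor over it, and stores the resulting records in a queryable database.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Record:
    article_id: str
    field: str          # e.g. "labelingLocation" or "labelingDescription"
    text: str           # the extracted phrase

def extract_records(article_id: str, text: str) -> list[Record]:
    # Placeholder: in the real system, the trained IE model (Section 2.6)
    # would produce the records for this article.
    return []

def ingest(articles: dict[str, str], db_path: str = "extractions.db") -> None:
    """Run extraction over a set of articles and store the results."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS records(article_id, field, text)")
    for article_id, text in articles.items():
        for r in extract_records(article_id, text):
            con.execute("INSERT INTO records VALUES (?, ?, ?)",
                        (r.article_id, r.field, r.text))
    con.commit()

# The researcher can then query the database, for example:
#   SELECT text FROM records WHERE field = 'labelingLocation'
```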

We have developed various components of the software to date; see (Burns, Feng, Ingulfsen, & Hovy, 2007) [REFS]. In general, the IE system architecture is as follows: [***DESCRIBE***, with flowchart].







1.2. The Problem

Though it may be helpful to researchers, the IE software is still rather limited. The CRF mechanism learns to recognize the presence of desired information by combinations of words and other features that are characteristic of it. (For example, the names of desired chemicals are learned, as well as phrase patterns such as “we injected <noun-phrase> into”.) Typically, and quite naturally, these features are rather specific not only to the domain, but also to the text genre (research-article-type language) and even the actual journal (article formatting, style, typical section heading conventions, etc.).

This specificity limits the applicability of the software, increasingly so the more it is trained to a specific domain and journal. But the natural response, limiting training, results in lower performance (both lower correctness, or Precision, and lower coverage, or Recall). Given this tradeoff, one is led to the following questions:



- With two similar but different neuroscience subdomains (e.g., tract tracing and activation), is it feasible to label texts in one (the target subdomain) using a machine learning model that was learned from the other (the source)? How good is performance, and what is the typical performance drop?



- What kinds of fields are susceptible to transfer across subdomains? What knowledge is learned in the source domain, what part of it is carried over to the target, and why?



- How can one improve performance on the aspects that are not carried over? If one has to manually train the machine learning model on the target subdomain, how much of the source can/should be carried over to save manual effort while still achieving optimal performance? And how much effort/text is required for teaching the target?



- Does the quality or diversity of training materials in the source (and the target) subdomain matter?

We explore these questions in this paper. After describing the two subdomains and the training regime in Section 2, we provide the results of a series of carryover experiments in Section 3.

2. Domains, Materials, and Methods

2.3. The Source Domain

Tract-tracing experiments (TTE) are often used to study the interconnectivity of the brain. The general method is to inject tracer chemicals into one region of the brain and then identify the corresponding regions to which the tracer has been transported (the so-called labeled regions), as well as to analyze how much transport (labeling) has occurred. Our previous work (Feng, Burns, & Hovy, 2007) reported that key information such as the tracer chemicals, injection locations, and labeling locations of a TTE can be successfully extracted by a machine learning model using Conditional Random Fields (CRFs).





In that work, 51 TTE articles from the Journal of Comparative Neurology (JCN) (from 1982 to 1995) were selected as materials for the source domain. The mean number of sentences per article was 115. TTE articles describe the specific technique used (labeling description), the brain regions that were identified (labeling location), and other reference brain regions (neuroanatomical location). On average, there were 95 labeling locations, 74 labeling descriptions, and 10 neuroanatomical locations in each article. Figure 1 shows the annotations of a paragraph in one of the articles.


Figure 1. Annotations of a paragraph in a TTE article (red: labeling location; fuchsia: labeling description; yellow: neuroanatomical location).

As with many science articles, a typical TTE journal article includes Introduction, Method, Results, and Discussion sections. One or more TTEs may be reported in a single article. The key elements of TTEs are most often found in the Results section. Therefore, we included in our experiments only the text from the Results section of each article.

For the work reported in this article, we used the same materials as in (Feng, Burns, & Hovy, 2007).

2.4. The Target Domain

Besides tract tracing, another method commonly used by neuroscientists to determine the connectivity of brain regions is to present a stimulus, such as a threat, to the subject animals and then to observe the regions of the brain that become activated. We refer to experiments that employ this technique as activation experiments (AE) in this paper.

Similar to TTE articles, AE articles also contain labeling descriptions, labeling locations, and neuroanatomical locations.

We selected 24 AE articles from the journal Brain Research (from 2000 to 2001) as materials for the target domain. The mean number of sentences per article is 44. On average, there are 27 labeling locations, 20 labeling descriptions, and 7 neuroanatomical locations per article.

As described in Section 2.9, these articles were first tagged by a CRF model that had been trained on the 51 TTE articles. After performance was measured, the errors were corrected by a domain expert following the same annotation guidelines as for the TTE papers. As with the TTE articles, only the Results section of each article was used in our experiments. Figure 2 shows an example of tagging in the AE domain.






Figure 2. Annotations of a paragraph in an AE article (color code is the same as in Figure 1).

2.5. Annotation Guidelines

Three kinds of labels were tagged in both TTE and AE articles. A neuroanatomicalLocation is a noun phrase describing a place in the brain. Some neuroanatomicalLocations are also labelingLocations (for example, “the hypothalamus” in Figure 1): a labelingLocation is a noun phrase describing a specific region where labeling occurs. Typically, this phrase names a specific brain region with topographic modifiers (for example, “the stria terminalis” in Figure 1). A labelingDescription is a phrase (which can be quite complex and unstructured) that describes qualities of the labeling (e.g., density and presence of boutons) or types of labeling (e.g., cells/fibers) (for example, “increases in c-fos and NGFI-A” in Figure 2).

2.6. The IE Model

Our task is to identify and label relevant brain regions and descriptions in sentences automatically. We employ Conditional Random Fields (Lafferty, McCallum, & Pereira, 2001), a probabilistic framework for labeling and segmenting sequential data, which fits our need very well. It has had many applications in bioinformatics, such as protein name recognition in biology abstracts (Settles, 2005), RNA structural alignment (Sato & Sakakibara, 2005), protein structure prediction (Liu, Carbonell, Weigele, & Gopalakrishnan, 2005), and brain region recognition in the neuroscience literature (French, Lane, Xu, & Pavlidis, 2009). A conditional random field (CRF) is similar to a Hidden Markov Model (HMM), with two main advantages: 1) the independence assumption required by HMMs to ensure tractable inference is relaxed in CRFs because of their conditional nature; 2) the label bias problem, a weakness of maximum entropy Markov models and other conditional Markov models based on directed graphical models, is avoided in CRFs.

A linear-chain CRF was used in our experiments, implemented with the Mallet software toolkit (McCallum, 2002). Like any supervised machine learning method, a CRF model needs to be trained on labeled materials; it can then be applied to label new materials. In our experiments, each sentence in an article was segmented into word tokens. The various features described in the next section were attached to each token. Finally, all the tokens were fed to the CRF sequentially. The resulting model was saved to a file for labeling other materials.
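To make the training/labeling loop concrete, here is a minimal sketch using the sklearn-crfsuite package as a stand-in for Mallet (the actual system used Mallet; the toy feature dictionaries, label names, and file name below are invented for illustration). Each sentence is represented as a list of per-token feature dictionaries paired with a list of gold labels.

```python
import joblib
import sklearn_crfsuite

# One (toy) training sentence: per-token feature dicts and the corresponding labels.
# The real features are described in Section 2.7.
X_train = [[{"word": "the"},
            {"word": "hypothalamus", "in_brain_lexicon": True}]]
y_train = [["O", "labelingLocation"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)                    # train on the labeled sentences
joblib.dump(crf, "tte_model.joblib")         # save the model for labeling other materials

# Later: reload the saved model and label new (e.g. target-domain) sentences.
crf = joblib.load("tte_model.joblib")
predicted = crf.predict(X_train)             # -> one label sequence per sentence
```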

2.7. Features used by the CRF Model

From Feng et al. (2007), we know that several features are informative for labeling with a CRF model. The same set of features is adopted in this study. Each word token in an article was labeled with the corresponding features before being fed to the model. The features are explained individually below.

2.5.1 Domain-specific lexicon

The lexicon includes 1123 names and abbreviations of brain structures taken from brain atlases (Swanson, 2004), 30 standard terms denoting neuroanatomical topographical relationships (e.g., rostral), 9 names and abbreviations of the tracer chemicals used in a TTE (e.g., PHAL), and 17 common descriptive words for labeling (e.g., dense and sparse, axonal and dendritic).

2.5.2 Surface word and root form

Words may have different forms in English (e.g., plural or past tense). Both the original word and its root form (if different) were added as features.

2.5.3 Window words

The neighboring words are of great help in a sequence labeling task. The previous and next single words are provided as features to the current word token.

2.5.4 Syntactic features

Syntactic features such as sentence subject and object were added to a word token if it was recognized by the syntactic parser Minipar (Lin, 1998) as filling one of those syntactic roles. All the words that have a dependency relation to the governing verb were given the feature “govern-verb token”.

2.5.5 Context window

For TTE, the context words injection and deposit, and the various forms of these two words, are an indicator for their surrounding words. Therefore, the previous and next four words around these context words were given an injection-context feature in TTE articles.
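As an illustration of the per-token features above, here is a minimal sketch (the lexicon entries, feature names, and crude suffix-stripping "root form" are placeholders, and the Minipar-based syntactic features of 2.5.4 are omitted):

```python
# Assumed, simplified lexicons (the real ones come from Swanson (2004) and the TTE protocol).
BRAIN_STRUCTURES = {"hypothalamus", "thalamus", "amygdala"}
TOPOGRAPHIC_TERMS = {"rostral", "caudal", "dorsal"}
TRACERS = {"phal", "bda"}
LABELING_WORDS = {"dense", "sparse", "axonal", "dendritic"}
CONTEXT_TRIGGERS = {"injection", "injections", "injected", "deposit", "deposits", "deposited"}

def root_form(word: str) -> str:
    """Crude stand-in for real lemmatization."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def token_features(tokens: list[str], i: int) -> dict:
    w = tokens[i].lower()
    feats = {
        "word": w,                                   # surface form (2.5.2)
        "root": root_form(w),                        # root form (2.5.2)
        "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",               # window words (2.5.3)
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>",
        "in_brain_lexicon": w in BRAIN_STRUCTURES,   # domain lexicon (2.5.1)
        "is_topographic": w in TOPOGRAPHIC_TERMS,
        "is_tracer": w in TRACERS,
        "is_labeling_word": w in LABELING_WORDS,
    }
    # Injection context window (2.5.5): within 4 tokens of an "injection"/"deposit" form.
    window = tokens[max(0, i - 4): i] + tokens[i + 1: i + 5]
    feats["injection_context"] = any(t.lower() in CONTEXT_TRIGGERS for t in window)
    return feats

def sentence_features(tokens: list[str]) -> list[dict]:
    return [token_features(tokens, i) for i in range(len(tokens))]
```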

2.8. Evaluation

To evaluate the labeling results in both the source and target domains, the labels produced for each word token by the system were compared to the gold-standard labels produced by the domain expert annotator. Then the three usual IE evaluation metrics (precision, recall, and F1) were computed for each article. Precision indicates how many of the units predicted by the system are correct. Recall is the proportion of units correctly predicted by the system out of the total number of units in the gold standard. F1 is the harmonic mean of Precision and Recall, which (more than linearly) penalizes bad performance of either, and thus gives a comprehensive score for the system’s labeling performance.
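In symbols, with TP, FP, and FN the counts of correctly predicted, spuriously predicted, and missed units:

\[
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2\,P\,R}{P + R}
\]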








Sometimes a labeling description may contain many word tokens while a labeling location may contain only one or two. To avoid any bias caused by the length of the fields, we also measured the system’s performance on the basis of labeled fields. In many cases, the system’s labeling only partially overlaps with the gold standard. In one set of statistics, we count only complete overlaps, where the system produces exactly the same labeling as the gold standard. In another, more lenient, set, we also count each partial overlap as a hit: if a predicted field has any overlap with a field in the gold standard, it is counted. We therefore report three kinds of scores: token-based, field-based complete overlap, and field-based partial overlap.
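The sketch below shows one way of computing the field-based scores (the span convention and function names are our own, introduced only to make the complete-overlap versus partial-overlap distinction concrete). A field is taken to be a maximal run of tokens carrying the same non-O label; token-based scores simply treat every labeled token as its own unit.

```python
def fields(labels: list[str]) -> list[tuple[str, int, int]]:
    """Maximal spans (label, start, end_exclusive) of identical non-O labels."""
    spans, start = [], None
    for i, lab in enumerate(labels + ["O"]):          # sentinel flushes the final span
        if start is not None and lab != labels[start]:
            spans.append((labels[start], start, i))
            start = None
        if lab != "O" and start is None:
            start = i
    return spans

def field_scores(gold: list[str], pred: list[str], partial: bool = False):
    """Field-based precision/recall/F1; partial=True counts any same-label overlap as a hit."""
    g, p = fields(gold), fields(pred)

    def hit(span, others):
        if not partial:
            return span in others                     # exact boundaries and label
        lab, s, e = span
        return any(lab == lab2 and s < e2 and s2 < e for lab2, s2, e2 in others)

    tp_p = sum(hit(span, g) for span in p)            # predicted fields that match gold
    tp_r = sum(hit(span, p) for span in g)            # gold fields that are found
    precision = tp_p / len(p) if p else 0.0
    recall = tp_r / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```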

2.9. Experimental Procedure

First, a leave-one-out cross-validation was performed on all the articles in the source domain to verify the source materials. A CRF model was trained with 50 of the 51 TTE articles and tested on the remaining article. This allowed us to identify problematic articles, and to train the system on different numbers of ‘good’ ones.
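A sketch of this validation loop, assuming a token-level F1 as in Section 2.8 and a small wrapper around the CRF trainer (the names and the sklearn-crfsuite stand-in are again illustrative):

```python
import sklearn_crfsuite

def train_crf(X, y):
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X, y)
    return crf

def token_f1(gold: list[str], pred: list[str]) -> float:
    """Micro-averaged token-based F1 over the non-O labels."""
    tp = sum(g == p != "O" for g, p in zip(gold, pred))
    prec = tp / max(1, sum(p != "O" for p in pred))
    rec = tp / max(1, sum(g != "O" for g in gold))
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def leave_one_out(articles):
    """articles: one (X, y) pair per TTE article, where X and y are per-sentence lists."""
    flagged = []
    for i, (X_test, y_test) in enumerate(articles):
        rest = [a for j, a in enumerate(articles) if j != i]
        X_train = [sent for X, _ in rest for sent in X]
        y_train = [sent for _, y in rest for sent in y]
        model = train_crf(X_train, y_train)
        y_pred = model.predict(X_test)
        gold = [lab for sent in y_test for lab in sent]
        pred = [lab for sent in y_pred for lab in sent]
        if token_f1(gold, pred) <= 0.5:      # the threshold used in Section 3.1
            flagged.append(i)
    return flagged
```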

Second, we trained and tested a CRF model on the source materials to see how the model performs on the source domain. Five articles with an F1 equal to or lower than 0.5 in the previous experiment were excluded from this experiment. The remaining 46 TTE articles in the source domain were divided into a training set of 15 articles and a test set of 31 articles. 5, 10, and 15 articles from the training set were used to train three models, respectively. The three models were then separately tested on the test set of 31 TTE articles. Performance is reported in Table 1.

Third, to see how the training materials affect labeling in the target domain, three CRF models were trained with all 51 TTE articles, the 21 TTE articles with the best F1 scores in the initial cross-validation experiment, and the 11 TTE articles with the best F1 scores, respectively. 15 of the 24 AE articles were set aside for later training; the remaining nine AE articles were used for testing the three CRF models.

Fourth, each of the three models mentioned immediately above was additionally trained with 5, 10, and 15 of the 15 AE articles included in the training set. After this continued training, they were tested again on the same test set of nine AE articles. For comparison, we also trained another three models with only 5, 10, or 15 AE files from the target domain alone, to see how much the pre-training with TTE articles helps.





3. Results

3.1 Validating Source Materials

In the leave-one-out cross-validation experiment on the source domain, the mean token-based F1 for the 51 TTE articles is 0.65 (precision = 0.71, recall = 0.62). The mean scores are satisfactory, but we observed a large variation among the articles, with F1 scores ranging from 0.38 to 0.86. Some articles have very low scores, which may be because their writing style or lexicon is distinctive from the others. In particular, five articles had an F1 score of 0.5 or below.

3.2 Testing on the Source Domain

The three models that were trained with 5, 10, or 15 TTE articles were tested on the remaining 31 TTE articles. As shown in Table 1, they all produced acceptable results. There is a small gain in F1 when five more TTE articles are added to the training set of the first model. The overall trend is that, as more articles are used in training, recall increases while precision drops. In general, good labeling in the same domain can be achieved with a surprisingly small set of articles.

Table 1. Mean scores for models both trained and tested on the source domain.

# of TTEs      Field-based full overlap      Field-based partial overlap   Token-based
in training    Precision  Recall  F1         Precision  Recall  F1         Precision  Recall  F1
5              0.89       0.61    0.72       0.92       0.64    0.75       0.93       0.73    0.82
10             0.85       0.68    0.75       0.89       0.71    0.79       0.91       0.79    0.84
15             0.82       0.70    0.75       0.87       0.75    0.80       0.88       0.81    0.84


3.3 Applying Directly to the Target Domain

Now we know that CRF models can accurately identify labels in the TTE domain. Another three models were trained with all 51 TTE articles, the best 21 TTE articles, and the best 11 TTE articles, respectively. These were all tested on the same set of 9 AE articles. Surprisingly, there is almost no difference between the scores for the three models (Table 2). They all have extremely high precision but only moderate recall, which suggests that those models made exceptionally accurate predictions but also missed a large number of labels.

Table 2. Mean scores for models trained on the source domain and tested on the target domain.

Training set   Field-based full overlap      Field-based partial overlap   Token-based
               Precision  Recall  F1         Precision  Recall  F1         Precision  Recall  F1
51 TTE         1.00       0.50    0.65       1.00       0.50    0.65       1.00       0.50    0.63
best 21 TTE    1.00       0.51    0.65       1.00       0.51    0.65       1.00       0.50    0.63
best 11 TTE    0.99       0.50    0.64       1.00       0.50    0.64       1.00       0.50    0.63


Examining the confusion matrices of the labeling results, it becomes clear that the models did very well on labelingLocations. In contrast, they made no predictions at all for labelingDescription, as demonstrated in Figure 3.



Target/Predicted    O     LD    LL    NAL
O                   -     0     0     0
LD                  121   0     0     0
LL                  0     0     70    0
NAL                 93    0     0     0

Target/Predicted    O     LD    LL    NAL
O                   -     0     0     0
LD                  168   0     0     0
LL                  0     0     221   0
NAL                 225   0     0     0

Figure 3. Confusion matrices of labeling for two AE articles (O: the default label; LD: labelingDescription; LL: labelingLocation; NAL: neuroanatomicalLocation).

3.4 Adding Knowledge from the Target Domain

The models’ performance on the target domain suggests that what they learned about labeling descriptions from the source domain did not transfer well to the target domain. The question then is how to incorporate knowledge of labeling descriptions in the target domain into the model, since it is not available in the source materials. One solution is to continue training the model with some AE articles. Following on from the previous experiment, the three models were additionally trained with 5, 10, and 15 AE articles. The resulting scores, together with those from Table 2, are combined in Table 3 to provide a comprehensive view of performance under the various conditions. There is a noticeable improvement in performance when the original models are trained with just five AE articles, and the more AE articles used in training, the better the F1 score.

Table 3. Mean scores for models that were further trained with AE articles (in each cell, the numbers are field-based full overlap F1 / field-based partial overlap F1 / token-based F1).

Number of AEs in training          0                5                10               15
Model trained with 51 TTEs         0.65/0.65/0.63   0.57/0.71/0.68   0.63/0.76/0.73   0.65/0.79/0.76
Model trained with best 21 TTEs    0.65/0.65/0.63   0.57/0.71/0.68   0.63/0.76/0.75   0.64/0.77/0.74
Model trained with best 11 TTEs    0.64/0.64/0.63   0.57/0.70/0.68   0.64/0.76/0.75   0.67/0.80/0.77
Model trained with no TTE          -                0.61/0.66/0.64   0.64/0.70/0.68   0.65/0.77/0.73


In Table 3, the bottom line includes three models that were trained only with AE articles. Their F1 scores are slightly lower than those of the models that were trained with additional TTE articles, although their performance is reasonably good.

Figure 4 shows the confusion matrices for the same two AE articles as in Figure 3, this time labeled by a model trained with 51 TTE articles and 15 AE articles. It is apparent that the model has improved considerably at predicting labelingDescriptions while maintaining its exceptional performance on labelingLocations.


Target/Predicted    O     LD    LL    NAL
O                   -     8     0     11
LD                  64    57    0     0
LL                  0     0     70    0
NAL                 93    0     0     0

Target/Predicted    O     LD    LL    NAL
O                   -     89    0     1
LD                  63    105   0     0
LL                  0     0     221   0
NAL                 198   6     0     21

Figure 4. Confusion matrices for two AE articles after further training with 15 AE articles.

Tables 4 and 5 give a comprehensive view of the scores for labelingLocation and neuroanatomicalLocation, and for labelingDescription, respectively.

Table 4. Scores for models that were continually trained with AE articles (labelingLocation and neuroanatomicalLocation). (The three numbers in each cell are field-based complete overlap F1, field-based partial overlap F1, and token-based F1, respectively.)

Number of AE files in training        0            5       10      15
Model trained with 51 TTE             xx/xx/0.63   0.68    0.73    0.76
Model trained with the best 21 TTE    0.63         0.68    0.75    0.74
Model trained with the best 11 TTE    0.63         0.68    0.75    0.77
Model trained with no TTE             -            0.64    0.68    0.73


Table 5. Scores for models that were continually trained with AE articles (labelingDescription only). (The three numbers in each cell are field-based complete overlap F1, field-based partial overlap F1, and token-based F1, respectively.)

Number of AE files in training        0    5    10    15
Model trained with 51 TTE
Model trained with the best 21 TTE
Model trained with the best 11 TTE
Model trained with no TTE             -






4. Discussion

Table 2 shows that the models did very well on labelingLocations: that information transferred from the TTE to the AE domain without problems. This is to be expected, since the names of brain regions stay more or less the same in the two (closely related) domains.

Table 3 and Figure 4 show that even a relatively small amount of training data in the new domain (AE) significantly improved performance on labelingDescriptions, without hurting performance on labelingLocations in any way.

Why do the models behave differently on these two kinds of labels? The reason might be that the words and contexts for labeling locations are very similar between the two domains, whereas labeling descriptions have fewer words and phrases in common between the two domains than labeling locations do. We computed the overlap of word tokens for labeling location and labeling description between the 51 TTE articles and the 24 AE articles (see the sketch below). 83% of the word tokens occurring in a labeling location field in the AE articles also occur in the TTE articles. In contrast, the word tokens in labeling descriptions in the AE articles have only a 57% overlap with the TTE articles. This suggests that the ability to carry knowledge from the source to the target domain is (at least partially) due to the similarity of word tokens in the corresponding fields of the two domains.
[***THIS IS VERY NICE INFO TO INCLUDE. IT HELPS INDICATE HOW DIFFERENT THE TWO DOMAINS ARE; IT PROVIDES SOME MEASURES OF ‘DIFFERENCE’. ARE THERE ANY SIMILAR MEASUREMENTS WE CAN MAKE FOR NAL, TO EXPLAIN WHY THINGS ARE SO POOR THERE?***]
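A sketch of that overlap measurement (the function name and the token-set representation are illustrative, not the actual analysis code):

```python
def field_token_overlap(ae_field_tokens: set[str], tte_tokens: set[str]) -> float:
    """Fraction of distinct word tokens from a given AE field that also occur in the TTE corpus."""
    if not ae_field_tokens:
        return 0.0
    return len(ae_field_tokens & tte_tokens) / len(ae_field_tokens)

# ae_loc_tokens:  tokens inside labelingLocation annotations in the 24 AE articles
# ae_desc_tokens: tokens inside labelingDescription annotations in the 24 AE articles
# tte_tokens:     all word tokens from the 51 TTE articles
# field_token_overlap(ae_loc_tokens, tte_tokens)   # roughly 0.83 in our data
# field_token_overlap(ae_desc_tokens, tte_tokens)  # roughly 0.57 in our data
```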

Comparing Tables 2 and 3, it is clear that including more target-domain training material helps, while the relative amount of source-domain training material seems almost irrelevant. But the imbalance between source and target training material is important: too much source material hurts. In Table 3, reading across the columns labeled 5, 10, and 15 shows the improvement as the percentage of training material from the source decreases.

The puzzle is with neuroanatomicalLocations, which did poorly in both cases, whether target-domain material was present for training or not. The bottom-right cell shows that even with zero TTE and all AE training material, the model still achieves nowhere near the performance for the target domain that it does for the source domain. ***THIS IS EITHER DUE TO NAL SIMPLY BEING IMPOSSIBLE IN THE TARGET, BUT NOT THE SOURCE, DOMAIN, OR DUE TO 15 BEING TOO LITTLE MATERIAL. BUT SINCE THE TTE DOMAIN ACHIEVED A VERY HIGH OVERALL F-SCORE WITH AS LITTLE AS 11 TEXTS OF TRAINING MATERIAL, ONE WOULD HAVE TO CONCLUDE THE FORMER EXPLANATION.***

[***HAO, HOW DID NAL DO IN THE SOURCE DOMAIN ALONE? THERE IS NO CONFUSION MATRIX FOR SECTION 3.2, SO WE CAN’T SEE IF THE NAL IS JUST HARD, EVEN ON SOURCE MATERIAL. I SUSPECT IT IS***]

Overall, then, we draw the following conclusions. For at least the two domains of TTE and AE, and probably for others that are as closely related as these are: when the same fields are sought in both domains and the basic nomenclature is roughly the same, as little as a dozen texts of training material from one domain will assist with transfer to the other. But it is valuable to also include more than 5 texts from the target domain in training, in order to identify the differences between the domains and instruct the system on how to behave in those cases. It is important, however, to perform field-by-field analysis, as in the confusion matrices of Figures 3 and 4, in order to ascertain whether any specific field is causing problems.



References

Burns, G., Feng, D., Ingulfsen, T., & Hovy, E. (2007). Infrastructure for Annotation-Driven Information Extraction from the Primary Scientific Literature: Principles and Practice. In Proceedings of the 2007 IEEE Congress on Services.

Feng, D., Burns, G., & Hovy, E. (2007). Extracting Data Records from Unstructured Biomedical Full Text. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Association for Computational Linguistics.

French, L., Lane, S., Xu, L., & Pavlidis, P. (2009). Automated recognition of brain region mentions in neuroscience literature. Frontiers in Neuroscience.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the International Conference on Machine Learning (ICML-2001).

Lin, D. (1998). Dependency-based evaluation of MINIPAR. In Proceedings of the Workshop on the Evaluation of Parsing Systems, First International Conference on Language Resources and Evaluation.

Liu, Y., Carbonell, J., Weigele, P., & Gopalakrishnan, V. (2005). Segmentation conditional random fields (SCRFs): A new approach for protein fold recognition. In Proceedings of the ACM International Conference on Research in Computational Molecular Biology (RECOMB05).

McCallum, A. K. (2002). MALLET: A Machine Learning for Language Toolkit.

Sato, K., & Sakakibara, Y. (2005). RNA secondary structural alignment with conditional random fields. Bioinformatics, 21(suppl_2), ii237-ii242.

Settles, B. (2005). ABNER: an open source tool for automatically tagging genes, proteins, and other entity names in text. Bioinformatics, 21(14), 3191-3192.

Swanson, L. W. (2004). Brain Maps: Structure of the Rat Brain (3rd ed.). Elsevier Academic Press.