

PRISEAnnotate


Capstone Project Report


Nicholas Spence

Institute of Technology

University of Washington Tacoma

nspence@uw.edu

Committee

Steve Hanks

Martine De Cock

Ankur Teredesai

June 10, 2010


Abstract

The Protein Interaction Search Engine (PRISE, http://prise.insttech.washington.edu) is a biomedical text mining system built to help researchers find interactions between proteins described in the abstracts of biomedical articles from the PubMed [1] database. PRISE learns to detect these protein interactions by looking at training examples, i.e. sentences in which human annotators have previously indicated protein interactions. Collecting training examples and carefully selecting which ones to use in the learning process is key to the success of PRISE.

In this project, we developed PRISEAnnotate, a module within PRISE that allows the users of the system to provide useful feedback in the form of annotations. Through experimentation, we have found that choosing sentences similar to ones that have been misclassified by PRISE is more effective than adding training examples at random. This allows PRISE to be trained using fewer examples, which is important because human experts must provide these examples. This annotation tool was created to gather the needed information about each sentence chosen as a potential training example. We devised and implemented a method to determine a consensus among multiple end users and explored what constitutes similarity between one sentence and another.



Contents

Introduction
Annotations
Choosing Sentences for Annotation
Experimental Results and Analysis
Educational Statement
Conclusion
Future Research
References



Introduction


Protein-protein interactions are important for most, if not all, biological functions and are of central importance for virtually every process in a living cell. Knowing which proteins interact, and in which ways, is vital to biomedical research. Protein interactions are documented across thousands of research papers that have been published on PubMed [1]. However, this still presents a problem to scientists looking for every way one protein may interact with another. This is the purpose of PRISE: to process the papers on PubMed and automatically identify and store the documented protein interactions contained in the papers.

This task is accomplished through a series of modules, shown in Figure 1. The Query Module checks daily for new papers on PubMed and downloads the abstracts. These are passed along to the Filter Module, which filters out irrelevant abstracts. Next, the PR (Protein Recognition) Module detects whether the sentences within the abstract contain any proteins. Then the PID (Protein Interaction Detection) Module determines whether any of the identified proteins actually interact with each other by the use of a classifier. The results are stored in a database, which is accessible by the web interface to allow researchers to search through the findings. The Annotator Interface, which is the focus of this project, is part of the web interface.

Figure 1: PRISE System Overview
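As a rough illustration of the data flow in Figure 1, the following Python sketch chains stand-in stubs for each module. Every function name and heuristic here is a hypothetical placeholder, not the real PRISE implementation.

def query_module():
    # Stands in for the daily PubMed query; returns canned abstracts.
    return ["ProteinA binds ProteinB in yeast cells."]

def filter_module(abstract):
    # Drop abstracts that are clearly irrelevant (placeholder heuristic).
    return "protein" in abstract.lower()

def pr_module(sentence):
    # Protein recognition stub: pretend capitalized tokens are proteins.
    return [tok.strip(".,") for tok in sentence.split() if tok[:1].isupper()]

def pid_module(sentence, proteins):
    # Interaction detection stub; the real module applies a trained classifier.
    if "bind" not in sentence.lower():
        return []
    return [(proteins[i], proteins[j])
            for i in range(len(proteins)) for j in range(i + 1, len(proteins))]

def run_pipeline(database):
    for abstract in query_module():
        if not filter_module(abstract):
            continue
        for sentence in abstract.split(". "):
            proteins = pr_module(sentence)
            if len(proteins) >= 2:
                database.extend(pid_module(sentence, proteins))

results = []                 # stands in for the searchable results database
run_pipeline(results)
print(results)               # [('ProteinA', 'ProteinB')]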

The classifier in the PID Module is currently trained on five public datasets: AIMed [3], BioInfer [4], IEPA [6], HPRD50 [2], and LLL [5]. These datasets have been labeled and verified by experts. Currently, the classifier works but produces only mediocre results; there is still much room for improvement. This project aims to improve the overall accuracy of the PRISE classifier by utilizing an active learning process. The key idea behind active learning is that a machine-learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns [7]. Besides the original five training datasets and what the PID Module has already classified, there is no other source of labeled training data. For an active learning approach to be effective, our algorithm needs a large set of unlabeled data to choose from, such as the abstracts being queried from PubMed daily; these chosen sentences must then be labeled by a human expert. This allows the annotation tool to serve a dual purpose: identifying misclassifications and creating verified training examples. Once misclassifications have been found, they can be used as input to the active learning algorithm, which then chooses sentences to be labeled by experts using the annotation tool. The labeled sentences are then ready to be added to the training set and improve the classifier's results.


Annotations


In order to facilitate an active learning process that utilizes unlabeled data, we needed a way to obtain labels for the chosen data. We implemented a tool that presents a sentence to a human expert and collects labels in the form of annotations. An annotation takes the form of [p1, p2, it], where p1 and p2 are interacting proteins and it is the interaction type. It is important to note that the order in which p1 and p2 are presented implies the direction of the interaction: p1 acts on p2. An unlabeled sentence may contain no interactions, a single interaction, multiple interactions, or proteins that do not interact. Each of these cases must be handled so that the labeled example can be added to the training data.
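To make the annotation format concrete, the sketch below models one annotation as a small Python record. The field names and the character-offset fields are assumptions for illustration; the report does not specify the exact storage format.

from dataclasses import dataclass

# Hypothetical representation of one annotation [p1, p2, it].
# Character offsets record where the highlighted text appears in the sentence;
# the direction of the interaction is p1 acting on p2.
@dataclass
class Annotation:
    p1: str          # first (acting) protein
    p1_span: tuple   # (start, end) offsets of the highlighted text
    p2: str          # second protein
    p2_span: tuple
    it: str          # interaction type, e.g. "binds"
    it_span: tuple

# For the sentence "RAD51 binds to BRCA2 in vivo":
example = Annotation("RAD51", (0, 5), "BRCA2", (15, 20), "binds", (6, 11))

A sentence with no interactions simply has no such records attached; a sentence with several interactions gets several.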

One user annotating one sentence, called an annotation event, follows this process: the user is presented with a sentence and is asked either to state that there are no interactions or to identify the two interacting proteins and the interaction type. In the case of identifying an interaction, the user highlights the text in the sentence that represents each of the entities they have identified. This gives us the exact position of each entity in the text, which is not only important for training the classifier, but also important later on in the annotation process. Each of the entities within an annotation [p1, p2, it] is automatically mapped to entities stored in the database. If the highlighted text matches an existing entity, that entity is used; otherwise a new entity is created. In the case of identifying the interaction type, there is an additional feature to help map different forms of the same word to the same entity: after highlighting the interaction type, the user chooses from a list containing the full ontology of interaction types.
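The entity mapping described above is essentially a get-or-create lookup. The following sketch shows one way it could work against an in-memory dictionary; the actual PRISE database layer is not described in the report, so this shape is only an assumption.

# Map highlighted text to a stored entity, creating the entity if it is new.
# A plain dict stands in for the PRISE database (an assumption for this sketch).
def map_to_entity(entities, highlighted_text, ontology_choice=None):
    # For interaction types, the user also picks a canonical term from the
    # ontology list, so different surface forms map to the same entity.
    key = (ontology_choice or highlighted_text).lower()
    if key not in entities:
        entities[key] = {"canonical": key, "surface_forms": set()}
    entities[key]["surface_forms"].add(highlighted_text)
    return entities[key]

entities = {}
map_to_entity(entities, "phosphorylated", ontology_choice="phosphorylation")
map_to_entity(entities, "phosphorylates", ontology_choice="phosphorylation")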

This annotation tool is designed to be deployed on the web, where the largest number of experts can easily participate. This also introduces the problem of collecting incorrect annotations, whether left maliciously or by a misinformed expert. To account for this, the annotation process is collaborative, meaning that a single sentence must be annotated several times by different experts before the results will be used. Naturally this presented two problems: choosing the order in which sentences should be displayed and determining a consensus from the annotations for each sentence. The active learning process determines which sentences need to be labeled, adding them to a queue. When a user logs in to the annotation interface, a vector of sentences is created, which ensures that the user is presented with sentences that they have not yet labeled, in the order indicated by the active learning algorithm.



We devised a method to find the consensus among the annotations left for a single sentence. First, a basic comparison metric was needed between two annotations to determine in what state of agreement they are. We settled on three scores to express agreement: strict agreement (represented by +), strict disagreement (represented by -), and ambiguity (represented by ?).

To compute these scores, each of the three entities within an annotation is compared, respectively, to the three entities in the second annotation. If each of the three entities matches the entities in the second annotation exactly, then the comparison score is strict agreement (+). To achieve a strict disagreement score (-), at least one pair of corresponding entities in the two annotations must not match and must have an overlapping text position within the sentence. If two experts select different entities in the same sentence that share the same text, the only explanation is that they disagree about what is stated in the sentence. Apart from the obvious case where two users select distinct overlapping segments of text, there is another case where two users may highlight the same text for the interaction type but select a different interaction type from the list. Finally, if the two annotations do not strictly agree or disagree, then they are assigned the third score, ambiguity (?). This means that one or more of the corresponding entities are different but do not share the same text position. The reason for the ambiguity score is that a sentence may contain multiple interactions; this score allows us to extract multiple opinions about different parts of the sentence without one interfering with another.
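A minimal sketch of this pairwise comparison is given below, assuming the Annotation record from the earlier sketch; the exact matching rules used in PRISE may be stricter or looser than shown here.

def spans_overlap(a, b):
    # Two highlighted spans overlap if neither ends before the other starts.
    return a[0] < b[1] and b[0] < a[1]

def compare_annotations(x, y):
    # Returns '+' (strict agreement), '-' (strict disagreement) or '?' (ambiguity).
    pairs = [((x.p1, x.p1_span), (y.p1, y.p1_span)),
             ((x.p2, x.p2_span), (y.p2, y.p2_span)),
             ((x.it, x.it_span), (y.it, y.it_span))]
    if all(a == b for a, b in pairs):
        return "+"        # all three entities match exactly
    for (ent_a, span_a), (ent_b, span_b) in pairs:
        if ent_a != ent_b and spans_overlap(span_a, span_b):
            return "-"    # different entities chosen for overlapping text
    return "?"            # differences, but about different parts of the sentence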

Using this scoring method, each annotation is compared to every other annotation for the same sentence. If comparing two annotations yields a strict agreement score, they are the same; for this reason, all of the annotations in strict agreement are grouped. Each group of annotations, which contains one or more annotations, represents an opinion about the sentence. For each opinion, the number of annotations in strict agreement and in strict disagreement is tallied. Confidence scores are then calculated for each opinion using the equation below. p represents the count of strict agreement scores, d represents the count of strict disagreements, and penalty represents a scaling mechanism, which ensures that an opinion has the support of multiple users. Once the penalty threshold is reached, the confidence score is not scaled back.








confidence = (p / (p + d)) × min(1, p / penalty)




Once the confidence score of an opinion reaches a preset threshold, k, it has gained enough support to be trusted. Presently, we have found that setting k to 0.5 has yielded accurate results after gathering 850 annotations. At this point we refer to the opinion as a verified interaction, which represents a consensus on the sentence. It is quite possible that a sentence may contain multiple verified interactions.
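Putting the grouping, tallying, and thresholding together, the sketch below computes verified interactions for one sentence. It reuses the comparison function from the earlier sketch; the default penalty value is an assumed placeholder, and the confidence expression follows the formula above.

def consensus(annotations, penalty=3, k=0.5):
    # Group annotations that are in strict agreement; each group is one
    # "opinion" about the sentence.
    groups = []
    for ann in annotations:
        for group in groups:
            if compare_annotations(ann, group[0]) == "+":
                group.append(ann)
                break
        else:
            groups.append([ann])

    verified = []
    for group in groups:
        p = len(group)  # annotations in strict agreement with this opinion
        d = sum(1 for other in annotations
                if other not in group
                and compare_annotations(group[0], other) == "-")
        # Agreement ratio, scaled down until the opinion has the support of
        # at least `penalty` users (not scaled back beyond that point).
        confidence = (p / (p + d)) * min(1.0, p / penalty)
        if confidence >= k:
            verified.append(group[0])  # the opinion becomes a verified interaction
    return verified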



Verified interactions are ready to be used as labeled training examples.



Choosing Sentences for Annotation


The strategy for determining which sentences should be labeled and added to the training set determines whether or not our approach improves the classifier. We hypothesized that sentences similar to ones that were misclassified are the best source of subsequent training instances. The theory is that the classifier is misclassifying some sentences based on a certain feature that it has been trained on. Adding new sentences to the training set that contain the same feature should correct the error in the training set. The first step is finding misclassified sentences.

There are two ways to obtain misclassified sentences. The first is to gather annotations on sentences that have already been evaluated by the classifier. The second is to split the current training set, consisting of the AIMed, BioInfer, IEPA, HPRD50, and LLL datasets, into a training set and a test set. The benefit of the latter method is that labels already exist for the test set, which can be compared to the results from the classifier. A problem with the former method is that it is time-consuming, and we needed to run enough experiments to be confident that our approach was working. For these reasons, we opted to use the second method so that the number of annotations that could be gathered in a short time would not limit us.

Algorithm 1: Finding similar sentences

m is defined as a misclassified sentence; U is the unlabeled set of sentences

stem m
remove stop words from m
remove non-alphanumeric characters from m
for each s in U
    stem s
    remove stop words from s
    remove non-alphanumeric characters from s
    add terms of s to the TF-IDF initialization
end for
for each s in U
    let c = cosine_similarity(tfidf_vector(m), tfidf_vector(s))
    add [s, c] to R
end for
sort R by cosine similarity, descending
return R
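A rough, runnable version of Algorithm 1 using scikit-learn is sketched below. The preprocessing is simplified (lowercasing, stripping non-alphanumeric characters, and scikit-learn's English stop word list, with no stemming), so it only approximates the steps listed above.

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def clean(text):
    # Simplified preprocessing: lowercase and strip non-alphanumeric characters.
    return re.sub(r"[^a-z0-9 ]", " ", text.lower())

def find_similar(m, unlabeled):
    # Fit TF-IDF over the misclassified sentence plus the unlabeled set,
    # then rank the unlabeled sentences by cosine similarity to m.
    docs = [clean(m)] + [clean(s) for s in unlabeled]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    scores = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    ranked = sorted(zip(unlabeled, scores), key=lambda sc: sc[1], reverse=True)
    return ranked   # [(sentence, similarity), ...], most similar first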




Algorithm 2: Compiling results

M is defined as the set of misclassified sentences; U is the unlabeled set of sentences

let R = a hash [sentence, count]
for each m in M
    let T = Algorithm 1(m, U)
    for i = 0 to (0.03 * length(T))
        increment R[T[i][0]]
    end for
end for
sort R by count, descending
return R
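Algorithm 2 can then be expressed directly on top of the find_similar sketch above; a Counter keeps the per-sentence tallies, and only the top 3% of each ranked list is counted, as in the pseudocode (with at least one sentence counted per list in this sketch).

from collections import Counter

def compile_results(misclassified, unlabeled):
    counts = Counter()
    for m in misclassified:
        ranked = find_similar(m, unlabeled)          # Algorithm 1
        top_n = max(1, int(0.03 * len(ranked)))      # top 3% most similar
        for sentence, _score in ranked[:top_n]:
            counts[sentence] += 1
    # Sentences that repeatedly appear near the top across misclassified
    # examples are the strongest candidates for annotation.
    return counts.most_common()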



Figure 2: Example sentence similarity scores (similarity score plotted against sentence number)


Algorithm 1 looks for the most similar sentences in an unlabeled set of sentences using some preprocessing and TF-IDF [8], then returns a rank-ordered list of the unlabeled sentences, most similar first. Algorithm 2 runs Algorithm 1 for each of the misclassified sentences and keeps a count of how many times each sentence appears in the top 3% of the results from Algorithm 1. When this is complete, the counts are rank ordered, highest count first, and returned. Figure 2 shows example results from Algorithm 1; the top 3% most similar sentences are the ones used as the outlying most similar sentences in Algorithm 2.



Experimental Results and Analysis


In order to evaluate whether our method would improve the classifier, we designed experiments that could test our hypothesis efficiently and thoroughly. To start each experiment, we chose the following sets of sentences at random from the five public datasets mentioned previously:



Training set: 1000 sentences
Test set: 500 sentences
Candidate set: 1500 sentences

The candidate set represents the set of unlabeled sentences; for the purpose of the experiments, the labels the sentences already contained were ignored. In the live PRISE environment, the candidate set will be all of the sentences in the abstracts retrieved from PubMed. The sentences chosen by the active learning algorithm will be added to the queue, which determines the order in which sentences are presented to each expert. Finally, once a sentence has gathered enough annotations to become verified, it will be added to the training set. By using a candidate set consisting of already labeled sentences in our experiments, we were able to skip the longer process that will be used in the live application.

The experiment followed these steps:

1. Choose datasets randomly
2. Train the classifier on the training set
3. Have the classifier evaluate the test set
4. Measure the precision and recall
5. Choose 20 misclassified sentences as follows
   a. 10 classified as having interactions, but actually do not
   b. 10 classified as not having interactions, but actually do
   c. Sentences are chosen in order of highest classifier confidence first
6. Run Algorithm 2 with the 20 misclassified sentences and the candidate set
7. Remove the first 50 sentences from the Algorithm 2 result set and add them to the training set
8. Train the classifier on the updated training set
9. Evaluate the test set
10. Measure the precision and recall
11. Repeat steps 6-9 four more times
12. Move the 250 sentences added to the training set back to the candidate set
13. Add 50 random sentences from the candidate set to the training set
14. Train the classifier on the updated training set
15. Evaluate the test set
16. Measure the precision and recall
17. Repeat steps 12-15 four more times

The entire experiment, including randomly selecting the training, test, and candidate sets, was repeated three times and the results averaged. Steps 12-16 were used as a control with which to compare our active learning results.
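The core of this procedure (train, evaluate, select similar sentences, retrain) is a pool-based active learning loop. The sketch below shows its shape with a generic scikit-learn-style classifier; the featurize function (assumed to be a pre-fitted vectorizer's transform), the simplified selection of misclassified sentences, and the batch size are stand-ins rather than the exact PRISE setup, and compile_results is the Algorithm 2 sketch from earlier.

from sklearn.metrics import precision_score, recall_score

def active_learning_rounds(clf, featurize, train, test, candidates,
                           rounds=5, batch=50):
    # train, test, candidates: lists of (sentence, label), label 1 when the
    # sentence contains an interaction. Candidate labels are only consulted
    # when a sentence joins the training set, standing in for annotation.
    history = []
    for _ in range(rounds):
        clf.fit(featurize([s for s, _ in train]), [y for _, y in train])
        y_true = [y for _, y in test]
        y_pred = clf.predict(featurize([s for s, _ in test]))
        history.append((precision_score(y_true, y_pred),
                        recall_score(y_true, y_pred)))
        # Simplification: take the first 20 misclassifications instead of the
        # 10 + 10 split ordered by classifier confidence used in the report.
        misclassified = [s for (s, y), p in zip(test, y_pred) if p != y][:20]
        ranked = compile_results(misclassified, [s for s, _ in candidates])
        chosen = {s for s, _count in ranked[:batch]}
        train = train + [(s, y) for s, y in candidates if s in chosen]
        candidates = [(s, y) for s, y in candidates if s not in chosen]
    return history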


Experiment 1 Results

The results reveal several interesting points. First, adding sentences at random yielded virtually no change in the precision or recall. Second, adding similar sentences also did not improve recall. Finally, there was a significant increase in precision after adding the first and second sets of 50 similar sentences. However, adding more than 100 similar sentences actually starts to decrease the precision. We treated the classifier as a black box, which only allows us to speculate about why each measurement except the similarity-based precision did not improve, and why that precision dropped after adding 100 sentences. Our theory is that the classifier is overfitting and that adding too many similar sentences corrects one feature in the training model but hurts others. We had expected to see the precision and recall increase as we added more sentences to the training set, whether by our method or randomly.
Overall, the results support our hypothesis that sentences similar to ones that were misclassified are the best source of subsequent training instances.

[Experiment 1 chart: precision and recall versus total training examples (1000 to 1250) for the random and similarity-based selection strategies.]

After analyzing the results, it seemed that a plausible method for improving the classifier would be to stop adding training examples after 50 or 100 similar sentences. Once this is complete, the process should be restarted, choosing new misclassified sentences, in the hope that the drop in precision can be avoided. However, running this experiment ended with the same results as the first experiment. Our method for choosing which misclassified sentences to use picks the sentences with the highest classifier confidence; however, after several iterations of re-choosing misclassified sentences and re-training, roughly half of the misclassified sentences were not corrected. In an attempt to compensate for this, we tried having the misclassified sentences chosen at random. This too ended with the same results, with recall holding steady and precision gaining 9-10% and then dropping back down. This led to several more theories of why this was happening: possibly the candidate set was too small and did not contain similar enough examples, the training set was already too large and the active learning process should have started earlier, or the same theory of overfitting.

With a limited set of labeled sentences, we were able to put two of these theories to the test by both reducing the size of the training set and increasing the size of the candidate set. The second experiment used the following sets of randomly chosen sentences:



Training set: 250 sentences
Test set: 500 sentences
Candidate set: 2250 sentences

The second experiment followed these steps (the changes from experiment 1 are in steps 5c, 11, and 17):

1. Choose datasets randomly
2. Train the classifier on the training set
3. Have the classifier evaluate the test set
4. Measure the precision and recall
5. Choose 20 misclassified sentences as follows
   a. 10 classified as having interactions, but actually do not
   b. 10 classified as not having interactions, but actually do
   c. Misclassified sentences are chosen randomly
6. Run Algorithm 2 with the 20 misclassified sentences and the candidate set
7. Remove the first 50 sentences from the Algorithm 2 result set and add them to the training set
8. Train the classifier on the updated training set
9. Evaluate the test set
10. Measure the precision and recall
11. Repeat steps 5-9 eleven more times
12. Move the 250 sentences added to the training set back to the candidate set
13. Add 50 random sentences from the candidate set to the training set
14. Train the classifier on the updated training set
15. Evaluate the test set
16. Measure the precision and recall
17. Repeat steps 12-15 eleven more times


Experiment 2 Results

This experiment gave better results: not only did the precision steadily increase for our approach, but the recall also held a higher score. It also achieved higher precision and recall than experiment 1 with far fewer training examples. The results from adding sentences at random had a higher precision but a very low recall. This reflects the actual distribution of sentences containing interactions in the datasets. We believe that with a steadily increasing candidate set there will be similar enough examples to continuously improve the results of the PRISE classifier, especially if the training set can be built from the ground up using our active learning approach.
[Experiment 2 chart: precision and recall versus total training examples (250 to 800) for the similarity-based and random selection strategies.]


Educational Statement

The challenges faced in this project included redesigning the existing PRISE database schema, migrating the old data and application to the new schema, creating and integrating a web interface to support distributed annotating, developing methods for determining a consensus among annotations and rating sentence similarity, and implementing active learning to improve classifier results.

Web technologies, active learning, and information retrieval evaluation techniques were studied as a part of this project. The Web Search, Masters Seminar, an independent study in Ad Auctions, and Database Management Systems classes were especially useful in providing the knowledge and experience to complete this project. The project resulted in the design, implementation, and documentation of a method to improve the results of the PRISE classifier.


Conclusion


In this paper we have presented a system to collect annotations in a distributed environment and find the consensus among the results. In addition, we presented our method of choosing new training examples to improve the results of the PRISE classifier. Our experiments supported our hypothesis that sentences similar to ones misclassified are the best source of subsequent training instances. PRISE now has a process to continuously become more accurate in its predictions about the protein interactions contained in PubMed abstracts.


Future Research


In this project we found one working method to compute sentence similarity: TF-IDF. However, many other approaches exist, and some may produce better results than ours. For example, using a similarity measure based on the classifier's internal model for comparing examples would theoretically be the best. However, time constraints in this project did not allow for this theory to be tested thoroughly. The hypothesis was supported for the current PRISE classifier; it would be interesting to see if the same conclusion would be reached using different classifiers or within a different application.



References


1. PubMed. US National Library of Medicine. http://www.ncbi.nlm.nih.gov/pubmed/

2. Fundel K, Küffner R, Zimmer R: RelEx - relation extraction using dependency parse trees. Bioinformatics 2007, 23(3):365-371.

3. Bunescu R, Ge R, Kate RJ, Marcotte EM, Mooney RJ, Ramani AK, Wong YW: Comparative experiments on learning information extractors for proteins and their interactions. Artificial Intelligence in Medicine 2005, 33(2):139-155.

4. Pyysalo S, Ginter F, Heimonen J, Björne J, Boberg J, Järvinen J, Salakoski T: BioInfer: A Corpus for Information Extraction in the Biomedical Domain. BMC Bioinformatics 2007, 8:50.

5. Nedellec C: Learning language in logic - genic interaction extraction challenge. Proceedings of the ICML-2005 Workshop on Learning Language in Logic (LLL05) 2005, 31-37.

6. Ding J, Berleant D, Nettleton D, Wurtele ES: Mining MEDLINE: Abstracts, Sentences, or Phrases? Pacific Symposium on Biocomputing 2002, 326-337.

7. Settles B: Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison.

8. Manning CD, Raghavan P, Schütze H: Introduction to Information Retrieval. Cambridge University Press, 2008.

9. PRISE: The protein interaction search engine. http://prise.insttech.washington.edu