Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions


University of Texas at Austin

Machine Learning Group

Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions










Razvan C. Bunescu


Raymond J. Mooney


Machine Learning Group

Department of Computer Sciences

University of Texas at Austin

{razvan, mooney}@cs.utexas.edu





Arun K. Ramani

Edward M. Marcotte

Institute for Cellular and Molecular Biology and
Center for Computational Biology and
Bioinformatics

University of Texas at Austin

{arun, marcotte}@icmb.utexas.edu


Outline


Introduction & Motivation.


Two benchmark tests of accuracy.


Framework for the extraction of interactions.


Future Work.


Conclusions.


Introduction


Large-scale protein networks facilitate a better understanding of the interactions between proteins.

Most complete for yeast.

Minimal progress for human.

Most known interactions between human proteins are reported in Medline.

Reactome, BIND, HPRD: databases with protein interactions manually curated from Medline.

Example: "In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit."


Motivation


Many interactions from Medline are not covered by current databases.

Databases are generally biased toward different classes of interactions.

Manually extracting interactions is a very laborious process.

Aim: Automatically identify pairs of interacting proteins with high accuracy.


Outline


Introduction & Motivation.


Two benchmark tests of accuracy.


Functional Annotation.


Physical Interaction.


Framework for the extraction of interactions.


Future Work.


Conclusions.


Accuracy Benchmarks

Shared Functional Annotations

Accuracy of interaction datasets correlates well with the % of interaction partners sharing functional annotations.

Functional annotation: a pathway shared by the two proteins in a particular ontology:

KEGG: 55 pathways at the lowest level.

GO: 1,356 pathways at level 8 of the biological process annotation.


Accuracy Benchmarks

Shared Known Physical Interactions

Assumption: Accurate datasets are more enriched in pairs of proteins known to participate in a physical interaction.

Reactome and BIND are more accurate than the others => use them as the source of known physical interactions.

Total: 11,425 interactions between 1,710 proteins.


Accuracy Benchmarks

LLR Scoring Scheme

Use the log-likelihood ratio (LLR) of protein pairs with respect to:

Sharing functional annotations.

Physically interacting.

LLR = log [ P(D|I) / P(D|-I) ]

P(D|I) and P(D|-I) are the probabilities of observing the data D conditioned on the proteins sharing (I) or not sharing (-I) benchmark associations.

Higher values for LLR indicate higher accuracy.
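As a concrete sketch of the scoring scheme (the numbers are hypothetical, not from the talk), the LLR reduces to a single log of the ratio of the two conditional probabilities:

```python
import math

def llr(p_d_given_i, p_d_given_not_i):
    """Log-likelihood ratio log[P(D|I) / P(D|-I)], where I / -I mean the
    protein pair does / does not share a benchmark association."""
    return math.log(p_d_given_i / p_d_given_not_i)

# Hypothetical counts: a dataset property D holds for 40% of
# benchmark-positive pairs but only 2% of benchmark-negative pairs.
score = llr(0.40, 0.02)  # log(20), about 3.0; higher LLR = higher accuracy
```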


Outline


Introduction & Motivation.


Two benchmark tests of accuracy.


Framework for the extraction of interactions.



Future Work.


Conclusions.


Framework for Interaction Extraction

Pipeline: Medline abstract -> Protein Extraction -> Medline abstract (proteins tagged) -> Interaction Extraction -> Interactions Database.

Extensive comparative experiments in [Bunescu et al. 2005]:

Protein Extraction: Maximum Entropy tagger.

Interaction Extraction: ELCS (Extraction using Longest Common Subsequences).

The current framework aims to improve on the previous approach at a much larger scale (750K Medline abstracts).



Framework for Interaction Extraction

1) [Protein Extraction] Identify protein names using a Conditional Random Fields (CRF) tagger [Lafferty et al. 2001], trained on a dataset of 750 Medline abstracts manually tagged for proteins.

2) [Interaction Extraction] Keeping the most confident extractions, detect which pairs of proteins are interacting. Two methods:

2.1) Co-citation analysis (document level).

2.2) Learning of interaction extractors (sentence level).


1) A CRF tagger for protein names


Protein Extraction: a sequence tagging task, where each word is assigned a tag from: O(utside), B(egin), C(ontinue), E(nd), U(nique).

Tags:   O  O  O  O  O  O  B  E  O  O  O  O  O
Tokens: In synchronized human osteosarcoma cells , cyclin D1 is induced in early G1

The input text is first preprocessed:

Tokenized.

Split into sentences (Ratnaparkhi's MXTerminator).

Tagged with part-of-speech (POS) tags (Brill's tagger).
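The tag scheme above can be decoded back into protein mentions; a minimal sketch (not the authors' code) of recovering names from an O/B/C/E/U tag sequence:

```python
tokens = ["In", "synchronized", "human", "osteosarcoma", "cells", ",",
          "cyclin", "D1", "is", "induced", "in", "early", "G1"]
tags   = ["O", "O", "O", "O", "O", "O",
          "B", "E", "O", "O", "O", "O", "O"]

def extract_proteins(tokens, tags):
    """B starts a multi-token name, C continues it, E ends it;
    U marks a single-token name; O is outside any name."""
    proteins, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "U":
            proteins.append(tok)
        elif tag == "B":
            current = [tok]
        elif tag == "C":
            current.append(tok)
        elif tag == "E":
            current.append(tok)
            proteins.append(" ".join(current))
            current = []
    return proteins

print(extract_proteins(tokens, tags))  # ['cyclin D1']
```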


1) A CRF tagger for protein names


Each token position in a sentence is associated with a vector of binary features based on the (current tag, previous tag) combination, and observed values such as:

Words before, after, or at the current position.

Their POS tags & capitalization patterns.

A binary flag set to true if the word is part of a protein dictionary.

POS tags: IN VBN JJ NN NNS , NN NNP VBZ VBN IN JJ
Tokens:   In synchronized human osteosarcoma cells , cyclin D1 is induced in early

(Feature window: the current word and POS, plus the words and POS tags before and after the current position.)
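A sketch of how such observed values might be assembled for one token position (the feature names and the tiny dictionary are illustrative, not the original feature set):

```python
def features(tokens, pos_tags, i, protein_dict):
    """Observed-value features for token position i, following the
    feature window described above (names are hypothetical)."""
    return {
        "word": tokens[i],
        "word_before": tokens[i - 1] if i > 0 else "<s>",
        "word_after": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
        "pos": pos_tags[i],
        "pos_before": pos_tags[i - 1] if i > 0 else "<s>",
        "pos_after": pos_tags[i + 1] if i + 1 < len(tokens) else "</s>",
        "is_capitalized": tokens[i][:1].isupper(),
        "has_digit": any(c.isdigit() for c in tokens[i]),
        "in_protein_dict": tokens[i] in protein_dict,
    }

toks = ["cyclin", "D1", "is", "induced"]
pos  = ["NN", "NNP", "VBZ", "VBN"]
print(features(toks, pos, 1, {"D1", "cyclin D1"}))
```

In the actual CRF, each such observation is paired with every (current tag, previous tag) combination to form the binary features.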


1) A CRF tagger for protein names



The CRF model is trained on 750 Medline abstracts manually annotated for proteins.

Experimentally, CRFs give better performance than Maximum Entropy models: they allow local tagging decisions to compete against each other in a global sentence model.

The model is used to tag a large set (750K) of Medline abstracts citing the word 'human'.

Each extracted protein is assigned a normalized confidence value.

For the Interaction Extraction step, we keep only proteins scoring 0.8 or better.


2.1) Interaction Extraction using Co-citation Analysis

Intuition: proteins co-occurring in a large number of abstracts tend to be interacting proteins.

Compute the probability of co-citation under a random model (hypergeometric distribution):

N = total number of abstracts (750K)
n = abstracts citing the first protein
m = abstracts citing the second protein
k = abstracts citing both proteins
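Under the random model, the quantity of interest is the hypergeometric tail probability P(co-citations >= k). A small sketch with toy numbers (the talk uses N = 750K; the function name is illustrative):

```python
from math import comb

def cocitation_pvalue(N, n, m, k):
    """P(X >= k) under a hypergeometric model: N abstracts total,
    n cite protein A, m cite protein B, k cite both."""
    return sum(comb(m, i) * comb(N - m, n - i)
               for i in range(k, min(n, m) + 1)) / comb(N, n)

# Toy numbers: 5 co-citations where the random model expects 0.6,
# so the pair is far more co-cited than chance predicts.
p = cocitation_pvalue(N=1000, n=20, m=30, k=5)
```

A low p-value flags the pair as co-cited more often than the random model can explain.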


2.1) Interaction Extraction using Co-citation Analysis

Protein pairs which co-occur in a large number of abstracts (high k) are assigned a low probability under the random model.

Empirically, protein pairs whose observed co-citation rate is assigned a low probability under the random model score high on the functional annotation benchmark.

RESULT: Close to 15K extracted interactions that score comparably to or better than HPRD on the functional annotation benchmark.


2.1) Co-citation Analysis with Bayesian Reranking

1. Use a trained Naïve Bayes model to measure the likelihood that an abstract discusses physical protein interactions.

2. For a given pair of proteins, compute the average score of the co-citing abstracts.

3. Use the average score to re-rank the 15K already extracted pairs.

Pipeline: Medline abstract -> CRF tagger -> Medline abstract (proteins tagged) -> Co-citation Analysis -> Ranked Interactions -> (+ Naïve Bayes scores) -> Re-ranked Interactions.
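The re-ranking step above can be sketched in a few lines (identifiers and scores are hypothetical, not from the talk):

```python
def rerank(pairs_to_abstracts, nb_score):
    """Re-rank protein pairs by the average Naïve Bayes
    'discusses-physical-interactions' score of their co-citing abstracts."""
    avg = {
        pair: sum(nb_score[a] for a in abstracts) / len(abstracts)
        for pair, abstracts in pairs_to_abstracts.items()
    }
    return sorted(avg, key=avg.get, reverse=True)

nb_score = {"pmid1": 0.9, "pmid2": 0.2, "pmid3": 0.8}
pairs = {("A", "B"): ["pmid1", "pmid3"],   # average 0.85
         ("A", "C"): ["pmid2"]}            # average 0.20
print(rerank(pairs, nb_score))  # [('A', 'B'), ('A', 'C')]
```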


Integrating Extracted Data with Existing Databases

Extracted: 6,580 interactions between 3,737 human proteins.

Total: 31,609 interactions between 7,748 human proteins.



2.1) Co-citation Analysis: Evaluation

[Evaluation plots not preserved in this transcript.]

2.2) Learning of Interaction Extractors


Proteins may be co-cited for reasons other than interactions.

Solution: sentence-level extraction, with a binary classifier.

Given a sentence containing the two protein names, output:

Positive: if the sentence asserts an interaction between the two.

Negative: otherwise.

If the sentence contains n > 2 protein names, replicate it into (n choose 2) sentences, each with only two protein names.

Training data: AImed, a collection of Medline abstracts, manually tagged.
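The (n choose 2) replication amounts to enumerating unordered protein pairs per sentence, which the classifier then judges one pair at a time; a minimal sketch:

```python
from itertools import combinations

# A sentence with n tagged proteins yields C(n, 2) candidate examples,
# one per unordered pair of names.
proteins = ["cyclin D1", "p34cdc2", "p33cdk2"]
candidate_pairs = list(combinations(proteins, 2))
print(candidate_pairs)
# [('cyclin D1', 'p34cdc2'), ('cyclin D1', 'p33cdk2'), ('p34cdc2', 'p33cdk2')]
```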


AImed


Total of 225 documents (200 with interactions + 25 without interactions).

Annotations for proteins and interactions.

"In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit."

"Immunoprecipitation experiments with human osteosarcoma cells and Ewing's sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and that cyclin D1 immune complexes exhibit appreciable histone H1 kinase activity ..."

Annotated interactions:

cyclin D1 -- becomes associated with -- p9Ckshs1 => Interaction
cyclin D1 -- is associated with both -- p34cdc2 => Interaction
cyclin D1 -- is associated with both ... and -- p33cdk2 => Interaction


ELCS (Extraction using Longest Common Subsequences)

A new method for inducing rules that extract interactions between previously tagged proteins.

Each rule consists of a sequence of words with allowable word gaps between them, similar to [Blaschke & Valencia, 2001, 2002]:

- (7) interactions (0) between (5) PROT (9) PROT (17) .

Any pair of proteins in a sentence, if tagged as interacting, forms a positive example; otherwise it forms a negative example.

Positive examples are repeatedly generalized to form rules until the rules become overly general and start matching negative examples. [Bunescu et al., 2005]
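A simplified sketch of how such a gap-constrained rule can be matched against a tagged sentence (greedy left-to-right matching; the actual ELCS matcher may differ):

```python
def matches(rule, tokens):
    """`rule` alternates words and maximum word gaps, e.g.
    ['interactions', 0, 'between', 5, 'PROT', 9, 'PROT'] means
    'between' follows 'interactions' immediately, a PROT occurs within
    5 words of 'between', and a second PROT within 9 words of the first."""
    i = 0                    # position just past the last matched word
    max_gap = len(tokens)    # no limit before the first rule word
    for item in rule:
        if isinstance(item, int):
            max_gap = item
            continue
        window = tokens[i:i + max_gap + 1]
        if item not in window:
            return False
        i += window.index(item) + 1
    return True

tokens = "the interactions between PROT and PROT were confirmed".split()
rule = ["interactions", 0, "between", 5, "PROT", 9, "PROT"]
print(matches(rule, tokens))  # True
```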


ERK (Extraction using a Relation Kernel)

The patterns (features) are sparse subsequences of words constrained to be anchored on the two protein names.

The feature space can be further pruned down: in almost all examples, a sentence asserts a relationship between two entities using one of the following patterns:

[FI] Fore-Inter: 'interaction of P1 with P2', 'activation of P1 by P2'

[I] Inter: 'P1 interacts with P2', 'P1 is activated by P2'

[IA] Inter-After: 'P1 - P2 complex', 'P1 and P2 interact'

Restrict the three types of patterns to use at most 4 words (besides the two protein anchors).


ERK (Extraction using a Relation Kernel)

The kernel K(S1, S2): the number of common patterns between S1 and S2, weighted by their span in the two sentences.

K(S1, S2) can be computed with the dynamic programming procedure from [Lodhi et al., 2002].

Train an SVM model to find a max-margin linear discriminator between positive and negative examples.

S1: "In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit."

S2: "Experiments with human osteosarcoma cells and Ewing's sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and ..."

Common patterns (with the protein names written as P1, P2):

[FI] patterns: "human cells ... P1 associated with P2", ...

[I] patterns: "P1 associated with P2", ...

[IA] patterns: "P1 associated with P2 ,", ...
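An unweighted simplification of the [I] pattern space can show how two sentences come to share patterns (the real kernel also weights each common subsequence by its span and uses the Lodhi et al. dynamic program rather than explicit enumeration):

```python
from itertools import combinations

def inter_patterns(tokens, max_words=4):
    """Sparse subsequences of at most `max_words` words strictly between
    the two protein anchors P1 and P2 (order-preserving, gaps allowed).
    Simplified, unweighted sketch of the [I] pattern type."""
    i, j = tokens.index("P1"), tokens.index("P2")
    between = tokens[i + 1:j]
    pats = set()
    for r in range(1, max_words + 1):
        for combo in combinations(between, r):  # positions in order
            pats.add(("P1",) + combo + ("P2",))
    return pats

s1 = "P1 is induced in early G1 and becomes associated with P2".split()
s2 = "demonstrated that P1 is associated with both P2 and p33cdk2".split()
common = inter_patterns(s1) & inter_patterns(s2)
K = len(common)  # unweighted stand-in for the kernel value
print(("P1", "associated", "with", "P2") in common)  # True
```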


Evaluation: ERK vs ELCS vs Manual

Compare results using the standard measures of precision and recall:

Precision = correctly extracted interactions / total extracted interactions
Recall = correctly extracted interactions / total correct interactions

All three systems were tested on AImed, using gold-standard proteins.
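With extractions and gold annotations represented as sets of protein pairs, the two measures reduce to set intersections; a small sketch with made-up pairs:

```python
def precision_recall(extracted, gold):
    """Precision = correct / extracted, recall = correct / gold,
    over sets of extracted interaction pairs."""
    correct = len(extracted & gold)
    return correct / len(extracted), correct / len(gold)

gold = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
extracted = {("A", "B"), ("A", "C"), ("A", "D")}
p, r = precision_recall(extracted, gold)
print(p, r)  # 2/3 precision, 0.5 recall
```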


Future Work & Conclusions

Future Work:

Analyze the complete set of 750K abstracts using the relational kernel and integrate the results into an improved composite dataset.

Conclusions:

Created a large database of interacting human proteins by consolidating interactions automatically extracted from Medline abstracts with existing databases.

Final database: 31,609 interactions between 7,748 human proteins.



For Further Information


Consolidated database available online:
http://bioinformatics.icmb.utexas.edu/idserve/

Papers available online:
http://www.cs.utexas.edu/users/ml/publication/bioinformatics.html

"Consolidating the Set of Known Human Protein-Protein Interactions in Preparation for Large-Scale Mapping of the Human Interactome," A.K. Ramani, R.C. Bunescu, R.J. Mooney, and E.M. Marcotte, Genome Biology, 6, 5, r40 (2005).

"Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions," Arun Ramani, Edward Marcotte, Razvan Bunescu, and Raymond Mooney, to appear in the Proceedings of the ISMB BioLINK SIG: Linking Literature, Information and Knowledge for Biology, Detroit, MI, June 2005.

"Collective Information Extraction with Relational Markov Networks," Razvan Bunescu and Raymond J. Mooney, Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-2004), pp. 439-446, Barcelona, Spain, July 2004.

"Comparative Experiments on Learning Information Extractors for Proteins and their Interactions," Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Edward M. Marcotte, Raymond J. Mooney, Arun Kumar Ramani, and Yuk Wah Wong, Artificial Intelligence in Medicine (Special Issue on Summarization and Information Extraction from Medical Documents), 33, 2 (2005), pp. 139-155.


The End


Protein Interaction Datasets

Normalization

Need a shared convention for referencing proteins and their interactions:

Map each interacting protein to a LocusLink ID => small loss of proteins.

Consider interactions symmetric => many duplicates eliminated.

Omit self-interactions: they cannot be evaluated on the functional annotation benchmark.

Example: HPRD reduced from 12,013 to 6,054 unique symmetric, non-self interactions.
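The normalization steps above can be sketched as a single pass over the raw pairs (the mapping table and raw pairs below are illustrative, not from the databases):

```python
def normalize(interactions, locuslink):
    """Map names to LocusLink IDs, drop unmappable and self-interactions,
    and de-duplicate by treating each interaction as symmetric."""
    unique = set()
    for a, b in interactions:
        if a not in locuslink or b not in locuslink:
            continue                       # small loss of unmappable proteins
        a, b = locuslink[a], locuslink[b]
        if a == b:
            continue                       # omit self-interactions
        unique.add(tuple(sorted((a, b))))  # symmetric => canonical order
    return unique

locuslink = {"CCND1": 595, "CDK4": 1019, "cyclin D1": 595}
raw = [("CCND1", "CDK4"), ("CDK4", "cyclin D1"),  # same pair, both orders
       ("CDK4", "CDK4"),                          # self-interaction
       ("CCND1", "unknownX")]                     # unmappable
print(normalize(raw, locuslink))  # {(595, 1019)}
```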


Protein Interaction Datasets

Normalization

Dataset statistics after normalization (Is = interactions, Ps = proteins):

Dataset            Version    Total Is (Ps)     Self Is (Ps)    Unique Is (Ps)
Reactome           08/03/04   12,497 (6,257)    160 (160)       12,336 (807)
BIND               08/03/04   6,212 (5,412)     549 (549)       5,663 (4,762)
HPRD               04/12/04   12,013 (4,122)    3,028 (3,028)   6,054 (2,747)
Orthology (all)    03/31/04   71,497 (6,257)    373 (373)       71,124 (6,228)
Orthology (core)   03/31/04   11,488 (3,918)    206 (206)       11,282 (3,863)


Accuracy of manually curated interactions

Functional Annotation Benchmark       Physical Interaction Benchmark
Database              LLR             Database              LLR
Reactome              3.8             N/A                   N/A
BIND                  2.9             N/A                   N/A
HPRD                  2.1             Core orthology        5.0
Core orthology        2.1             HPRD                  3.7
Non-core orthology    1.1             Non-core orthology    3.7