
Nov 7, 2013 (3 years and 11 months ago)


JCKBSE2010

Kaunas

Predicting Combinatorial Protein-Protein Interactions from Protein Expression Data Based on Correlation Coefficient



Sho Murakami, Takuya Yoshihiro, Etsuko Inoue and Masaru Nakagawa


Faculty of Systems Engineering, Wakayama University


Agenda


Background


Combinatorial Protein-Protein Interactions


The Proposed Data Mining Method


Evaluation


Conclusion



Background


Finding interactions among genes/proteins is important

Many data-mining algorithms for discovering gene-gene (or protein-protein) interactions have been proposed so far.

One of the main sources is gene or protein expression data



2D electrophoresis (for protein expression): the size of a spot is the expression level

Microarray (for gene expression): the color strength is the expression level


Related Work for Interaction Discovery


Bayesian Networks


Discovering interactions from expression data based on conditional probability among events

Ex. to discover protein-protein interactions among proteins A, B and C:

1. Define events A, B and C over the samples (e.g. the event "C is expressed")

2. Compute the conditional probabilities relating A, B and C

If a conditional probability is high, an interaction is predicted
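As a minimal sketch of the two steps above, the following Python estimates one such conditional probability, P(C expressed | A and B expressed), from a handful of samples. The threshold that turns an expression level into the event "is expressed" and the sample values are assumptions made for illustration only; the slides do not specify them.

```python
def conditional_probability(samples, threshold):
    """Estimate P(C expressed | A and B expressed).

    samples: list of (a, b, c) expression levels, one tuple per sample.
    threshold: hypothetical cut-off above which a protein counts as expressed.
    Returns None when no sample has both A and B expressed.
    """
    both = [s for s in samples if s[0] > threshold and s[1] > threshold]
    if not both:
        return None
    hits = sum(1 for s in both if s[2] > threshold)
    return hits / len(both)


# Illustrative (made-up) expression levels for proteins A, B and C.
samples = [
    (0.50, 0.20, 0.17),
    (0.30, 0.40, 0.12),
    (0.75, 0.10, 0.08),
    (0.40, 0.30, 0.20),
]
p = conditional_probability(samples, threshold=0.15)
print(p)
```

Only three of the four samples have both A and B expressed here, so the estimate rests on very few observations.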


Problems of Bayesian Networks


Bayesian networks require a large number of samples

Many samples are necessary to obtain statistically reliable results: are there sufficient samples in the area?

For genes: microarrays provide cheap and high-speed experiments

For proteins: 2D-electrophoresis is time-consuming and expensive


The Objective of our study

Finding combinatorial protein-protein interactions from small-size protein expression data


Expression Data


2D-electrophoresis is performed for each sample, which gives the expression level of each protein.

Expression levels: obtained by measuring the size of the spot areas

As pre-processing, normalization is applied

Sample (individual)    Protein ID
                       A       B       C       D
Sample1                0.50    0.20    0.17    0.06
Sample2                0.30    0.40    0.12    0.02
Sample3                0.75    0.10    0.08    0.02

Figure: 2D-electrophoresis images of sample1, sample2 and sample3. Each black area indicates a protein; the size of the area represents its expression level.


Model of Protein-Protein Interaction Considered


Model: two proteins A and B affect the expression level of another protein C only when both A and B are expressed

Usually, only the sole effect of A or B on C is considered

Only if both A and B exist does the combinatorial effect (through the complex of A and B) work on C!

We want to estimate the combinatorial effect!


Predicting Interactions by Correlation Coefficient


Compute the correlation coefficient between (A,B) and C

A correlation coefficient requires fewer samples

The amount of the complex (A,B) in a sample is estimated by min(A,B), taken over the expression levels of A and B in that sample

This amount would affect C, so we compute the correlation between min(A,B) and C

The total effect on C will be high if the correlation is high
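A minimal sketch of this computation in plain Python, using the A, B and C columns of the three samples shown in the expression-data table. The helper names `pearson` and `complex_amount` are illustrative and do not come from the slides; three samples are far too few for a meaningful estimate and only serve to show the mechanics.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant series
    return sxy / math.sqrt(sxx * syy)


def complex_amount(a, b):
    """Per-sample estimate of the amount of complex (A,B): min(A, B)."""
    return [min(ai, bi) for ai, bi in zip(a, b)]


a = [0.50, 0.30, 0.75]
b = [0.20, 0.40, 0.10]
c = [0.17, 0.12, 0.08]

r = pearson(complex_amount(a, b), c)
print(round(r, 3))
```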


The problem of scale difference


The expression level produced by one molecule differs among proteins, so the same amounts of A and B are not always combined.

Therefore, taking the min of the raw levels cannot express the correct amount of complex

Solution: correct the scale of A, so that taking the min leads to the correct amount of complex

Figure: scaling problem and solution. With the raw expression levels of A and B, the estimated amount of complex is not correct; after A is rescaled to the expression level required for a complex, taking the min gives the correct amount.


How to determine correct scale?

Select the scale which leads to the maximum correlation coefficient between min(A,B) and C

If an interaction of our model exists, a high correlation value must appear.

We compute Score S: the total effect of (A, B) on C

Figure: A is rescaled to k1·A, k2·A, k3·A, ...; for each scale the correlation between min(A,B) and C is computed (e.g. 0.1, 0.2, 0.3, 0.7), and the maximum (0.7) is taken as Score S.
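The scale search can be sketched as a simple loop over candidate scale factors. The grid of factors, the function names and the toy data below are assumptions for illustration; the slides do not say which scales are tried. The toy data are constructed so that C equals min(2·A, B) exactly, so the search should pick k = 2.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant series
    return sxy / math.sqrt(sxx * syy)


def score_s(a, b, c, scales):
    """Return (Score S, best scale): the maximum correlation between
    min(k*A, B) and C over the candidate scale factors k."""
    best_r, best_k = float("-inf"), None
    for k in scales:
        amount = [min(k * ai, bi) for ai, bi in zip(a, b)]
        r = pearson(amount, c)
        if r > best_r:
            best_r, best_k = r, k
    return best_r, best_k


# Toy data: c is exactly min(2*a, b), so k = 2 should win.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [4.0, 3.0, 8.0, 5.0, 6.0]
c = [2.0, 3.0, 6.0, 5.0, 6.0]

best_r, best_k = score_s(a, b, c, scales=[0.5, 1.0, 2.0, 4.0])
print(best_k, round(best_r, 3))
```

Scaling A by k while leaving B fixed is equivalent, up to a constant factor on the min, to scaling B by 1/k, so a one-dimensional grid is enough in this sketch.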


Estimating Combinatorial Effect from Score S


Score S consists of the "sole effect" and the "combinatorial effect"

Compute Score S': Score S assuming no combinatorial effect

The difference between S and S' is the level of the combinatorial effect

Score S' is obtained by computing its statistical distribution



How to compute the distribution of score S'?

Assume that the expression levels of proteins A, B and C follow normal distributions

Computer simulation gives the distribution of Score S'

Obtain the distribution of score S':

Randomly create distributions of A, B and C where the correlation coefficient of A-C is α and that of B-C is β

Repeat the computation of score S (e.g. Score S' for α=0.5, β=0.3, Score S' for α=0.5, β=0.4)

Create the table of average and stddev for each α and β

We can obtain the distribution for each α and β.

Table: average (upper) and stddev (lower) of Score S'

                                  Correlation α (A→C)
Correlation β (B→C)               0.05       0.10       0.15       0.20
0.05                 average      0.06921    0.10113    0.14296    0.18771
                     stddev       0.03089    0.03220    0.03206    0.03179
0.10                 average                 0.12239    0.15603    0.19713
                     stddev                  0.03376    0.03392    0.03327
0.15                 average                            0.17806    0.21105
                     stddev                             0.03498    0.03506
0.20                 average                                       0.23262
                     stddev                                        0.03618
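A rough Monte Carlo sketch of this simulation, under assumptions that are not stated on the slides: α and β are taken as the A-C and B-C correlations, A and B are generated independently given C, expression levels are normal with a hypothetical mean of 1.0 and spread of 0.2, and a small hypothetical grid of scales is used. The resulting numbers are therefore not expected to reproduce the table above.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def score_s(a, b, c, scales):
    """Maximum correlation between min(k*A, B) and C over the scales."""
    return max(
        pearson([min(k * ai, bi) for ai, bi in zip(a, b)], c) for k in scales
    )


def simulate_s_prime(alpha, beta, n_samples=195, n_trials=200,
                     scales=(0.25, 0.5, 1.0, 2.0, 4.0),
                     mean=1.0, spread=0.2, seed=0):
    """Average and stddev of Score S' when C depends on A and B only
    through plain correlations alpha (A-C) and beta (B-C)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        a, b, c = [], [], []
        for _ in range(n_samples):
            zc = rng.gauss(0.0, 1.0)
            # Standard construction of a variable with a given correlation to zc.
            za = alpha * zc + math.sqrt(1.0 - alpha ** 2) * rng.gauss(0.0, 1.0)
            zb = beta * zc + math.sqrt(1.0 - beta ** 2) * rng.gauss(0.0, 1.0)
            a.append(mean + spread * za)
            b.append(mean + spread * zb)
            c.append(mean + spread * zc)
        scores.append(score_s(a, b, c, scales))
    return statistics.fmean(scores), statistics.stdev(scores)


avg, sd = simulate_s_prime(0.05, 0.10)
print(round(avg, 3), round(sd, 3))
```

Repeating the call over a grid of α and β values would fill a table of the same shape as the one on the slide.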



Computing Combinatorial Effect as Z-score

Place the score S in the distribution of S'

Z-score: measures the difference between score S and the average of S' as a count of standard deviations

Z-score = (score S - avg(S')) / stddev(S')

The z-score gives the level of the combinatorial effect: the higher the z-score is, the stronger the combinatorial effect is!
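As a small worked example of the z-score formula, the sketch below uses one average/stddev pair that appears among the simulated values of S' (0.12239 and 0.03376) together with a hypothetical observed Score S of 0.30, chosen only for illustration.

```python
def z_score(score_s, avg_s_prime, stddev_s_prime):
    """Distance of Score S from the average of S', in standard deviations."""
    return (score_s - avg_s_prime) / stddev_s_prime


# 0.30 is a made-up Score S; the other two values come from the S' table.
z = z_score(0.30, 0.12239, 0.03376)
print(round(z, 2))  # prints 5.26
```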



Summary of the proposed algorithm

1. Try all combinations of A, B and C taken from the expression data: (A,B)→C, (A,B)→D, (A,B)→E, (A,B)→F, ..., (A,C)→B, (A,C)→D, (A,C)→E, (A,C)→F, ..., (B,C)→A, (B,C)→D, (B,C)→E, (B,C)→F, ...

2. For each combination, compute the maximum correlation coefficient among all scales of A and B to obtain Score S (e.g. correlations of 0.3, 0.8 and 0.5 over the tried scales give Score S = 0.8)

3. Compute the z-score from the distribution of S' (e.g. z-score = 5.5) and create a ranking by z-score

Rank    Combination    Z-score
1       (A,C)→B        5.5
2       (B,C)→E        4.9
3       (A,B)→F        4.7
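The enumeration and ranking steps can be sketched as follows. The function names are illustrative, and the z-scores passed to the ranking are the three example values from the slide rather than computed results.

```python
from itertools import combinations


def enumerate_combinations(proteins):
    """Yield every (A, B, C): an unordered pair (A, B) and a distinct target C."""
    for a, b in combinations(proteins, 2):
        for c in proteins:
            if c != a and c != b:
                yield (a, b, c)


def rank_by_z(z_scores):
    """Sort combinations by z-score, highest first."""
    return sorted(z_scores.items(), key=lambda item: item[1], reverse=True)


triples = list(enumerate_combinations("ABCDEF"))
print(len(triples))  # 15 pairs x 4 remaining targets = 60

ranking = rank_by_z({
    ("A", "B", "F"): 4.7,
    ("A", "C", "B"): 5.5,
    ("B", "C", "E"): 4.9,
})
for rank, ((a, b, c), z) in enumerate(ranking, start=1):
    print(rank, f"({a},{b})→{c}", z)
```

Enumerated this way, n proteins give C(n,2)·(n-2) candidate combinations, so the cost grows roughly with the cube of the number of proteins.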


Evaluation


Applying our method to real expression data

Protein expression data of black cattle

# of samples is 195, # of proteins is 879

Finding combinatorial protein-protein interactions using our method


The Expression Data Follows Normal Distribution


Using the Jarque-Bera test at a confidence level of 95%, we test whether the expression data of each protein follow a normal distribution.

Result: 454 proteins out of 879 proteins follow a normal distribution

Thus, we use these 454 proteins for the evaluation
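For reference, a self-contained sketch of the Jarque-Bera statistic in plain Python. It uses the standard definition JB = n/6 · (S² + (K − 3)²/4) with sample skewness S and kurtosis K, and the asymptotic chi-square distribution with 2 degrees of freedom for the p-value (whose upper tail is exp(−JB/2)). Whether the authors used this asymptotic p-value is not stated; the data below are synthetic.

```python
import math
import random


def jarque_bera(x):
    """Return (JB statistic, asymptotic p-value) for a sample x."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)  # chi-square(2) upper tail
    return jb, p_value


rng = random.Random(0)
# Synthetic, strongly skewed "expression levels" for one protein (195 samples).
skewed = [rng.expovariate(1.0) for _ in range(195)]
jb_skewed, p_skewed = jarque_bera(skewed)

# A protein is kept for the evaluation only if normality is not rejected.
keep = p_skewed >= 0.05
print(round(jb_skewed, 1), keep)
```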




Results


We found many combinations of proteins which would have a combinatorial effect

The maximum value of the z-score is 11.0

The combinations whose z-score is more than about 5.5 (p-value less than 0.000000019 (= 0.05/454C3)) would have a combinatorial effect at a confidence level of 95%.
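The arithmetic behind this threshold can be checked with a few lines of Python. The sketch assumes a one-sided normal tail and a Bonferroni-style division of 0.05 by the number of trials 454C3; both are readings of the slide, not confirmed details. Under these assumptions 0.05/454C3 comes out near 3.2e-9, while the one-sided tail at z = 5.5 is near 1.9e-8, the value quoted above; the sketch computes both rather than deciding which was intended.

```python
import math


def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


n_trials = math.comb(454, 3)          # number of 3-protein subsets
corrected_level = 0.05 / n_trials     # Bonferroni-style per-test level
tail_at_5_5 = normal_upper_tail(5.5)  # one-sided p-value at z = 5.5

print(n_trials)  # 15493204
print(f"{corrected_level:.2e}", f"{tail_at_5_5:.2e}")
```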


Figure: the histogram of z-scores (horizontal axis: z-score, in bins of 0.5 from 7.0 to 11.5; vertical axis: # of combinations, 0 to 300)


Comparing z-scores with normal distribution


We compare the histogram with the one expected without combinatorial effect

The latter is created by scaling the normal distribution by the number of trials (454C3)

It is inferred that this data includes a considerable amount of combinatorial effect
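A sketch of how such an expected histogram could be built: the expected number of combinations in a z-score bin is the number of trials times the standard normal probability of that bin. The bin edges and function names are assumptions for illustration.

```python
import math


def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def expected_count(lo, hi, n_trials):
    """Expected number of trials with z-score in [lo, hi) under the
    standard normal distribution (no combinatorial effect)."""
    return n_trials * (normal_upper_tail(lo) - normal_upper_tail(hi))


n_trials = math.comb(454, 3)
edges = [7.0, 7.5, 8.0, 8.5, 9.0]
expected = [expected_count(lo, hi, n_trials)
            for lo, hi in zip(edges, edges[1:])]
for lo, value in zip(edges, expected):
    print(f"{lo:.1f}-{lo + 0.5:.1f}: {value:.2e}")
```

Every expected count in this range is far below one, which is why observing many combinations there points to a real effect.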

Figure: histogram of real data, i.e. the estimated distribution of z-scores obtained from real data (horizontal axis: z-score; vertical axis: # of combinations, 0 to 300)

Figure: histogram without combinatorial effect, i.e. the distribution of z-scores under the assumption of no combinatorial effect (horizontal axis: z-score; vertical axis: expected # of combinations, 0 to 0.05)


The Ranking based on Z-score


The ranking table shows that:

Combinations with a low score S are retrieved.

The same protein tends to appear many times.

Table: the ranking of z-scores obtained from real data (A, B and C are protein spot numbers)

Rank   A      B      C      Cor(A,C)   Cor(B,C)   Score S   Z-score
1      5146   6239   4470    0.092      0.317      0.674     11.02
2      6154   6239   5418    0.071      0.339      0.674     11.01
3      2572   4292   6239    0.137      0.293      0.561     10.94
4      5146   6239   1468    0.173      0.371      0.729     10.29
5      5661   6281   5342   -0.007      0.390      0.504     10.20
6      5146   6239   4478    0.089      0.315      0.648     10.19
7      5661   6281   5730    0.058      0.434      0.613     10.17
8      5026   6239   1333    0.052      0.350      0.560     10.15
9      5026   6239   3626    0.029      0.314      0.470     10.14
10     5695   6143   6042    0.148      0.444      0.640     10.12


Conclusion

Summary

We propose a method to estimate the combinatorial effect of three proteins from protein expression data

Applying the method to real data, we found many combinations which would have a combinatorial effect

Future work

To confirm the reliability, we are planning to study whether or not the found combinations include well-known protein-protein interactions.