17 Nov 2013

Protein Secondary Structure Prediction:
A New Improved Knowledge-Based Method

Wen-Lian Hsu

Institute of Information Science

Academia Sinica, Taiwan

Outline

- Introduction
  - PSSP
  - Motivation
- Knowledge-Based Method
  - PROSP
- An Improved Hybrid Method
  - PROSP II
  - HYPROSP II+
- Conclusion

Protein Structures

- Primary sequence: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
- Secondary structures: helices, strands, loops
- Tertiary structures: three-dimensional packing of secondary structures

Introduction to PSSP

- Protein Secondary Structure Prediction (PSSP) aims to predict a protein's secondary structure from its sequence alone.
- Each amino acid is assigned a secondary structure element (SSE): Helix (H), Strand (E), or Coil (C or L).
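
The per-residue labeling described above can be illustrated with a minimal Python sketch. The helper names `normalize_sse` and `sse_segments` and the example labels are hypothetical and serve only to show the data shape: one SSE symbol per amino acid, with L treated as a synonym for C.

```python
from itertools import groupby

SSE_STATES = {"H", "E", "C"}  # Helix, Strand, Coil ("L" is treated as "C")

def normalize_sse(labels: str) -> str:
    """Map the alternative coil symbol 'L' to 'C' and validate the labels."""
    out = labels.upper().replace("L", "C")
    bad = set(out) - SSE_STATES
    if bad:
        raise ValueError(f"unknown SSE symbols: {sorted(bad)}")
    return out

def sse_segments(sequence: str, labels: str):
    """Return (state, start, end) runs; one label per amino acid is required."""
    labels = normalize_sse(labels)
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    segments, pos = [], 0
    for state, run in groupby(labels):
        n = len(list(run))
        segments.append((state, pos, pos + n))  # end index is exclusive
        pos += n
    return segments

# Hypothetical labeling of a short peptide, for illustration only.
print(sse_segments("MTYKLILNGK", "CEEEEEELLC"))
# [('C', 0, 1), ('E', 1, 7), ('C', 7, 10)]
```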

Motivation

- PSSP plays an important role in tertiary structure prediction.
  - Fischer (1996) improved tertiary structure prediction accuracy from 59.0 to 71.0 by using PHD to predict SSEs.
  - Yang (2003) improved tertiary structure prediction accuracy from 71.9 to 79.0 by using PSIPRED to predict SSEs.
- Predicted SSEs can also be used as features in other prediction algorithms to improve performance.

Treat PSSP as a Translation Problem

- Secondary structure prediction translates a language with a 20-letter alphabet (amino acids) into a language with a 3-letter alphabet (H, E, C).

Treating Genomic/Proteomic Sequences
as a Language

- For proteomic data:
  - Amino acid, motif, protein
  - Alphabet, word, sentence, paragraph
  - Protein structure or function corresponds to sentence meaning
- Finding the interrelationships of data
  - Data mining, knowledge discovery


Matching by Semantics
(prediction based on evolutionary information)

Existing sentences in database (understood):
- His old father gave me a book.
- Joan loves Andy.

Understanding a new sentence:
- Mary's lovely daughter does not like John.

Techniques
- Corpus analysis
- Pattern discovery and matching
- Sequence, semantics (classification, transformation)

Structure prediction
Speech recognition

Example
- Sense disambiguation in English
- Selection of homonyms (or senses) in speech recognition

How do we represent the context in
a protein sequence (or sentence)?


- Using motifs as words?
  - Motifs can be too specific and do not provide enough coverage.
- What about using k-mers?
  - Can build (k-mer, structure) pairs
  - How many k-mers can we get?
  - How do we define similar k-mers (in context)?
  - How do we combine the structural information from the k-mers?
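
To make the (k-mer, structure) pairs and the "how many k-mers" question concrete, here is a minimal Python sketch. The function names and the structure labels are hypothetical; the sliding window of step 1 is an assumption, since the slides do not say how windows are taken.

```python
def kmer_structure_pairs(sequence: str, structure: str, k: int):
    """Slide a window of length k and pair each k-mer with its structure string."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    return [(sequence[i:i + k], structure[i:i + k])
            for i in range(len(sequence) - k + 1)]

def kmer_space_size(k: int, alphabet_size: int = 20) -> int:
    """Number of distinct k-mers over the 20-letter amino acid alphabet."""
    return alphabet_size ** k

seq = "MYKKILYPTD"   # short example peptide
sse = "CCHHHHHCCC"   # hypothetical labels, for illustration only
pairs = kmer_structure_pairs(seq, sse, k=7)
print(len(pairs))          # 4 windows: 10 - 7 + 1
print(pairs[0])            # ('MYKKILY', 'CCHHHHH')
print(kmer_space_size(7))  # 1280000000
```

The theoretical space of 7-mers (20^7) is far larger than any collection of observed peptides, which is one way to see why coverage and a notion of "similar" k-mers matter.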




PROSP

Our knowledge-based method for PSSP

- Construct a peptide Sequence-Structure Knowledge Base (SSKB).
- Use PSI-BLAST to find all peptides similar to those of the target protein.
- Use the similar peptides found in the SSKB to vote for the dominant structure of each amino acid in the target protein.

Using PSI-BLAST to Amplify the Effect of
the DSSP Database (create more synonyms)

- The number of peptide words is still small (~5 million).
- Identify similar peptides:
  - For each protein p in the NR database, apply PSI-BLAST to find its HSPs (high-scoring segment pairs).
  - HSP: an alignment of a subsequence of protein p with another protein q of unknown structure.
  - Assign the structure of the "selected" peptides of p to those of q.
- These peptides comprise our dictionary (~100 million).
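
The structure-transfer step can be sketched as follows. This is only an illustration under stated assumptions: the HSP is taken to be ungapped (position i of p aligns with position i of q), every window is transferred (the slides do not specify how peptides are "selected"), and the function names and sequences are hypothetical.

```python
from collections import defaultdict

def transfer_structure(p_segment, p_structure, q_segment, window=7):
    """Copy the structure of each window of p onto the aligned window of q.

    Assumes an ungapped HSP, so position i of p aligns with position i of q.
    """
    if not (len(p_segment) == len(p_structure) == len(q_segment)):
        raise ValueError("ungapped HSP segments must have equal length")
    return [(q_segment[i:i + window], p_structure[i:i + window])
            for i in range(len(q_segment) - window + 1)]

def add_to_dictionary(dictionary, records):
    """Accumulate (peptide, structure) records; duplicates are kept as counts."""
    for peptide, structure in records:
        dictionary[peptide][structure] += 1
    return dictionary

sskb = defaultdict(lambda: defaultdict(int))
# p has known structure; q is the aligned protein of unknown structure.
records = transfer_structure("KTYQCQYA", "HHHHHHHC", "KPYQCQYS")
add_to_dictionary(sskb, records)
print(records)
# [('KPYQCQY', 'HHHHHHH'), ('PYQCQYS', 'HHHHHHC')]
```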

SSKB construction (synonyms)

An example of a High-scoring Segment Pair (HSP) from a PSI-BLAST search result

[Figure: HSP alignment between a protein of known structure and a protein of unknown structure]

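Prediction at a position x

[Figure: peptides matched by PSI-BLAST are looked up in the SSKB, and their structures at position x (here H, H, H, C, E, C) are tallied into the voting scores H(x), E(x), and C(x). H(x) is the highest, so x is assigned as helix.]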
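
A minimal Python sketch of the vote at a single position, using the six structures shown on this slide. Unweighted votes and the H, E, C tie-break order are assumptions; the slides do not state how votes are weighted or how ties are resolved.

```python
from collections import Counter

def vote(structures_at_x):
    """Tally H/E/C votes for one position and return (scores, winner)."""
    counts = Counter(structures_at_x)
    scores = {state: counts.get(state, 0) for state in "HEC"}
    # Ties are broken in H, E, C order here; this tie-break is an assumption.
    winner = max("HEC", key=lambda s: scores[s])
    return scores, winner

scores, winner = vote(["H", "H", "H", "C", "E", "C"])
print(scores, winner)  # {'H': 3, 'E': 1, 'C': 2} H
```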


Two problems of searching for homologous
peptides in protein sequence databases

- Redundant information generated by duplicate peptides
  - The voting bias problem in PROSP
- Poor prediction accuracy due to insufficient knowledge base matching
  - Boost coverage


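The voting bias problem

The PSI-BLAST results for the query KTYQCQY, with the structure found in the SSKB for each subject:

Subject   Structure
KPYQCQY   HHHHHH
KPYQCQY   HHHHHH
KPYQCQY   HHHHHH
KPYQCQY   HHHHHH
KPYQCQY   HHHHHH
KVYQCQY   CCHHHC
QPYRCKY   CCHHHC

Dominant result: HHHHHH, contributed by the five duplicate KPYQCQY hits.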

Clustering HSPs

…MYKKILYPTDFSETAEIALK…

Similar HSPs, grouped:
MYSKILL (3 copies)
MYKKIYL (4 copies)
MYSSILY (2 copies)
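
One way to see how clustering reduces voting bias is the sketch below, which reuses the hits from the voting bias slide. Grouping by exact peptide match and giving each cluster a single vote are assumptions made for illustration; the slides do not state the similarity criterion or the weighting used in PROSP II.

```python
from collections import Counter, defaultdict

def cluster_hsps(hsps):
    """Group (peptide, structure) hits by peptide sequence (exact match only)."""
    clusters = defaultdict(list)
    for peptide, structure in hsps:
        clusters[peptide].append(structure)
    return clusters

def votes_per_cluster(clusters):
    """Let each cluster cast one vote: its most common structure."""
    return Counter(Counter(structs).most_common(1)[0][0]
                   for structs in clusters.values())

hits = [("KPYQCQY", "HHHHHH")] * 5 + [("KVYQCQY", "CCHHHC"), ("QPYRCKY", "CCHHHC")]
raw = Counter(structure for _, structure in hits)
clustered = votes_per_cluster(cluster_hsps(hits))
print(raw)        # Counter({'HHHHHH': 5, 'CCHHHC': 2})
print(clustered)  # Counter({'CCHHHC': 2, 'HHHHHH': 1})
```

With raw counts the five duplicates decide the outcome; after clustering, each distinct peptide contributes once.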

Measuring the amount of structural
information

[Figure: HSPs aligned along the query, with regions marked as found or unfound. In a region with a low local match rate, there is no information from SSKB7.]

Construct SSKB with different
lengths (to boost coverage)

[Figure: two parallel pipelines, one per window length]
Training protein -> PSI-BLAST search -> HSPs -> SSKB construction (window length = 7) -> SSKB (window length = 7)
Training protein -> PSI-BLAST search -> HSPs -> SSKB construction (window length = 5) -> SSKB (window length = 5)
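
The effect of a shorter window on coverage can be illustrated with a toy Python sketch. It is a simplification under loud assumptions: the knowledge base holds peptides only, lookup is by exact match rather than PSI-BLAST similarity, the training sequences are invented, and `match_rate` is a stand-in that is not claimed to be the local match rate used in the slides.

```python
def build_kb(training_sequences, window):
    """Collect every peptide of the given window length from the training set."""
    return {seq[i:i + window]
            for seq in training_sequences
            for i in range(len(seq) - window + 1)}

def match_rate(query, kb, window):
    """Fraction of query residues covered by at least one peptide found in kb."""
    covered = [False] * len(query)
    for i in range(len(query) - window + 1):
        if query[i:i + window] in kb:
            for j in range(i, i + window):
                covered[j] = True
    return sum(covered) / len(query)

training = ["MYKKILYAAA", "GGGTDFSETGGG"]  # toy sequences, not real data
query = "MYKKILYPTDFSETAEIALK"
for w in (7, 5):
    print(w, match_rate(query, build_kb(training, w), w))
# 7 0.35
# 5 0.65
```

In this toy case the 5-residue window covers residues that the 7-residue window misses, at the cost of shorter and therefore less specific matches.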

Boost match rate using peptide records
of different lengths

Protein: MYKKILYPTDFSETAEIALK

Voting scores from HSPs in SSKB7 (window length = 7):
H  1 2 1 3 6 7 8…
E  1 2 2 0 0 0 1…
C  2 3 8 8 5 4 2…

Voting scores from HSPs in SSKB5 (window length = 5):
H  1 3 2 5 5 5 2…
E  1 3 2 0 0 0 1…
C  2 4 7 7 6 6 7…

NEW PROSP system

Protein: MYKKILYPTDFSETAEIALK

Voting scores from SSKB7 (window length = 7):
H  1 2 1 3 6 7 8…
E  1 2 2 0 0 0 1…
C  2 3 8 8 5 4 2…

Voting scores from SSKB5 (window length = 5):
H  1 3 2 5 5 5 2…
E  1 3 2 0 0 0 1…
C  2 4 7 7 6 6 7…

$$H_{\mathrm{PROSPII}}(x) = \mathrm{LMR}_{7\mathrm{mer}}(x) \times H_7(x) + \bigl(1 - \mathrm{LMR}_{7\mathrm{mer}}(x)\bigr) \times H_5(x)$$

$$E_{\mathrm{PROSPII}}(x) = \mathrm{LMR}_{7\mathrm{mer}}(x) \times E_7(x) + \bigl(1 - \mathrm{LMR}_{7\mathrm{mer}}(x)\bigr) \times E_5(x)$$

$$C_{\mathrm{PROSPII}}(x) = \mathrm{LMR}_{7\mathrm{mer}}(x) \times C_7(x) + \bigl(1 - \mathrm{LMR}_{7\mathrm{mer}}(x)\bigr) \times C_5(x)$$

Resulting scores:
H  1 3 2 5 7 6 7…
E  1 3 2 0 0 0 1…
C  2 4 8 8 4 5 6…
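
The combination on this slide weights the 7-mer score by the local match rate LMR7mer(x) and the 5-mer score by 1 - LMR7mer(x). A minimal Python sketch follows; it assumes LMR7mer(x) is expressed as a fraction in [0, 1] rather than a percentage, and the input values are hypothetical rather than taken from the slide.

```python
def combine_scores(lmr7, score7, score5):
    """Blend the 7-mer and 5-mer voting scores at one position.

    lmr7 is the local match rate of the 7-mer SSKB at that position, in [0, 1].
    """
    if not 0.0 <= lmr7 <= 1.0:
        raise ValueError("lmr7 must lie in [0, 1]")
    return lmr7 * score7 + (1.0 - lmr7) * score5

def prosp2_scores(lmr7, hec7, hec5):
    """Apply the blend to each of the H, E, C scores."""
    return {s: combine_scores(lmr7, hec7[s], hec5[s]) for s in "HEC"}

# Hypothetical values for one position x, for illustration only.
hec7 = {"H": 6.0, "E": 0.0, "C": 5.0}
hec5 = {"H": 5.0, "E": 0.0, "C": 6.0}
print(prosp2_scores(0.25, hec7, hec5))
# {'H': 5.25, 'E': 0.0, 'C': 5.75}
```

When the 7-mer knowledge base matches well (LMR7mer close to 1) its scores dominate; when it matches poorly, the 5-mer scores take over.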

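Hybrid by Neural Network

[Figure: the query protein is processed by three components, whose outputs are fed to a neural network that produces the final result]
- PSIPRED: H score, E score, C score (3 features)
- PROSP: H score, E score, C score (3 features)
- PSI-BLAST: PSSM (20 features)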
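
The 3 + 3 + 20 inputs can be assembled as sketched below. The function name and the values are hypothetical, and the sketch builds the vector for a single residue only; whether the network also sees neighboring residues is not stated in the slides.

```python
def build_features(psipred_hec, prosp_hec, pssm_row):
    """Concatenate 3 + 3 + 20 values into one 26-dimensional input vector."""
    if len(psipred_hec) != 3 or len(prosp_hec) != 3:
        raise ValueError("expected H, E, C scores from each predictor")
    if len(pssm_row) != 20:
        raise ValueError("expected one PSSM value per amino acid type")
    return list(psipred_hec) + list(prosp_hec) + list(pssm_row)

# Hypothetical scores for one residue, for illustration only.
features = build_features([0.7, 0.1, 0.2], [5.25, 0.0, 5.75], [0] * 20)
print(len(features))  # 26
```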

Data Sets

- Two widely used test sets
  - CB513
  - EVAc4
- Derivation of the training sets
  - Get 4,572 unique protein chains (with less than 25% mutual sequence identity) from the DSSP database.
  - Further remove protein chains with over 25% sequence identity to the respective test dataset to obtain the respective training datasets.
  - The final training datasets consist of 4,395 and 4,055 protein chains for EVAc4 and CB513, respectively.
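
The filtering step can be sketched as below. The function names and sequences are hypothetical, and `naive_identity` is a toy stand-in: a real pipeline would compute sequence identity from a proper alignment, which the slides do not describe.

```python
def filter_training(candidates, test_set, identity, threshold=0.25):
    """Keep candidate chains whose identity to every test chain is <= threshold."""
    return [c for c in candidates
            if all(identity(c, t) <= threshold for t in test_set)]

def naive_identity(a, b):
    """Toy stand-in: fraction of identical positions over the shorter length.

    A real pipeline would compute identity from a proper sequence alignment.
    """
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

candidates = ["MYKKILYP", "AAAAGGGG", "MYKKAAAA"]  # toy chains
test_set = ["MYKKILYP"]
print(filter_training(candidates, test_set, naive_identity))
# ['AAAAGGGG']
```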

The respective performance improvement
using SSKB5 and SSKB7

[Figure: Q3 (%) plotted against LMR7mer (%) in the bins [0,10), [10,20), [20,30), [30,40), and [40,50), for the 7-mer SSKB, the 5-mer SSKB, and PROSP II.]

Performance of prediction on CB513 by SSKB5, SSKB7, and PROSP II for LMR7mer lower than 50%.

Performance of HYPROSP II+

Method          Q3    SOV  Q_H_o  Q_H_p  Q_E_o  Q_E_p  Q_C_o  Q_C_p   Info
HYPROSP II+  80.35  78.66  78.65  83.85  61.10  71.27  81.79  76.35   0.44
Errsig        0.84   1.20   1.87   1.75   2.33   2.15   1.05   1.15   0.02
PROFsec      76.54  75.39  67.30  74.00  43.70  43.20  76.80  73.50   0.38
PSIPRED      77.62  76.05  72.90  71.50  38.60  42.30  73.50  76.40   0.38
SAM-T99sec   77.64  75.05  75.50  69.60  38.80  47.30  72.40  75.70   0.39
YASSPP       79.34  78.65     --     --     --     --     --     --   0.42
HYPROSP II   79.32  76.51  81.49  77.85  60.91  68.83  76.98  77.78   0.41
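
For reference, Q3 is the percentage of residues whose three-state label (H, E, or C) is predicted correctly. A minimal Python sketch, with hypothetical label strings:

```python
def q3(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state (H, E, or C) is correct."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

# Hypothetical prediction and observation, for illustration only.
print(q3("HHHHCCEEEC", "HHHCCCEEEE"))  # 80.0
```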

Conclusion

- HYPROSP II+
  - Uses a more robust knowledge-based algorithm, PROSP II
  - More structural information, better prediction
  - Incremental learning
- The general strategy developed in this paper could be used to enhance the performance of similar approaches in other prediction problems.




People

Wen-Lian Hsu

Ting-Yi Sung

Hsin-Nan Lin

Jia-Ming Chang

Ei-Wen Yang