2° structure, TM regions, and solvent accessibility

Uploaded by runmidge (category: Artificial Intelligence and Robotics), 20 Oct 2013


Topic 13

Chapter 29, Gu and Bourne, “Structural Bioinformatics”

The Truth (Information) is Out (In) There

But we’re still having a tough time finding it.

Given a protein sequence (primary structure), predict its secondary structure:

GHWIATRGQLIREAYEDYRHFSSECPFIP
CEEEEECCCEEEEECCCHHHHHHCCCCCC

E: β-strand

H: α-helix

C: coil

Assumption: short stretches of residues have a propensity to adopt a certain conformation.

The conformation of the central residue in a sequence fragment depends only on its flanking residues (sliding window).
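The sliding-window assumption can be illustrated with a minimal sketch. The function name `sliding_windows`, the half-width of 2, and the `-` padding character are illustrative assumptions, not part of the slides.

```python
def sliding_windows(sequence, half_width=2, pad="-"):
    """Yield (central_residue, window) for every position in the sequence.

    Positions near the termini are padded so that every window has the
    same length, 2 * half_width + 1.
    """
    padded = pad * half_width + sequence + pad * half_width
    for i, residue in enumerate(sequence):
        yield residue, padded[i : i + 2 * half_width + 1]


# First nine residues of the example sequence above.
windows = list(sliding_windows("GHWIATRGQ", half_width=2))
for central, window in windows:
    print(central, window)
```

A predictor built on this assumption only ever sees one such window at a time, which is why long-range effects are invisible to it.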

Protein Secondary Structure Prediction



H: (H: α-helix, G: 3₁₀ helix, I: π-helix)

E: (E: β-strand, B: bridge)

C: (T: β-turn, S: bend, C: coil)
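The eight-to-three state grouping listed above can be written as a lookup table. This is a minimal sketch: the names `DSSP_TO_3STATE` and `reduce_states` are hypothetical, and treating any unlisted symbol (such as a blank or `-`) as coil is an assumption.

```python
# Grouping taken from the slide: H, G, I -> H; E, B -> E; T, S, C -> C.
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "C": "C",
}


def reduce_states(states, default="C"):
    """Map a string of eight-state labels to the three states H, E, C.

    Symbols not in the table are mapped to `default` (an assumption).
    """
    return "".join(DSSP_TO_3STATE.get(s, default) for s in states)


print(reduce_states("HGIEBTSC"))
```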

-- Because we can (kind of).

-- Because it could be a first step towards prediction of protein tertiary structure.

Why secondary structure prediction?


“Have solution, need problem.” Nearly every imaginable algorithm has been applied to secondary structure prediction.

1. First generation: single amino acid propensities
Chou-Fasman method (1974), GOR I-IV
~56-60% accuracy

2. Second generation: segments of 3-51 adjacent residues
NNSSP, SSPAL
~65% accuracy

3. Neural network: PHD, Psi-Pred, J-Pred

4. Support vector machine (SVM)

5. Hidden Markov Models (HMM)

Third generation methods using evolutionary information: ~76% accuracy

Secondary Structure Prediction Methods

1. Three-state per-residue prediction accuracy

M_ii: number of residues observed in state i and predicted in state i

N_obs: total number of residues observed in the 3 states
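With these two quantities, the three-state per-residue accuracy is commonly written as Q3 = 100 × (Σ_i M_ii) / N_obs, summing over the states H, E and C. The sketch below computes it from two aligned state strings; the function name `q3` and the toy strings are illustrative assumptions.

```python
def q3(observed, predicted, states="HEC"):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    # M_ii: residues observed in state i and also predicted in state i.
    m = {
        s: sum(1 for o, p in zip(observed, predicted) if o == s and p == s)
        for s in states
    }
    # N_obs: residues observed in any of the three states.
    n_obs = sum(1 for o in observed if o in states)
    if n_obs == 0:
        raise ValueError("no residues observed in the given states")
    return 100.0 * sum(m.values()) / n_obs


# Toy example: 8 of 10 positions agree.
score = q3("CEEEEECCCC", "CEEEECCCCH")
print(score)
```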

Secondary Structure Prediction Accuracy

2. Per-segment prediction accuracy (SOV, Segment OVerlap)

Per-state segment overlap:



S1: observed SS segment

S2: predicted SS segment

Calculate the propensity for a given amino acid to adopt a certain SS type.

Example: from a data set with 30 proteins

#Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 580

$p(\alpha, aa) = 580/20{,}000$, $p(\alpha) = 4{,}000/20{,}000$, $p(aa) = 2{,}000/20{,}000$

$P_i^{\alpha} = \dfrac{p(\alpha, aa)}{p(\alpha)\,p(aa)} = \dfrac{580}{4{,}000/10} = 1.45$

$i$: amino acid; $\alpha$: secondary structure state
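The alanine example can be reproduced in a few lines. This is a minimal sketch: the function name `propensity` and its argument names are hypothetical, and the counts are the ones quoted in the example.

```python
def propensity(n_aa_in_state, n_state, n_aa, n_total):
    """Ratio of the joint frequency to the product of the marginal frequencies.

    A value above 1 means the amino acid is found in this state more often
    than expected if residue type and state were independent.
    """
    p_joint = n_aa_in_state / n_total   # p(state, aa)
    p_state = n_state / n_total         # p(state)
    p_aa = n_aa / n_total               # p(aa)
    return p_joint / (p_state * p_aa)


# Counts from the example: Ala in helix.
ala_helix = propensity(n_aa_in_state=580, n_state=4000, n_aa=2000, n_total=20000)
print(round(ala_helix, 2))
```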


Single Residue Propensity Methods


Amino Acid Propensities to Secondary Structures

Chou-Fasman method

The idea is simple: predict the SS of the central residue of a given segment from homologous segments (neighbors).

For example, from a database, find some number of the closest sequences to a subsequence defined by a window around the central residue, then use max(N_α, N_β, N_c) to assign the SS.


Nearest Neighbor Methods

[Figure: query window RSTEVRAS-R-QLAKEKVN (central residue R) with the labels "Window size" and "Homologous sequences"; neighbor states shown: E, C, C, H, H, C, C, C]

Key parameters:

1. How to define similarity?

2. What size window of sequence should be examined?

3. How many close sequences should be selected?
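The three parameters above map directly onto the arguments of a toy nearest-neighbor predictor. This is only a sketch under stated assumptions: similarity is plain residue identity, the database is a hypothetical list of (window, central state) pairs, and the window length of 5 is chosen for brevity. Published methods use more refined similarity scores.

```python
from collections import Counter


def identity(a, b):
    """Similarity = number of identical residues at matching positions."""
    return sum(x == y for x, y in zip(a, b))


def nearest_neighbor_state(window, database, k=3, similarity=identity):
    """Assign the most common state among the k most similar database windows."""
    ranked = sorted(database, key=lambda entry: similarity(window, entry[0]),
                    reverse=True)
    votes = Counter(state for _, state in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical database of windows with the known state of their central residue.
toy_database = [
    ("ASRQL", "H"),
    ("ASRQV", "H"),
    ("GSRQL", "C"),
    ("PPGPP", "C"),
    ("VIVIV", "E"),
]
predicted = nearest_neighbor_state("ASRQL", toy_database, k=3)
print(predicted)
```

Changing `similarity`, the window length, or `k` changes the prediction, which is why these choices dominate the performance of this family of methods.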

The Devil is in the details…


D. Jones, J. Mol. Biol. 292, 195 (1999).


Method: neural network

Input data: PSSM generated by PSI-BLAST

Bigger and better sequence database
Combining several databases and data filtering

Training and test set preparation
SS prediction only makes sense for proteins with no homologous structure.
No sequence or structural homologues between training and test sets, as checked by CATH and PSI-BLAST (mimicking a realistic situation).




Psi-Pred Method


Window size = 15

Two networks

First network (sequence-to-structure):
315 = (20 + 1) × 15 inputs
Extra unit to indicate where the window spans either the N or C terminus
Data are scaled to the [0, 1] range by using 1/[1 + exp(-x)]
75 hidden units
3 outputs (H, E, L)

Second network (structure-to-structure):
Structural correlation between adjacent sequences
60 = (3 + 1) × 15 inputs
60 hidden units
3 outputs

Accuracy ~76%
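The input layer of the first network can be sketched as follows: 15 window positions, each contributing 20 logistic-scaled profile scores plus one terminus flag, for 315 values in total. The function names, the list-of-lists PSSM layout, and the choice of zeros for positions beyond the termini are assumptions made for illustration, not details confirmed by the slides.

```python
import math

WINDOW = 15  # window size from the slide


def logistic(x):
    """Scale a raw profile score to the (0, 1) range: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def encode_window(pssm, center):
    """Build the (20 + 1) * 15 = 315 inputs for the residue at index `center`.

    `pssm` is a list with one row of 20 scores per residue. The 21st value
    of each position is 1.0 when the window runs past a terminus.
    """
    half = WINDOW // 2
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(logistic(score) for score in pssm[pos])
            features.append(0.0)          # inside the sequence
        else:
            features.extend([0.0] * 20)   # assumed padding values
            features.append(1.0)          # beyond the N or C terminus
    return features


# Hypothetical all-zero profile for a 30-residue protein.
toy_pssm = [[0] * 20 for _ in range(30)]
inputs = encode_window(toy_pssm, center=0)
print(len(inputs))
```

The second network would be encoded the same way with 3 state scores plus the flag per position, giving (3 + 1) × 15 = 60 inputs.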


Psi-Pred Method -- Neural Network

Conf: Confidence (0=low, 9=high) --- very important!
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence

# PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)

Conf: 966899999997542002357777557999999716898188034435788873356776
Pred: CCHHHHHHHHHHHHHHHCCCCCCCHHHHHHHHHHHCCCCCEEECCCCEEEEEEECCCCCC
  AA: MMWEQFKKEKLRGYLEAKNQRKVDFDIVELLDLINSFDDFVTLSSCSGRIAVVDLEKPGD
              10        20        30        40        50        60

Conf: 777179998337888888988751235636899718261220179868899999998557
Pred: CCCCEEEEEECCCCCHHHHHHHHHCCCCCEEEEECCCEEEEECCCHHHHHHHHHHHHHCC
  AA: KASSLFLGKWHEGVEVSEVAEAALRSRKVAWLIQYPPIIHVACRNIGAAKLLMNAANTAG
              70        80        90       100       110       120

Conf: 200242314703799714651435541487355188999999999999999889999999
Pred: CCCCCCEECCCEEEEEECCCEEEEEECCCCCEEECHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: FRRSGVISLSNYVVEIASLERIELPVAEKGLMLVDDAYLSYVVRWANEKLLKGKEKLGRL
             130       140       150       160       170       180



Sample Psi-Pred Output

***Compare the prediction for residues 9 and 17***
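The comparison can be made explicit by reading the Conf, Pred and AA rows together. The sketch below parses output laid out in this horizontal style and reports individual residues. The function names are hypothetical, and the embedded text is only the first 20 columns of the sample, truncated here for brevity.

```python
def parse_horizontal(text):
    """Collect the Conf, Pred and AA rows of a horizontal-format prediction."""
    conf, pred, aa = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Conf:"):
            conf.append(line[5:].strip())
        elif line.startswith("Pred:"):
            pred.append(line[5:].strip())
        elif line.startswith("AA:"):
            aa.append(line[3:].strip())
    return "".join(aa), "".join(pred), [int(c) for c in "".join(conf)]


def residue_report(parsed, number):
    """Return (amino acid, predicted state, confidence) for a 1-based residue number."""
    aa, pred, conf = parsed
    i = number - 1
    return aa[i], pred[i], conf[i]


# First 20 columns of the sample output (truncated excerpt).
excerpt = """
Conf: 96689999999754200235
Pred: CCHHHHHHHHHHHHHHHCCC
  AA: MMWEQFKKEKLRGYLEAKNQ
"""
parsed = parse_horizontal(excerpt)
print(residue_report(parsed, 9))
print(residue_report(parsed, 17))
```

Both residues are predicted as helix, but residue 9 carries the highest confidence and residue 17 the lowest, so the second call is far less trustworthy than the first.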

Sample Psi-Pred Output - II

Again, voting-rule methods tend to be best.

ATKAVCVLKGDGPVQGTIHFEAKGDTVVVTGSITGLTEGDHGFHVHQFGDNTQGCTSAGP 2SOD
CCCCCCCCCCCCCCCCEEHCCHHECEEEEEEEEEEEECCCCCCCCCCCCCCCCCCCCCCC BPS
CCHEEEEECCCCCCCCEEEHHHCCCEEEEEEEEECECCCCCCEEEECCCCCCCCCCCCCC D_R
CCCEEEEEECCCCCEEEEEEEECCCEEEEEEEEEEEECCCCCEEEEECCCCCCCCCCCCC DSC
CCCEEEEECCCCCCCEEEEEECCCCEEEEEEEEECCCCCCCCEEEEEECCCCCCCCCCCC GGR
HHHCEEEECCCCCCCEEEEEECCCCEEEEEECEEEEEECCCCEEEEECCCCCCEEECCCC GOR
CCCCEEEECCCCCCCCCEEECCCCCCEEEEECEEECCCCCCCEEEECCCCCCCCEEECCC H_K
CCCCEEEEECCCCCCCCCEEECCCCCEEEECCCCCCCCCCCEEEEEEEECCCCCCCCCCC K_S
CCCCEEEECCCCCCCCEEEEECCCCEEEEEEEEEEECCCCCCEEEEECCCCCCCCCCCCC JOI
---EEEEE------EEEEEEEEE--EEEEEEEEE-----EEEEEEEE------------- 2SOD



HFNPLSKKHGGPKDEERHVGDLGNVTADKNGVAIVDIVDPLISLSGEYSIIGRTMVVHEK 2SOD
CCCCCCCCCCCCCCCCCCCCCCECCCCCCHEECCCCCCCCCECCEECEEEEEEEEEEECC BPS
CCCCCCCCCCCCCCCHHCECCCCCECCCCCCEEEEEEECCEEEECCCEEEEEEEEEEECC D_R
CCCCCCCCCCCCCCEEEEECCCCCCCCCCCCEEEEEECCCCCCCCCCEEEEEEEEEEECC DSC
CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEECCCCCCCCCCEEEECEEEEEECC GGR
CCCCCCCCCCCCCCHHEEECCCCCCCCCCCCEEEEEEECCEEECCCCEEEEEEEEEECCC GOR
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCEECCCCCCCCCCCCCCHHHHHHEECCC H_K
CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEEEEEEEEEECCCEEECCEEEEEEE K_S
CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCEEEEEECCCCECCCCCEEEEEEEEEEECC JOI
--------------------EEEEEE------EEEEEEE--------------EEEEE-- 2SOD
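A simple voting rule over several predictors can be sketched as a per-column majority vote. This is an illustrative assumption about how a vote could work, not a description of how the JOI row was actually produced. The function name `consensus` and the coil tie-break are hypothetical choices.

```python
from collections import Counter


def consensus(predictions, tie_break="C"):
    """Per-position majority vote over equally long prediction strings.

    When two or more states share the top count, `tie_break` is used.
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        counts = Counter(column).most_common()
        best_state, best_count = counts[0]
        tied = [state for state, count in counts if count == best_count]
        result.append(best_state if len(tied) == 1 else tie_break)
    return "".join(result)


# Toy predictions from three hypothetical methods.
voted = consensus(["CCEEH", "CEEEH", "CCEEC"])
print(voted)
```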



Prediction Accuracy (EVA)

EVA: Automatic evaluation of prediction servers


Currently ~76%

Proteins with more than 100 homologues: 80%



Assignment is ambiguous (5-15%). Recall DSSP vs STRIDE.
-- Non-unique protein structures (dynamic), H-bond cutoff, etc.

Different secondary structures between homologues (~12%).

Non-locality. Secondary structure is influenced by long-range interactions.
-- Some segments can have multiple structure types (chameleon sequences).


How Far Can We Go?


Conceptually similar problem to SS prediction: Buried vs. Exposed.

Weighted Ensemble Solvent Accessibility predictor: http://pipe.scs.fsu.edu/wesa.html


Solvent accessibility

[Figure: residues labeled E (exposed) and B (buried)]


To provide structural context for putative mutations that one wants to
characterize biochemically or biophysically.

Why bother?


Again, conceptually similar problem to SS prediction: TM vs. Not.

Transmembrane Segment Prediction