Bioinformatics Research and Resources at the University of ...


2 Oct 2013


JM - http://folding.chmcc.org

Introduction to Bioinformatics: Lecture XI

Computational Protein Structure Prediction



Jarek Meller


Division of Biomedical Informatics,

Children’s Hospital Research Foundation

& Department of Biomedical Engineering, UC


Outline of the lecture




Protein structure and complexity of conformational search: from similarity-based methods to de novo structure prediction

Multiple sequence alignment and family profiles

Secondary structure and solvent accessibility prediction

Matching sequences with known structures: threading and fold recognition

Ab initio folding simulations



Polypeptide chains: backbone and side-chains

[Figure: polypeptide chain with the N-terminus (N-ter) and C-terminus (C-ter) labeled]


Distinct chemical nature of amino acid side-chains

[Figure: polypeptide chain with side-chains ARG, PHE, GLU, VAL and CYS; N-ter and C-ter labeled]


Hydrogen bonds and secondary structures

[Figure: α-helix and β-strand]


Tertiary structure and long range contacts: annexin


Quaternary structure and protein-protein interactions: annexin hexamer


Domains, interactions, complexes:
cyclin D and Cdk

Cyclin Box


Domains, interactions, complexes: VHL


Protein folding problem


The protein folding problem consists of predicting the three-dimensional structure of a protein from its amino acid sequence

Hierarchical organization of protein structures helps to break the problem into secondary structure, tertiary structure and protein-protein interaction predictions

Computational approaches for protein structure prediction: similarity-based and de novo methods


Polypeptide chains: backbone and rotational degrees of freedom

        H     O         R2
        |     ||        |
NH3+ -- Ca -- C -- N -- Ca -- C -- O-
        |          |    |     \\
        R1         H    H       O

The equilibrium length of the peptide bond (C--N) is about 1.3 [Ang].

The average Ca-Ca distance in a polypeptide chain is about 3.8 [Ang].

The angle of rotation around the N-Ca bond is called φ, and the angle around the Ca-C bond is called ψ.

These two angles define the overall conformation of polypeptide chains.

Simplifying, there are three discrete states (rotations) for each of these single bonds, implying 9^N possible backbone conformations for a chain of N residues.
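To get a feeling for how quickly 9^N grows, the count can be evaluated for a few chain lengths. This is only a sketch of the arithmetic above: the function names and the enumeration rate of 10^12 conformations per second are assumptions made for illustration.

```python
# Three discrete states for each of the two backbone torsions per residue,
# as stated on the slide, give 3**2 = 9 states per residue.
STATES_PER_ANGLE = 3
ANGLES_PER_RESIDUE = 2


def backbone_conformations(n_residues):
    """Number of discrete backbone conformations, 9**N."""
    return (STATES_PER_ANGLE ** ANGLES_PER_RESIDUE) ** n_residues


def enumeration_time_years(n_residues, rate_per_second=1e12):
    """Years needed to enumerate all conformations at an assumed rate."""
    seconds = backbone_conformations(n_residues) / rate_per_second
    return seconds / (3600 * 24 * 365)


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, f"{backbone_conformations(n):.3e}",
              f"{enumeration_time_years(n):.3e} years")
```

Even with this crude three-state model, exhaustive enumeration is out of reach for chains of realistic length.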




Scoring alternative conformations with
empirical force fields (folding potentials)

[Figure: energy (E) of misfolded versus native conformations]

Ideally, each misfolded structure should have an energy higher than the native energy, i.e.:

$E_{\mathrm{misfolded}} - E_{\mathrm{native}} > 0$
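The requirement can be checked directly for a set of alternative (decoy) structures once their energies are known. The sketch below assumes the energies have already been computed by some force field; the function names and the numbers in the usage example are made up for illustration.

```python
def energy_gaps(native_energy, decoy_energies):
    """Return E_misfolded - E_native for every misfolded (decoy) structure."""
    return [e - native_energy for e in decoy_energies]


def native_is_lowest(native_energy, decoy_energies):
    """True if every decoy has a strictly higher energy than the native."""
    return all(gap > 0 for gap in energy_gaps(native_energy, decoy_energies))


if __name__ == "__main__":
    # Illustrative values only, in arbitrary energy units.
    print(energy_gaps(-10.0, [-8.5, -3.0]))
    print(native_is_lowest(-10.0, [-8.5, -3.0]))
```

A force field for which `native_is_lowest` fails on known structures cannot reliably identify the native state, whatever the quality of the search.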


Ab initio (or de novo) folding simulations

When dealing with a new fold, the similarity-based methods cannot be applied

Ab initio folding simulations consist of conformational search with an empirical scoring function (“force field”) to be maximized (or minimized)

Computational bottleneck: exponential search space and sampling problem (global optimization!)

Fundamental problem: inaccuracy of empirical force fields

Importance of mixed protocols, such as Rosetta by D. Baker and colleagues (more when Monte Carlo protocols for global optimization are introduced)
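A conformational search of this kind can be sketched as a Metropolis Monte Carlo walk over discrete torsion states. This is a toy illustration, not Rosetta or any published protocol: `toy_energy` is a hypothetical stand-in for a force field (it simply counts torsions that differ from a reference state), and the step count, temperature and seed are arbitrary assumptions.

```python
import math
import random


def toy_energy(conf, reference):
    """Hypothetical score: number of torsions that differ from a reference."""
    return sum(1 for a, b in zip(conf, reference) if a != b)


def metropolis_search(reference, n_steps=5000, temperature=0.5, seed=0):
    """Minimize toy_energy over conformations with 3 states per torsion."""
    rng = random.Random(seed)
    conf = [rng.randrange(3) for _ in reference]
    energy = toy_energy(conf, reference)
    best_conf, best_energy = list(conf), energy
    for _ in range(n_steps):
        # Propose a move: change one randomly chosen torsion to another state.
        i = rng.randrange(len(conf))
        old = conf[i]
        conf[i] = rng.choice([s for s in range(3) if s != old])
        new_energy = toy_energy(conf, reference)
        delta = new_energy - energy
        # Metropolis criterion: always accept downhill, sometimes uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
            if energy < best_energy:
                best_conf, best_energy = list(conf), energy
        else:
            conf[i] = old  # reject the move and restore the previous state
    return best_conf, best_energy


if __name__ == "__main__":
    reference = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
    print(metropolis_search(reference))
```

The toy landscape has a single smooth funnel, so the search succeeds easily; real force fields are rugged and inaccurate, which is exactly the bottleneck named above.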


Similarity-based approaches to structure prediction: from sequence alignment to fold recognition

High level of redundancy in biology: sequence similarity is often sufficient to use the “guilt by association” rule: if similar sequence, then similar structure and function

Multiple alignments and family profiles can detect evolutionary relatedness at much lower sequence similarity, which is hard to detect with pairwise sequence alignments: PSI-BLAST by S. Altschul et al.

For sufficiently close proteins one may superimpose the backbones using sequence alignment and then perform conformational search (with the backbone fixed) to find the optimal geometry of the side-chains (according to an atomistic empirical force field): homology modeling (e.g. Modeller by A. Sali et al.)

Many structures are already known (see PDB) and one can match sequences directly with structures to enhance structure recognition: fold recognition

For both fold recognition and de novo simulation, prediction of intermediate attributes such as secondary structure or solvent accessibility helps to achieve better sensitivity and specificity



Protein families and domains

PFAM (7246 families as of April 2004): http://www.sanger.ac.uk/Software/Pfam/

PRODOM: http://prodes.toulouse.inra.fr/prodom/current/html/home.php

CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi
Check: pfam00134.11, Cyclin_N

The notion of protein family is derived from evolutionary considerations: members of the same family are related, perform the same function and are assumed to have diverged from the same ancestor.

The notion of domain is derived from structural considerations: “A domain is defined as an autonomous structural unit, or a reusable sequence unit that may be found in multiple protein contexts”, Bateman et al.



Multiple alignment and PSSM


Multiple alignment, clustering and families


DP search gives the optimal solution but scales exponentially with the number of sequences K, O(n^K) for sequences of length n, which is not practical for more than 3 or 4 sequences.

Standard heuristics start from pairwise alignments (e.g. PSI-BLAST, ClustalW)

Hidden Markov Model approach to family profiles (profile HMM) as an alternative with pre-fixed parameters, trained separately for each family. Some initial multiple alignments are necessary for training (next lecture).
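The exponential scaling of exact multiple alignment can be made concrete by counting the cells of the K-dimensional DP lattice. The sketch below is only a cost estimate, not an aligner; the function names and the sequence length of 300 are assumptions for illustration.

```python
def dp_cells(seq_length, n_sequences):
    """Cells in the K-dimensional DP lattice for K sequences of length n."""
    return (seq_length + 1) ** n_sequences


def dp_predecessors(n_sequences):
    """Predecessor cells examined per lattice cell: every non-empty subset
    of sequences may advance by one position, giving 2**K - 1 options."""
    return 2 ** n_sequences - 1


if __name__ == "__main__":
    n = 300  # assumed typical protein length
    for k in (2, 3, 4, 5):
        work = dp_cells(n, k) * dp_predecessors(k)
        print(f"K={k}: cells={dp_cells(n, k):.3e}, cell updates={work:.3e}")
```

Memory alone rules out the exact approach beyond a handful of sequences, which is why progressive heuristics and profile HMMs are used in practice.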


Predicting 1D protein profiles from sequences:
secondary structures and solvent accessibility

SABLE server: http://sable.cchmc.org

POLYVIEW server: http://polyview.cchmc.org

a) Multiple alignment and family profiles improve prediction of local structural propensities.

b) Use of advanced machine learning techniques, such as Neural Networks or Support Vector Machines, improves results as well.

B. Rost and C. Sander were the first to achieve more than 70% accuracy in three-state (H, E, C) classification, applying a) and b).





Predicting transmembrane domains


“Hydropathy” profiles and membrane domains prediction

Problem

Design a simple algorithm for finding putative trans-membrane regions based on “hydropathy” (or hydrophobicity) profiles. Consider an extension based on prototypes and k-NN.
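One possible starting point for this problem is a sliding-window average over a hydropathy scale. The sketch below uses the Kyte-Doolittle scale with a window of 19 residues and a threshold of 1.6; these choices, like the function names, are assumptions rather than part of the lecture, and the k-NN extension is not covered.

```python
# Kyte-Doolittle hydropathy scale (one-letter amino acid codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Average hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def tm_segments(seq, window=19, threshold=1.6):
    """Runs of window centres whose average hydropathy reaches the threshold.

    Returns (first, last) residue indices, 0-based and inclusive.
    """
    half = window // 2
    profile = hydropathy_profile(seq, window)
    segments, start = [], None
    for i, value in enumerate(profile):
        if value >= threshold and start is None:
            start = i  # a hydrophobic run begins at this window
        elif value < threshold and start is not None:
            segments.append((start + half, i - 1 + half))
            start = None
    if start is not None:  # run extends to the last window
        segments.append((start + half, len(profile) - 1 + half))
    return segments


if __name__ == "__main__":
    toy = "D" * 10 + "L" * 25 + "K" * 10  # artificial test sequence
    print(tm_segments(toy))
```

Reporting window centres rather than whole windows keeps the predicted segment inside the hydrophobic stretch; the window length and threshold would need tuning on proteins with known membrane topology.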



Predicting transmembrane domains


Going beyond sequence similarity:
threading and fold recognition

When sequence similarity is not detectable, use a library of known structures to match your query with target structures.

As in the case of de novo folding, one needs a scoring function that measures compatibility between sequences and structures.


Why “fold recognition”?



Divergent (common ancestor) vs. convergent (no common ancestor) evolution

PDB: virtually all proteins with more than 30% seq. identity have similar structures; however, most of the similar structures share only up to 10% seq. identity!

www.columbia.edu/~rost/Papers/1997_evolution/paper.html (B. Rost)

www.bioinfo.mbb.yale.edu/genome/foldfunc/ (H. Hegyi, M. Gerstein)


Simple contact model for protein structure prediction

Each amino acid is represented by a point in 3D space and two amino acids are
said to be in contact if their distance is smaller than a cutoff distance, e.g. 7 [Ang].
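The contact definition translates directly into a few lines of code. In this sketch each residue is given as one (x, y, z) point, for example its Ca position; the function name and the `min_separation` parameter, which optionally skips residues close along the chain, are assumptions added for illustration.

```python
import math


def contact_map(coords, cutoff=7.0, min_separation=1):
    """List pairs (i, j), i < j, of residues closer than the cutoff.

    coords: one (x, y, z) point per residue, in Angstroms.
    min_separation: smallest |i - j| considered (assumed parameter).
    """
    contacts = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.append((i, j))
    return contacts


if __name__ == "__main__":
    # Four residues on a straight line, 3.8 Angstroms apart.
    chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0),
             (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    print(contact_map(chain))
```

Consecutive residues are always within 7 [Ang] of each other, so in practice one would often set `min_separation` to 2 or more to keep only informative contacts.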




Sequence-to-structure matching with contact models

Generalized string matching problem: aligning a string of amino acids against a string of “structural sites” characterized by other residues in contact

Finding an optimal alignment with gaps using inter-residue pairwise models,

$E = \sum_{k<l} e_{kl}$,

is NP-hard because of the non-local character of scores at a given structural site (identity of the interaction partners may change depending on location of gaps in the alignment)

R.H. Lathrop, Protein Eng. 7 (1994)
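The non-local character of the pairwise score can be seen by evaluating E for two placements of the same sequence on the same structure. The sketch below is illustrative only: the two-letter H/P alphabet, the contact energy of -1 for an H-H pair, the contact list and all names are assumptions, not a published potential.

```python
def hp_pair_energy(a, b):
    """Hypothetical contact energy: -1 for an H-H pair, 0 otherwise."""
    return -1.0 if a == "H" and b == "H" else 0.0


def threading_energy(placement, contacts, pair_energy=hp_pair_energy):
    """E = sum over contacting site pairs (k, l) of e_kl.

    placement[k] is the residue type put on structural site k,
    or None if the alignment leaves that site unoccupied (a gap).
    """
    total = 0.0
    for k, l in contacts:
        a, b = placement[k], placement[l]
        if a is not None and b is not None:
            total += pair_energy(a, b)
    return total


if __name__ == "__main__":
    contacts = [(0, 3), (0, 4), (1, 5)]  # assumed template contacts
    # The same sequence HPPHH placed with the gap at the end or at the start.
    print(threading_energy(["H", "P", "P", "H", "H", None], contacts))
    print(threading_energy([None, "H", "P", "P", "H", "H"], contacts))
```

Moving a single gap changes which residues face each other across every contact, so the score of a site cannot be computed without knowing the rest of the alignment; this is what defeats ordinary dynamic programming.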


Hydrophobic contact model and sequence-to-structure alignment

[Figure: sequence of H and P residues, with a gap (-), aligned to structural sites]

Solutions to this further instance of the global optimization problem:

a) Heuristic (e.g. frozen environment approximation)

b) “Profile” or local scoring functions (folding potentials)
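The idea behind the frozen environment approximation can be sketched as follows: when scoring a query residue at a structural site, its contact partners are taken to be the template's own residues, so the score of each site becomes local and a standard dynamic programming alignment can use it. The H/P alphabet, the energy values and all names below are assumptions for illustration; the alignment step itself is omitted.

```python
def hp_pair_energy(a, b):
    """Hypothetical contact energy: -1 for an H-H pair, 0 otherwise."""
    return -1.0 if a == "H" and b == "H" else 0.0


def frozen_environment_profile(template_seq, contacts, alphabet="HP",
                               pair_energy=hp_pair_energy):
    """Score of each residue type at each site, with neighbours frozen
    to the template's own residues.

    Returns one dict per site mapping residue type -> local energy.
    """
    neighbours = [[] for _ in template_seq]
    for k, l in contacts:
        neighbours[k].append(l)
        neighbours[l].append(k)
    profile = []
    for site in range(len(template_seq)):
        profile.append({
            a: sum(pair_energy(a, template_seq[j]) for j in neighbours[site])
            for a in alphabet
        })
    return profile


if __name__ == "__main__":
    for row in frozen_environment_profile("HPHPP", [(0, 2), (1, 3)]):
        print(row)
```

The resulting site-by-residue table plays the same role as a sequence profile, at the price of ignoring how the query's own residues would change each site's environment.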


Using sequence similarity, predicted secondary structures
and contact potentials: fold recognition protocols

In practice, fold recognition methods are often mixtures of sequence matching and threading, e.g., with compatibility between a sequence and a structure measured by contact potentials, and with predicted secondary structures compared to the secondary structure of a template.

D. Fischer and D. Eisenberg, Curr. Opinion in Struct. Biol. 1999, 9: 208


Some fold recognition servers


PSI-BLAST (Altschul SF et al., Nucl. Acids Res. 25: 3389)

LiveBench evaluation (http://BioInfo.PL/LiveBench/1/):

1. FFAS (L. Rychlewski, L. Jaroszewski, W. Li, A. Godzik (2000), Protein Science 9: 232): seq. profile against profile

2. 3D-PSSM (Kelley LA, MacCallum RM, Sternberg MJE, JMB 299: 499): 1D-3D profile combined with secondary structures and solvation potential

3. GenTHREADER (Jones DT, JMB 287: 797): seq. profile combined with pairwise interactions and solvation potential

LOOPP: annotations of remote homologs

http://www.tc.cornell.edu/CBIO/loopp