CAP5510 - Bioinformatics - CISE - University of Florida


1

CAP5510


Bioinformatics

Protein Structures

Tamer Kahveci

CISE Department

University of Florida

2

What and Why?


Proteins fold into a three-dimensional shape

Structure can reveal functional information that we cannot find from sequence

Misfolded proteins can cause diseases

Sickle cell anemia, mad cow disease

Used in drug design

[Figures: hemoglobin; normal vs. sickled blood cells (E → V substitution); HIV protease inhibitor]

3

Goals


Understand protein structures


Primary, secondary, tertiary


Learn how protein shapes are determined and predicted


Structure comparison (?)

4

A Protein Sequence

>gi|22330039|ref|NP_683383.1| unknown protein; protein id: At1g45196.1 [Arabidopsis thaliana]

MPSESSYKVHRPAKSGGSRRDSSPDSIIFTPESNLSLFSSASVSVDRCSSTSDAHDRDDSLISAWKEEFEVKKDDESQNL

DSARSSFSVALRECQERRSRSEALAKKLDYQRTVSLDLSNVTSTSPRVVNVKRASVSTNKSSVFPSPGTPTYLHSMQKGW

SSERVPLRSNGGRSPPNAGFLPLYSGRTVPSKWEDAERWIVSPLAKEGAARTSFGASHERRPKAKSGPLGPPGFAYYSLY

SPAVPMVHGGNMGGLTASSPFSAGVLPETVSSRGSTTAAFPQRIDPSMARSVSIHGCSETLASSSQDDIHESMKDAATDA

QAVSRRDMATQMSPEGSIRFSPERQCSFSPSSPSPLPISELLNAHSNRAEVKDLQVDEKVTVTRWSKKHRGLYHGNGSKM

5


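Basic Amino Acid Structure:

The side chain, R, varies for each of the 20 amino acids

[Figure: a generic amino acid. A central carbon carries an H and the side chain R, and is bonded to the amino group (H2N) on one side and the carboxyl group (COOH) on the other]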

Amino Acid Composition

6

The Peptide Bond


Dehydration synthesis

Repeating backbone: N-Cα-C-N-Cα-C

Convention: start at the amino terminus and proceed to the carboxy terminus

7

Peptidyl polymers


A few amino acids in a chain are called a polypeptide. A protein is usually composed of 50 to 400+ amino acids.

We call the units of a protein amino acid residues.

[Figure labels: carbonyl carbon, amide nitrogen]

8

Side chain properties


Carbon does not make hydrogen bonds with water easily: hydrophobic

O and N are generally more likely than C to h-bond to water: hydrophilic

We group the amino acids into three general groups:

Hydrophobic

Charged (positive/basic & negative/acidic)

Polar

9

The Hydrophobic Amino Acids

10

The Charged Amino Acids

11

The Polar Amino Acids

12

More Polar Amino Acids

And then there’s…

13

Phi (φ): the angle of rotation about the N-Cα bond.

Psi (ψ): the angle of rotation about the Cα-C bond.

The planar bond angles and bond lengths are fixed.

Planarity of the Peptide Bond
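
A minimal sketch of how a torsion angle such as φ or ψ can be computed from four atom positions. The function name `dihedral` and the tuple-based coordinates are assumptions for illustration; for residue i, φ would use the atoms C(i-1), N(i), Cα(i), C(i), and ψ would use N(i), Cα(i), C(i), N(i+1).

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond (cis = 0, trans = 180)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

Because the peptide bond is planar and bond lengths and angles are fixed, a list of (φ, ψ) pairs computed this way is enough to describe the backbone conformation.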

14


Primary structure = the linear sequence of amino acids comprising a protein:

AGVGTVPMTAYGNDIQYYGQVT…

Secondary structure

Regular patterns of hydrogen bonding in proteins result in two patterns that emerge in nearly every protein structure known: the α-helix and the β-sheet

The location and direction of these periodic, repeating structures is known as the secondary structure of the protein

Primary & Secondary Structure

15











φ ≈ ψ ≈ −60°

The Alpha Helix

16

Properties of the Alpha Helix











φ ≈ ψ ≈ −60°

Hydrogen bonds between C=O of residue n and NH of residue n+4

3.6 residues/turn

1.5 Å/residue rise

100°/residue turn
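
The numbers above are related by simple arithmetic: 360° divided by 3.6 residues per turn gives the 100° rotation per residue, and 3.6 residues times 1.5 Å gives the rise per full turn. A small sketch, where the function name `helix_geometry` is only an illustration:

```python
RESIDUES_PER_TURN = 3.6   # from the slide
RISE_PER_RESIDUE = 1.5    # in angstroms, from the slide

def helix_geometry(n_residues):
    """Return (degrees per residue, pitch in angstroms, helix length in angstroms)."""
    turn_per_residue = 360.0 / RESIDUES_PER_TURN   # about 100 degrees
    pitch = RESIDUES_PER_TURN * RISE_PER_RESIDUE   # about 5.4 angstroms per turn
    length = n_residues * RISE_PER_RESIDUE
    return turn_per_residue, pitch, length
```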

17

Properties of α-helices

4-40+ residues in length

Often amphipathic or “dual-natured”

Half hydrophobic and half hydrophilic

If we examine many α-helices, we find trends…

Helix formers: Ala, Glu, Leu, Met

Helix breakers: Pro, Gly, Tyr, Ser

18

The beta strand (& sheet)







φ ≈ −135°, ψ ≈ +135°

19

Properties of beta sheets


Formed of stretches of 5-10 residues in extended conformation

Parallel/antiparallel, contiguous/non-contiguous

20

Anti-Parallel Beta Sheets

21

Parallel Beta Sheets

22

Mixed Beta Sheets

23

Turns and Loops


Secondary structure elements are connected by regions of turns and loops

Turns: short regions of non-α, non-β conformation

Loops: larger stretches with no secondary structure.

Loop sequences vary much more than those of secondary structure regions

24

Ramachandran Plot

25

Levels of Protein Structure

Secondary structure elements combine to form tertiary structure

Quaternary structure occurs in multienzyme complexes

26

Protein Structure Example

Beta Sheet

Helix

Loop

ID: 12as

2 chains

27

Wireframe

Ball and stick

Views of a Protein

28

Views of a protein

Spacefill

Cartoon

CPK colors

Carbon = green, black, or grey

Nitrogen = blue

Oxygen = red

Sulfur = yellow

Hydrogen = white

29

Common Protein Motifs

30


Four helical bundle:


Globin domain:

Mostly Helical Folding Motifs

31



α/β barrel:

α/β Motifs

32

Open Twisted Beta Sheets

33

Beta Barrels

34

Determining the Structure of a Protein

Experimental Methods

X-ray

NMR

35

X-Ray Crystallography

Crystals diffract X-rays in regular patterns (Max von Laue, 1912)

Discovery of X-rays (Wilhelm Conrad Röntgen, 1895)

The first X-ray diffraction pattern from a protein crystal (Dorothy Hodgkin, 1934)

Today, the structures of ~15,000 proteins have been determined

36

X-Ray Crystallography

Grow millions of protein crystals

Takes months

Expose to radiation beam

Analyze the image with a computer

Average over many copies of images

PDB

Not all proteins can be crystallized!

37

NMR


Nuclear Magnetic Resonance


Nuclei of atoms vibrate when exposed to an oscillating magnetic field

Detect vibrations with external sensors

Compute inter-atomic distances.

Requires complex analysis. NMR can be used for short sequences (<200 residues)

More than one model can be derived from NMR.

38

Determining the Structure of a Protein

Computational Methods

39

The Protein Folding Problem


Central question of molecular biology:

“Given a particular sequence of amino acid residues (primary structure), what will the secondary/tertiary/quaternary structure of the resulting protein be?”

Input: AAVIKYGCAL…

Output: φ1 ψ1, φ2 ψ2, …

40

Structure vs. Sequence

Observation: A protein with the same sequence (under the same circumstances) yields the same shape.

A protein folds into a shape that minimizes the energy needed to stay in that shape.

A protein folds in ~10-15 seconds.

41

Secondary Structure Prediction

42

Chou-Fasman methods

Uses statistically obtained Chou-Fasman parameters.

Each amino acid has:

P(a): alpha

P(b): beta

P(t): turn

f(): additional turn parameter.

43

Chou-Fasman Parameters

44

C.-F. Alpha Helix Prediction (1)

Find P(a) for all letters

Find 6 contiguous letters of which at least 4 have P(a) > 100

Declare these regions as alpha helix

Letter  A    E    A    T    T    L    C    M    Q    S    T    Y    C    Y    V
P(a)    142  151  142  83   83   121  70   145  111  77   83   69   70   69   106
P(b)    83   37   83   119  119  130  119  105  110  75   119  147  119  147  170

45

C.-F. Alpha Helix Prediction (2)

Extend in both directions until 4 consecutive letters with P(a) < 100 are found

Letter  A    E    A    T    T    L    C    M    Q    S    T    Y    C    Y    V
P(a)    142  151  142  83   83   121  70   145  111  77   83   69   70   69   106
P(b)    83   37   83   119  119  130  119  105  110  75   119  147  119  147  170

46

C.-F. Alpha Helix Prediction (3)

Find the sum of P(a) (Sa) and the sum of P(b) (Sb) in the extended region

If the region is long enough (>= 5 letters) and Sa > Sb, then declare the extended region as alpha helix

Letter  A    E    A    T    T    L    C    M    Q    S    T    Y    C    Y    V
P(a)    142  151  142  83   83   121  70   145  111  77   83   69   70   69   106
P(b)    83   37   83   119  119  130  119  105  110  75   119  147  119  147  170
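
The three helix-prediction steps can be sketched in Python as follows. This is one possible reading of the slides, not a reference implementation: the function name `predict_helices` is hypothetical, the parameter tables hold only the residues of the example sequence, and the stopping rule at the ends of the sequence (extend to the end when fewer than 4 letters remain) is an assumption.

```python
# P(a) and P(b) for the residues that occur in the example sequence only.
P_A = {"A": 142, "E": 151, "T": 83, "L": 121, "C": 70,
       "M": 145, "Q": 111, "S": 77, "Y": 69, "V": 106}
P_B = {"A": 83, "E": 37, "T": 119, "L": 130, "C": 119,
       "M": 105, "Q": 110, "S": 75, "Y": 147, "V": 170}

def predict_helices(seq, window=6, min_hits=4, min_len=5):
    """Return sorted (start, end) half-open regions predicted as alpha helix."""
    n = len(seq)
    pa = [P_A[r] for r in seq]
    pb = [P_B[r] for r in seq]
    regions = set()
    for nucleus in range(n - window + 1):
        s, e = nucleus, nucleus + window
        # Step 1: nucleation, at least 4 of 6 letters with P(a) > 100.
        if sum(p > 100 for p in pa[s:e]) < min_hits:
            continue
        # Step 2: extend right, then left, until the next 4 letters all have P(a) < 100.
        while e < n and not (e + 4 <= n and all(p < 100 for p in pa[e:e + 4])):
            e += 1
        while s > 0 and not (s >= 4 and all(p < 100 for p in pa[s - 4:s])):
            s -= 1
        # Step 3: keep the region if it is long enough and Sa > Sb.
        if e - s >= min_len and sum(pa[s:e]) > sum(pb[s:e]):
            regions.add((s, e))
    return sorted(regions)
```

Under these assumptions, the example sequence AEATTLCMQSTYCYV yields a single helix covering AEATTLCMQ, with Sa = 1048 and Sb = 905.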

47

C.-F. Beta Sheet Prediction

Same as alpha helix, with P(a) replaced by P(b)

Resolving overlapping alpha helix & beta sheet

Compute sum of P(a) (Sa) and sum of P(b) (Sb) in the overlap.

If Sa > Sb => alpha helix

If Sb > Sa => beta sheet
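
A short sketch of the overlap rule, assuming regions are given as half-open (start, end) index pairs. The function name `resolve_overlap` and the "undecided" outcome for a tie are assumptions; the slide does not say what happens when Sa equals Sb.

```python
# P(a) and P(b) for the residues of the slide's example sequence only.
P_A = {"A": 142, "E": 151, "T": 83, "L": 121, "C": 70,
       "M": 145, "Q": 111, "S": 77, "Y": 69, "V": 106}
P_B = {"A": 83, "E": 37, "T": 119, "L": 130, "C": 119,
       "M": 105, "Q": 110, "S": 75, "Y": 147, "V": 170}

def resolve_overlap(seq, helix, sheet):
    """Decide how the overlap of a helix region and a sheet region is labeled."""
    start, end = max(helix[0], sheet[0]), min(helix[1], sheet[1])
    if start >= end:
        return None  # the two regions do not overlap
    sa = sum(P_A[r] for r in seq[start:end])
    sb = sum(P_B[r] for r in seq[start:end])
    if sa > sb:
        return "alpha helix"
    if sb > sa:
        return "beta sheet"
    return "undecided"  # tie: not covered by the slide
```

For instance, with a hypothetical sheet region (5, 15) overlapping a helix region (0, 9) on AEATTLCMQSTYCYV, the overlap LCMQ has Sa = 447 and Sb = 464, so it would be labeled beta sheet.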

48

C.-F. Turn Prediction

An amino acid at position i is predicted as turn if all of the following hold:

f(i)*f(i+1)*f(i+2)*f(i+3) > 0.000075

Avg(P(t)) > 100 over positions i+k, for k=0, 1, 2, 3

Sum(P(t)) > Sum(P(a)) and Sum(P(t)) > Sum(P(b)) over positions i+k, for k=0, 1, 2, 3

Letter  A    E    A    T    T    L    C    M    Q    S    T    Y    C    Y    V
P(a)    142  151  142  83   83   121  70   145  111  77   83   69   70   69   106
P(b)    83   37   83   119  119  130  119  105  110  75   119  147  119  147  170
P(t)    66   74   66   96   96   59   119  60   98   143  96   114  119  114  50

[Figure: the table also has an f() row and marks a window i, i+1, i+2, i+3]
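
The three turn conditions can be checked with a few lines of Python. This is a sketch only: the function name `is_turn` is hypothetical, and because the f() values are not given in the text, any f values passed in (such as the constant 0.1 mentioned below) are placeholders rather than real Chou-Fasman parameters.

```python
def is_turn(i, f, p_a, p_b, p_t, threshold=0.000075):
    """Check the three turn conditions for the window i, i+1, i+2, i+3."""
    if i < 0 or i + 4 > len(p_t):
        return False  # the window does not fit in the sequence
    f_product = 1.0
    for k in range(i, i + 4):
        f_product *= f[k]
    sum_a = sum(p_a[i:i + 4])
    sum_b = sum(p_b[i:i + 4])
    sum_t = sum(p_t[i:i + 4])
    return (f_product > threshold
            and sum_t / 4 > 100
            and sum_t > sum_a
            and sum_t > sum_b)
```

With the table values for AEATTLCMQSTYCYV, the window starting at S (index 9) has Sum(P(t)) = 472, Sum(P(a)) = 299 and Sum(P(b)) = 460, so it passes the last two conditions; whether it is called a turn then depends on the f() product, for example a placeholder f of 0.1 at every position gives a product of about 0.0001.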

49

Other Methods for SSE Prediction


Similarity searching: Predator

Markov chain

Neural networks: PHD

~65% to 80% accuracy

50

Tertiary Structure Prediction

51

Forces driving protein folding


It is believed that hydrophobic collapse is a key driving force for protein folding

Hydrophobic core

Polar surface interacting with solvent

Minimum volume (no cavities)

Disulfide bond formation stabilizes the fold

Hydrogen bonds

Polar and electrostatic interactions

52


Simple lattice models (HP-models or Hydrophobic-Polar models)

Two types of residues: hydrophobic and polar

2-D or 3-D lattice

The only force is hydrophobic collapse

Score = number of H-H contacts

Fold Optimization

53

Scoring Lattice Models


H/P model scoring: count noncovalent hydrophobic interactions.


Sometimes:


Penalize for buried polar or surface hydrophobic residues
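
A minimal sketch of the basic H/P score on a 2-D square lattice, assuming a fold is given as a list of (x, y) lattice points, one per residue. The function name `hp_score` is hypothetical, and the optional penalties mentioned above are not included.

```python
def hp_score(sequence, coords):
    """Count H-H lattice contacts between residues that are not chain neighbors."""
    if len(sequence) != len(coords):
        raise ValueError("need one lattice point per residue")
    if len(set(coords)) != len(coords):
        raise ValueError("bump: two residues share a lattice point")
    index_at = {xy: i for i, xy in enumerate(coords)}
    score = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each contact is counted once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                score += 1
    return score
```

For example, HPPH folded into a unit square brings the two H residues next to each other without a covalent bond between them, which scores one contact.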

54


For smaller polypeptides, exhaustive search can be used

Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process

For larger chains, other optimization and search methods must be used

Greedy, branch and bound

Evolutionary computing, simulated annealing

Can we use lattice models?
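
For short chains, exhaustive search can be sketched as a depth-first enumeration of all self-avoiding walks on the 2-D lattice, keeping the fold with the most H-H contacts. The names `best_fold` and `contacts` are hypothetical; the running time grows roughly as 3^n, so this is only practical for small n.

```python
def contacts(sequence, path):
    """Number of H-H contacts between residues that are not chain neighbors."""
    index_at = {xy: i for i, xy in enumerate(path)}
    total = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "H":
            continue
        for neighbor in ((x + 1, y), (x, y + 1)):  # count each pair once
            j = index_at.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                total += 1
    return total

def best_fold(sequence):
    """Return (best score, one fold achieving it) by trying every self-avoiding walk."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))
    best = [-1, None]

    def extend(path, occupied):
        if len(path) == len(sequence):
            score = contacts(sequence, path)
            if score > best[0]:
                best[0], best[1] = score, list(path)
            return
        x, y = path[-1]
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if nxt in occupied:
                continue  # skip bumps
            path.append(nxt)
            occupied.add(nxt)
            extend(path, occupied)
            path.pop()
            occupied.remove(nxt)

    extend([(0, 0)], {(0, 0)})
    return best[0], best[1]
```

On a square lattice only residues an odd number of positions apart can touch, so a chain such as HPHPH can never score above zero, while HPPH reaches one contact.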

55

The “hydrophobic zipper” effect

Ken Dill ~ 1997

56

Representing a lattice model


Absolute directions

UURRDLDRRU

Relative directions

LFRFRRLLFFL

Advantage: immediate reversals, which UD or RL would encode in absolute directions, cannot occur

Only three directions: LRF

What about bumps? LFRRR

Bad score

Use a better representation
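
Both encodings can be decoded into lattice coordinates and checked for bumps with a short sketch. The function names and the initial heading "up" for relative directions are assumptions; bump detection does not depend on the heading chosen.

```python
ABSOLUTE = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def walk_absolute(moves):
    """Decode a string of U/D/L/R steps into lattice points starting at the origin."""
    pos = (0, 0)
    path = [pos]
    for move in moves:
        dx, dy = ABSOLUTE[move]
        pos = (pos[0] + dx, pos[1] + dy)
        path.append(pos)
    return path

def walk_relative(moves, heading=(0, 1)):
    """Decode a string of L/R/F turns, each followed by one step forward."""
    dx, dy = heading
    pos = (0, 0)
    path = [pos]
    for move in moves:
        if move == "L":
            dx, dy = -dy, dx      # rotate 90 degrees counter-clockwise
        elif move == "R":
            dx, dy = dy, -dx      # rotate 90 degrees clockwise
        elif move != "F":
            raise ValueError("unknown move: " + move)
        pos = (pos[0] + dx, pos[1] + dy)
        path.append(pos)
    return path

def has_bump(path):
    """True if the walk visits any lattice point more than once."""
    return len(set(path)) != len(path)
```

With this decoding, UURRDLDRRU and LFRFRRLLFFL are both self-avoiding, while LFRRR walks back onto a point it has already visited.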

57

Preference-order representation

Each position has two “preferences”

If it can’t have either of the two, it will take the “least favorite” path if possible

Example: {LR},{FL},{RL},{FR},{RL},{RL},{FL},{RF}

Can still cause bumps: {LF},{FR},{RL},{FL},{RL},{FL},{RF},{RL},{FL}
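
One way to read the preference-order representation is sketched below. The function name `walk_preferences`, the convention that the first letter of each pair is the most preferred, and the initial heading are all assumptions. The function returns None when all three directions are blocked.

```python
TURN = {
    "L": lambda dx, dy: (-dy, dx),   # rotate counter-clockwise
    "R": lambda dx, dy: (dy, -dx),   # rotate clockwise
    "F": lambda dx, dy: (dx, dy),    # keep going straight
}

def walk_preferences(prefs, heading=(0, 1)):
    """Decode (first, second) preference pairs into a self-avoiding walk, or None."""
    pos = (0, 0)
    path = [pos]
    occupied = {pos}
    for first, second in prefs:
        least = ({"L", "R", "F"} - {first, second}).pop()  # the "least favorite" move
        for choice in (first, second, least):
            new_heading = TURN[choice](*heading)
            nxt = (pos[0] + new_heading[0], pos[1] + new_heading[1])
            if nxt not in occupied:
                heading, pos = new_heading, nxt
                path.append(pos)
                occupied.add(pos)
                break
        else:
            return None  # every direction is blocked: the bump cannot be avoided
    return path
```

Under this reading, the first example list decodes to a valid nine-point walk, while the second list ends in a position where all three neighboring points are already occupied.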

58

More realistic models


Higher resolution lattices (45° lattice, etc.)

Off-lattice models

Local moves

Optimization/search methods and φ/ψ representations

Greedy search

Branch and bound

EC, Monte Carlo, simulated annealing, etc.

59

How to Evaluate the Result?


Now that we have a more realistic off-lattice model, we need a better energy function to evaluate a conformation (fold).

Theoretical force field:

$\Delta G = \Delta G_{\text{van der Waals}} + \Delta G_{\text{h-bonds}} + \Delta G_{\text{solvent}} + \Delta G_{\text{coulomb}}$

Empirical force fields

Start with a database

Look at neighboring residues: similar to known protein folds?

60

Comparative Modeling

1. Identify similar protein sequences from a database of known proteins (BLAST)

2. Find conserved regions by aligning these proteins (CLUSTAL-W)

3. Predict the backbone (alpha helices and beta sheets) from conserved regions

4. Predict loops

5. Predict side chain positions

6. Evaluate

61

Threading: Fold recognition


Given:

Sequence: IVACIVSTEYDVMKAAR…

A database of molecular coordinates

Map the sequence onto each fold

Evaluate

Objective 1: improve scoring function

Objective 2: folding

62

Folding: still a hard problem

Levinthal’s paradox

Consider a 100-residue protein. If each residue can take only 3 positions, there are $3^{100} \approx 5 \times 10^{47}$ possible conformations.

If it takes $10^{-13}$ s to convert from one structure to another, exhaustive search would take $1.6 \times 10^{27}$ years.
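
The arithmetic behind these figures can be reproduced directly. In this sketch the function name `levinthal_years` is hypothetical, and a year is taken as 365.25 days, which is an assumption the slide does not state.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length

def levinthal_years(n_residues=100, states_per_residue=3, seconds_per_step=1e-13):
    """Return (number of conformations, years needed to visit all of them)."""
    conformations = states_per_residue ** n_residues
    years = conformations * seconds_per_step / SECONDS_PER_YEAR
    return conformations, years
```

With the default arguments this gives about 5.2 × 10^47 conformations and about 1.6 × 10^27 years, in line with the figures quoted above.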

63

Protein Classification


Class: similar secondary structure properties

All alpha, all beta, alpha/beta, alpha+beta

Fold: major secondary structure similarity.

Globin-like (6 helices, folded leaf, partly opened)

Superfamily: distant homologs. 25-30% sequence identity.

Family: close homologs. Evolved from the same ancestor. High identity.