CS790 – Introduction to Bioinformatics


2 Oct 2013


From Sequences to Structure

Illustrations from: C. Branden and J. Tooze, Introduction to Protein Structure, 2nd ed. Garland Pub. ISBN 0815302703

Protein Functions


Mechanoenzymes: myosin, actin


Rhodopsin: allows vision


Globins: transport oxygen


Antibodies: immune system


Enzymes: pepsin, renin, carboxypeptidase A


Receptors: transmembrane signaling


Vitellogenin: molecular velcro


And hundreds of thousands more…


Proteins are Chains of Amino Acids


Polymer: a molecule composed of repeating units

The Peptide Bond


Dehydration synthesis


Repeating backbone: N-Cα-C(=O)-N-Cα-C(=O)

Convention: start at amino terminus and proceed to carboxy terminus

Peptidyl polymers


A few amino acids in a chain are called a polypeptide. A protein is usually composed of 50 to 400+ amino acids.

Since part of the amino acid is lost during dehydration synthesis, we call the units of a protein amino acid residues.

[Figure labels: carbonyl carbon, amide nitrogen]

Side Chain Properties


Recall that the electronegativity of carbon is at about the middle of the scale for light elements

Carbon does not make hydrogen bonds with water easily: hydrophobic

O and N are generally more likely than C to H-bond to water: hydrophilic

We group the amino acids into three general groups:

Hydrophobic

Charged (positive/basic & negative/acidic)

Polar

The Hydrophobic Amino Acids

Proline severely limits allowable conformations!

The Charged Amino Acids


The Polar Amino Acids


More Polar Amino Acids


And then there’s…

Planarity of the Peptide Bond


Phi and psi




φ = ψ = 180° is the extended conformation

φ: Cα to N-H

ψ: C=O to Cα
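Phi and psi are dihedral (torsion) angles, so each one can be computed from the coordinates of four consecutive backbone atoms. The sketch below is a minimal illustration. The function name `dihedral` and the plain-tuple point format are assumptions made for this example, and the sign follows the usual convention in which the trans arrangement is 180°.

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond (range -180 to 180)."""
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1 = cross(b0, b1)  # normal to the plane through p0, p1, p2
    n2 = cross(b1, b2)  # normal to the plane through p1, p2, p3
    b1_len = math.sqrt(dot(b1, b1))
    x = dot(n1, n2)
    y = dot(b1, cross(n1, n2)) / b1_len
    return math.degrees(math.atan2(y, x))

# For residue i, with atoms given as (x, y, z) tuples:
#   phi_i = dihedral(C[i-1], N[i], CA[i], C[i])
#   psi_i = dihedral(N[i], CA[i], C[i], N[i+1])
```

Collecting (phi, psi) pairs for every residue of a structure and plotting them against each other gives a Ramachandran plot.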


OCCBIO 2006


Fundamental
Bioinformatics


The Ramachandran Plot


G. N. Ramachandran: first calculations of sterically allowed regions of phi and psi

Note the structural importance of glycine

[Figure panels: Observed (non-glycine), Observed (glycine), Calculated]

Primary and Secondary Structure


Primary structure = the linear sequence of amino acids comprising a protein:

AGVGTVPMTAYGNDIQYYGQVT…

Secondary structure

Regular patterns of hydrogen bonding in proteins result in two patterns that emerge in nearly every protein structure known: the α-helix and the β-sheet

The location and direction of these periodic, repeating structures is known as the secondary structure of the protein

The alpha Helix

φ ≈ ψ ≈ -60°

Properties of the alpha helix











φ ≈ ψ ≈ -60°


Hydrogen bonds between C=O of residue n, and NH of residue n+4

3.6 residues/turn

1.5 Å/residue rise

100°/residue turn
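The three helix numbers above are tied together by simple arithmetic: 360° / 3.6 residues = 100° per residue, and 3.6 residues × 1.5 Å gives the rise per full turn (the pitch). A small sketch of that arithmetic follows. The function names are illustrative, and `wheel_angles` is a simplified helical-wheel projection added only for this example.

```python
RESIDUES_PER_TURN = 3.6
RISE_PER_RESIDUE = 1.5    # angstroms along the helix axis
TURN_PER_RESIDUE = 100.0  # degrees of rotation per residue

def helix_pitch():
    """Rise along the axis for one full turn, in angstroms."""
    return RESIDUES_PER_TURN * RISE_PER_RESIDUE

def helix_length(n_residues):
    """Approximate axial length of an n-residue helix, in angstroms."""
    return n_residues * RISE_PER_RESIDUE

def wheel_angles(sequence):
    """Angular position of each residue when viewed down the helix axis."""
    return [(aa, (i * TURN_PER_RESIDUE) % 360.0) for i, aa in enumerate(sequence)]
```

In the wheel projection, residues i, i+3, i+4 and i+7 land at 0°, 300°, 40° and 340°, all on the same side of the helix. Hydrophobic residues spaced this way therefore form one face of the helix.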


Properties of α-helices

4 to 40+ residues in length

Often amphipathic or “dual-natured”

Half hydrophobic and half hydrophilic

Mostly when surface-exposed

If we examine many α-helices, we find trends…

Helix formers: Ala, Glu, Leu, Met

Helix breakers: Pro, Gly, Tyr, Ser


The beta Strand (and Sheet)

φ ≈ -135°, ψ ≈ +135°

Properties of beta sheets


Formed of stretches of 5-10 residues in extended conformation

Pleated: each Cα a bit above or below the previous

Parallel/antiparallel, contiguous/non-contiguous


Parallel and anti-parallel β-sheets

Anti-parallel is slightly energetically favored

[Figure panels: Anti-parallel, Parallel]

Turns and Loops


Secondary structure elements are connected by regions of turns and loops

Turns: short regions of non-α, non-β conformation

Loops: larger stretches with no secondary structure. Often disordered.

“Random coil”

Sequences vary much more than secondary structure regions

Levels of Protein Structure

Secondary structure elements combine to form tertiary structure

Quaternary structure occurs in multienzyme complexes

Many proteins are active only as homodimers, homotetramers, etc.

Disulfide Bonds


Two cysteines in close proximity can form a covalent bond

Disulfide bond, disulfide bridge, or dicysteine bond.

Significantly stabilizes tertiary structure.

Protein Structure Examples


Determining Protein Structure


There are ~100,000 distinct proteins in the human proteome.

3D structures have been determined for 14,000 proteins, from all organisms

Includes duplicates with different ligands bound, etc.

Coordinates are determined by X-ray crystallography

X-Ray diffraction

Image is averaged over:

Space (many copies)

Time (of the diffraction experiment)

Electron Density Maps


Resolution is dependent on the quality/regularity of the crystal

R-factor is a measure of “leftover” electron density

Solvent fitting

Refinement

The Protein Data Bank


http://www.rcsb.org/pdb/


ATOM 1 N ALA E 1 22.382 47.782 112.975 1.00 24.09 3APR 213

ATOM 2 CA ALA E 1 22.957 47.648 111.613 1.00 22.40 3APR 214

ATOM 3 C ALA E 1 23.572 46.251 111.545 1.00 21.32 3APR 215

ATOM 4 O ALA E 1 23.948 45.688 112.603 1.00 21.54 3APR 216

ATOM 5 CB ALA E 1 23.932 48.787 111.380 1.00 22.79 3APR 217

ATOM 6 N GLY E 2 23.656 45.723 110.336 1.00 19.17 3APR 218

ATOM 7 CA GLY E 2 24.216 44.393 110.087 1.00 17.35 3APR 219

ATOM 8 C GLY E 2 25.653 44.308 110.579 1.00 16.49 3APR 220

ATOM 9 O GLY E 2 26.258 45.296 110.994 1.00 15.35 3APR 221

ATOM 10 N VAL E 3 26.213 43.110 110.521 1.00 16.21 3APR 222

ATOM 11 CA VAL E 3 27.594 42.879 110.975 1.00 16.02 3APR 223

ATOM 12 C VAL E 3 28.569 43.613 110.055 1.00 15.69 3APR 224

ATOM 13 O VAL E 3 28.429 43.444 108.822 1.00 16.43 3APR 225

ATOM 14 CB VAL E 3 27.834 41.363 110.979 1.00 16.66 3APR 226

ATOM 15 CG1 VAL E 3 29.259 41.013 111.404 1.00 17.35 3APR 227

ATOM 16 CG2 VAL E 3 26.811 40.649 111.850 1.00 17.03 3APR 228
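The ATOM records above can be read programmatically. The sketch below assumes whitespace-separated fields in the order shown on the slide (record type, atom serial, atom name, residue name, chain, residue number, x, y, z, occupancy, B-factor). Real PDB files are fixed-column, so a production parser should slice by column instead. `Atom`, `parse_atom_line` and `distance` are names invented for this example.

```python
import math
from typing import NamedTuple, Optional

class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float

def parse_atom_line(line: str) -> Optional[Atom]:
    """Parse one whitespace-separated ATOM record; return None for other lines."""
    fields = line.split()
    if len(fields) < 11 or fields[0] != "ATOM":
        return None
    return Atom(int(fields[1]), fields[2], fields[3], fields[4], int(fields[5]),
                float(fields[6]), float(fields[7]), float(fields[8]),
                float(fields[9]), float(fields[10]))

def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in angstroms."""
    return math.sqrt((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2)

ca1 = parse_atom_line("ATOM 2 CA ALA E 1 22.957 47.648 111.613 1.00 22.40 3APR 214")
ca2 = parse_atom_line("ATOM 7 CA GLY E 2 24.216 44.393 110.087 1.00 17.35 3APR 219")
ca_ca = distance(ca1, ca2)  # spacing between consecutive alpha carbons
```

For the two alpha carbons above the distance comes out at roughly 3.8 Å, the usual spacing between consecutive Cα atoms.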

Views of a Protein


Wireframe

Ball and stick

Views of a Protein


Spacefill

Cartoon

CPK colors

Carbon = green, black

Nitrogen = blue

Oxygen = red

Sulfur = yellow

Hydrogen = white

The Protein Folding Problem


Central question of molecular biology:

“Given a particular sequence of amino acid residues (primary structure), what will the tertiary/quaternary structure of the resulting protein be?”

Input: AAVIKYGCAL…

Output: φ1 ψ1, φ2 ψ2, … = backbone conformation (no side chains yet)

Forces Driving Protein Folding


It is believed that hydrophobic collapse is a key driving force for protein folding


Hydrophobic core


Polar surface interacting with solvent


Minimum volume (no cavities)


Disulfide bond formation stabilizes


Hydrogen bonds


Polar and electrostatic interactions


Folding Help


Proteins are, in fact, only marginally stable


Native state is typically only 5 to 10 kcal/mole more stable than the unfolded form

Many proteins help in folding

Protein disulfide isomerase: catalyzes shuffling of disulfide bonds

Chaperones: break up aggregates and (in theory) unfold misfolded proteins
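To get a feel for what "5 to 10 kcal/mole more stable" means, the sketch below converts a stability difference into a folded fraction. It assumes a simple two-state (folded/unfolded) equilibrium at room temperature, which is a simplification introduced here and not something the slides state. `folded_fraction` is an illustrative name.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def folded_fraction(stability_kcal, temperature_k=298.15):
    """Fraction of molecules folded in a two-state model.

    stability_kcal is how much more stable the native state is than the
    unfolded state (positive = native favored).
    """
    k_fold = math.exp(stability_kcal / (R_KCAL * temperature_k))
    return k_fold / (1.0 + k_fold)

at_5 = folded_fraction(5.0)
at_10 = folded_fraction(10.0)
```

Under these assumptions even 5 kcal/mole keeps well over 99% of molecules folded. The margin is still small compared with the many individual interactions in a protein, which is why the native state counts as only marginally stable.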


The Hydrophobic Core


Hemoglobin A is the protein in red blood cells (erythrocytes) responsible for binding oxygen.

The mutation E6→V in the β chain places a hydrophobic Val on the surface of hemoglobin

The resulting “sticky patch” causes hemoglobin S to agglutinate (stick together) and form fibers which deform the red blood cell and do not carry oxygen efficiently

Sickle cell anemia was the first identified molecular disease

Sickle Cell Anemia


Sequestering hydrophobic residues in the protein core
protects proteins from hydrophobic agglutination.

Computational Problems in Protein Folding


Two key questions:


Evaluation: how can we tell a correctly-folded protein from an incorrectly folded protein?

H-bonds, electrostatics, hydrophobic effect, etc.

Derive a function, see how well it does on “real” proteins

Optimization: once we get an evaluation function, can we optimize it?

Simulated annealing/Monte Carlo

Evolutionary computation (EC)

Heuristics

Fold Optimization


Simple lattice models (HP-models)

Two types of residues: hydrophobic and polar

2-D or 3-D lattice

The only force is hydrophobic collapse

Score = number of H-H contacts

Scoring Lattice Models

H/P model scoring: count noncovalent hydrophobic interactions.

Sometimes: penalize for buried polar or surface hydrophobic residues
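The basic H/P score can be sketched as follows, assuming a 2-D square lattice and a fold given as one (x, y) coordinate per residue. The function name `hp_score` and this coordinate representation are choices made for the example. The optional penalties for buried polar or surface hydrophobic residues are not included.

```python
def hp_score(sequence, coords):
    """Count H-H pairs that are lattice neighbors but not chain neighbors.

    sequence: string of 'H' and 'P', one letter per residue
    coords:   list of (x, y) lattice positions, one per residue
    """
    if len(sequence) != len(coords):
        raise ValueError("need one coordinate per residue")
    if len(set(coords)) != len(coords):
        raise ValueError("fold is not self-avoiding (bump)")

    index_at = {pos: i for i, pos in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each pair is counted once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts

# HPPH folded around a unit square: the two H residues touch.
square = hp_score("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)])
# The same chain stretched out has no noncovalent contacts.
straight = hp_score("HPPH", [(0, 0), (1, 0), (2, 0), (3, 0)])
```

Residues that are adjacent in the chain are excluded because they are always lattice neighbors and so say nothing about the fold.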


What can we do with lattice models?


For smaller polypeptides, exhaustive search can be used

Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process

For larger chains, other optimization and search methods must be used


Greedy, branch and bound


Evolutionary computing, simulated annealing


Graph theoretical methods
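For short chains the exhaustive search mentioned above is easy to write down. The sketch below enumerates every self-avoiding walk on a 2-D square lattice and keeps the one with the most H-H contacts. It is an illustration only: it does not remove symmetric duplicates, and its running time grows exponentially with chain length. `best_fold` and `count_hh` are hypothetical names.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def count_hh(sequence, coords):
    """Number of H-H lattice contacts between residues not adjacent in the chain."""
    index_at = {pos: i for i, pos in enumerate(coords)}
    total = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for neighbor in ((x + 1, y), (x, y + 1)):  # count each pair once
            j = index_at.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                total += 1
    return total

def best_fold(sequence):
    """Return (best_score, absolute-direction string) by exhaustive search."""
    best = (-1, "")

    def extend(coords, path):
        nonlocal best
        if len(coords) == len(sequence):
            score = count_hh(sequence, coords)
            if score > best[0]:
                best = (score, path)
            return
        x, y = coords[-1]
        for name, (dx, dy) in MOVES.items():
            nxt = (x + dx, y + dy)
            if nxt not in coords:  # keep the walk self-avoiding
                extend(coords + [nxt], path + name)

    extend([(0, 0)], "")
    return best

score, path = best_fold("HPPHPPH")
```

On a square lattice two residues can only touch if their chain positions differ by an odd number. A sequence such as HPHPH can therefore never score above zero, however it is folded.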


Learning from Lattice Models

The “hydrophobic zipper” effect:


Ken Dill ~ 1997

Representing a lattice model

Absolute directions

UURRDLDRRU

Relative directions

LFRFRRLLFFL

Advantage of relative directions: immediate reversals (UD or RL in absolute directions) cannot occur

Only three directions: L, R, F

What about bumps?

LFRRR

Bad score

Use a better representation
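The bump problem is easy to reproduce. The sketch below converts a relative-direction string into lattice coordinates and reports the first move that lands on an occupied site. It assumes the chain starts at the origin facing "up" and that the first letter is interpreted relative to that heading. Both are conventions chosen for this example, and `walk_relative` is an illustrative name.

```python
# Headings listed clockwise: up, right, down, left.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]
TURN = {"L": -1, "F": 0, "R": 1}

def walk_relative(moves, start=(0, 0), heading=0):
    """Return (coords, bump_index).

    bump_index is the index of the first move that revisits a site,
    or None if the whole walk is self-avoiding.
    """
    coords = [start]
    occupied = {start}
    for i, move in enumerate(moves):
        heading = (heading + TURN[move]) % 4
        dx, dy = HEADINGS[heading]
        nxt = (coords[-1][0] + dx, coords[-1][1] + dy)
        if nxt in occupied:
            return coords, i
        coords.append(nxt)
        occupied.add(nxt)
    return coords, None

ok_coords, ok_bump = walk_relative("LFRFRRLLFFL")
bad_coords, bad_bump = walk_relative("LFRRR")
```

With these conventions LFRFRRLLFFL is a valid self-avoiding walk, while LFRRR collides on its fifth move: the three right turns bring the chain back onto the site reached by the initial L.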


Preference-order representation

Each position has two “preferences”

If it can’t have either of the two, it will take the “least favorite” path if possible

Example: {LR},{FL},{RL},{FR},{RL},{RL},{FR},{RF}

Can still cause bumps: {LF},{FR},{RL},{FL},{RL},{FL},{RF},{RL},{FL}
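One way to decode a preference-order string is sketched below. At each position it tries the first preference, then the second, then the remaining ("least favorite") direction, and it gives up only when all three neighboring sites are occupied. The starting heading ("up") and the name `decode_preferences` are assumptions of this example. The slides do not specify the decoding in this detail.

```python
# Headings listed clockwise: up, right, down, left.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]
TURN = {"L": -1, "F": 0, "R": 1}

def decode_preferences(prefs, start=(0, 0), heading=0):
    """Decode a list of two-letter preferences such as ["LR", "FL"].

    Returns (coords, chosen_moves, ok). ok is False if some position had
    no free site in any of the three directions (an unavoidable bump).
    """
    coords = [start]
    occupied = {start}
    chosen = []
    for pair in prefs:
        least = next(d for d in "LFR" if d not in pair)
        for move in (pair[0], pair[1], least):
            h = (heading + TURN[move]) % 4
            nxt = (coords[-1][0] + HEADINGS[h][0], coords[-1][1] + HEADINGS[h][1])
            if nxt not in occupied:
                heading = h
                coords.append(nxt)
                occupied.add(nxt)
                chosen.append(move)
                break
        else:
            return coords, "".join(chosen), False  # dead end
    return coords, "".join(chosen), True

first = decode_preferences(["LR", "FL", "RL", "FR", "RL", "RL", "FR", "RF"])
second = decode_preferences(["LF", "FR", "RL", "FL", "RL", "FL", "RF", "RL", "FL"])
```

Under these conventions the first example decodes to the moves LFRFRRLF, falling back to the least favorite direction at the seventh position. The second example runs into a dead end at its last position, which shows why this representation can still produce bumps.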


More Realistic Models


Higher resolution lattices (45° lattice, etc.)

Off-lattice models

Local moves

Optimization/search methods and φ/ψ representations

Greedy search

Branch and bound

EC, Monte Carlo, simulated annealing, etc.

The Other Half of the Picture


Now that we have a more realistic off-lattice model, we need a better energy function to evaluate a conformation (fold).

Theoretical force field:

$\Delta G = \Delta G_{\text{van der Waals}} + \Delta G_{\text{h-bonds}} + \Delta G_{\text{solvent}} + \Delta G_{\text{coulomb}}$

Empirical force fields

Start with a database

Look at neighboring residues: similar to known protein folds?

Threading: Fold recognition


Given:

Sequence: IVACIVSTEYDVMKAAR…

A database of molecular coordinates

Map the sequence onto each fold

Evaluate

Objective 1: improve scoring function

Objective 2: folding

Secondary Structure Prediction

AGVGTVPMTAYGNDIQYYGQVT

A-VGIVPM-AYGQDIQY-GQVT…
AG-GIIP--AYGNELQ--GQVT…
AGVCTVPMTA---ELQYYG--T…

AGVGTVPMTAYGNDIQYYGQVT
----hhhHHHHHHhhh--eeEE…

Secondary Structure Prediction


Easier than folding


Current algorithms can predict secondary structure with 70-80% accuracy

Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222.

Based on frequencies of occurrence of residues in helices and sheets

PHD

Neural network based

Uses a multiple sequence alignment

Rost & Sander, Proteins, 1994, 19, 55-72

Chou-Fasman Parameters

Chou-Fasman Algorithm

Identify α-helices

4 out of 6 contiguous amino acids that have P(a) > 100

Extend the region until 4 amino acids with P(a) < 100 found

Compute ΣP(a) and ΣP(b); if the region is >5 residues and ΣP(a) > ΣP(b), identify as a helix

Repeat for β-sheets [use P(b)]

If an α and a β region overlap, the overlapping region is predicted according to ΣP(a) and ΣP(b)
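The helix steps above can be sketched in a few lines. The P(a) and P(b) tables below are the values commonly tabulated for Chou-Fasman and should be checked against the parameters slide. The stop rule is read here as "the next four residues all have P(a) < 100", which is one interpretation of the extension step. `predict_first_helix` is a hypothetical name, and β-sheet prediction and overlap resolution are left out.

```python
P_A = {"E": 151, "M": 145, "A": 142, "L": 121, "K": 114, "F": 113, "Q": 111,
       "W": 108, "I": 108, "V": 106, "D": 101, "H": 100, "R": 98, "T": 83,
       "S": 77, "C": 70, "Y": 69, "N": 67, "P": 57, "G": 57}
P_B = {"V": 170, "I": 160, "Y": 147, "F": 138, "W": 137, "L": 130, "C": 119,
       "T": 119, "Q": 110, "M": 105, "R": 93, "N": 89, "H": 87, "A": 83,
       "S": 75, "G": 75, "K": 74, "P": 55, "D": 54, "E": 37}

def _all_weak(tetrad):
    """True if this is a full run of four residues, each with P(a) < 100."""
    return len(tetrad) == 4 and all(P_A[r] < 100 for r in tetrad)

def predict_first_helix(seq):
    """Return (start, end) of the first predicted helix, end exclusive, or None."""
    n = len(seq)
    for nucleus in range(n - 5):
        window = seq[nucleus:nucleus + 6]
        # Nucleation: at least 4 of 6 residues with P(a) > 100.
        if sum(P_A[r] > 100 for r in window) < 4:
            continue
        lo, hi = nucleus, nucleus + 6
        # Extend right, then left, until four weak residues block the way.
        while hi < n and not _all_weak(seq[hi:hi + 4]):
            hi += 1
        while lo > 0 and not _all_weak(seq[max(0, lo - 4):lo]):
            lo -= 1
        region = seq[lo:hi]
        if len(region) > 5 and sum(P_A[r] for r in region) > sum(P_B[r] for r in region):
            return lo, hi
    return None

seq = "CAENKLDHVRGPTCILFMTWYNDGP"
helix = predict_first_helix(seq)
```

Near the chain ends fewer than four residues remain to test, so this sketch simply extends to the end of the sequence there. A fuller implementation would need an explicit rule for that case.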


Chou-Fasman, cont’d

Identify hairpin turns:

P(t) = f(i) of the residue × f(i+1) of the next residue × f(i+2) of the following residue × f(i+3) of the residue at position (i+3)

Predict a hairpin turn starting at positions where:

P(t) > 0.000075

The average P(turn) for the four residues > 100

ΣP(a) < ΣP(turn) > ΣP(b) for the four residues

Accuracy: 60-65%

Chou-Fasman Example

CAENKLDHVRGPTCILFMTWYNDGP

CAENKL: potential helix (!C and !N)

Residues with P(a) < 100: RNCGPSTY

Extend: when we reach RGPT, we must stop

CAENKLDHV: ΣP(a) = 972, ΣP(b) = 843

Declare alpha helix

Identifying a hairpin turn

VRGP: P(t) = 0.000085

Average P(turn) = 113.25

Avg P(a) = 79.5, Avg P(b) = 98.25
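The turn numbers for VRGP can be reproduced with a short script. The table below holds commonly tabulated Chou-Fasman values for just the four residues in this window. Only the entries actually used here are cross-checked by the slide's own figures, so the rest should be treated as assumptions. `turn_check` is an illustrative name.

```python
# residue: (P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3))
PARAMS = {
    "V": (106, 170, 50, 0.062, 0.048, 0.028, 0.053),
    "R": (98, 93, 95, 0.070, 0.106, 0.099, 0.085),
    "G": (57, 75, 156, 0.102, 0.085, 0.190, 0.152),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
}

def turn_check(window):
    """Evaluate the three hairpin-turn conditions for a four-residue window."""
    if len(window) != 4:
        raise ValueError("a turn window has exactly four residues")
    avg_a = sum(PARAMS[r][0] for r in window) / 4
    avg_b = sum(PARAMS[r][1] for r in window) / 4
    avg_turn = sum(PARAMS[r][2] for r in window) / 4
    p_t = 1.0
    for offset, residue in enumerate(window):
        p_t *= PARAMS[residue][3 + offset]  # f(i), f(i+1), f(i+2), f(i+3)
    is_turn = p_t > 0.000075 and avg_turn > 100 and avg_a < avg_turn > avg_b
    return {"p_t": p_t, "avg_turn": avg_turn, "avg_a": avg_a,
            "avg_b": avg_b, "is_turn": is_turn}

result = turn_check("VRGP")
```

Comparing averages over the four residues is equivalent to comparing the sums in the rule, since all three are divided by the same number of residues.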


Lots More to Come


Microarray analysis


Mass Spectrometry


Interactions/ Knockouts


Synthetic Lethality


RPPA


.....
