# Protein structure prediction


Oct 19, 2013 (4 years and 6 months ago)


May 26, 2011

HW #8 due today

Quiz #3 on Tuesday, May 31

Learning objectives - Understand the biochemical basis of secondary structure prediction programs. Become familiar with the databases that hold secondary structure information. Understand neural networks and how they help to predict secondary structure.

Workshop - Predict secondary structure of p53.

Homework #9 - Due June 2

What is secondary structure?

Three major types:

Alpha Helical Regions

Beta Strand Regions

Coils, Turns, Extended (anything else)

Can we predict the final structure?

http://en.wikipedia.org/wiki/Protein_folding

Some Prediction Methods

ab initio methods

Based on physical properties of amino acids and bonding patterns

Statistics of amino acid distributions in known structures

Chou-Fasman

Sequence similarity to sequences with known structure

PSIPRED

Chou-Fasman

First widely used procedure

Output - helix, strand or turn

Percent accuracy: 60-65%

PSI-BLAST Predict Secondary Structure (PSIPRED)

Three steps:

1) Generation of position specific scoring matrix.

2) Prediction of initial secondary structure

3) Filtering of predicted structure

Conformational parameters for α-helical, β-strand, and turn amino acids (from Chou and Fasman, 1978)

PSIPRED

Uses multiple aligned sequences for prediction.

Uses training set of folds with known structure.

Uses a two-stage neural network to predict structure based on position specific scoring matrices generated by PSI-BLAST (Jones, 1999)

The first network converts a window of 15 amino acids into raw scores for h (helix), e (sheet), c (coil) or terminus.

The second network filters the first output. For example, an output of hhhhehhhh might be converted to hhhhhhhhh.
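
The effect of the filtering step can be illustrated with a much simpler rule-based stand-in. The sketch below is not the PSIPRED filtering network: `smooth` and its `max_len` parameter are hypothetical names, and the rule (absorb a short run that is flanked on both sides by the same state) is an assumption chosen only to reproduce the example above.

```python
from itertools import groupby

def smooth(pred: str, max_len: int = 1) -> str:
    """Replace runs of length <= max_len that are flanked on both sides by the same state."""
    # Run-length encode the prediction, e.g. "hhe" -> [("h", 2), ("e", 1)].
    runs = [(state, len(list(group))) for state, group in groupby(pred)]
    out = []
    for i, (state, length) in enumerate(runs):
        is_interior = 0 < i < len(runs) - 1
        if is_interior and length <= max_len and runs[i - 1][0] == runs[i + 1][0]:
            state = runs[i - 1][0]  # absorb the isolated run into its neighbours
        out.append(state * length)
    return "".join(out)

print(smooth("hhhhehhhh"))  # hhhhhhhhh
```

A longer strand such as `hhhheeeehhhh` is left untouched by this rule, since only runs up to `max_len` residues are treated as noise.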

Can obtain a Q3 value of 70-78% (may be the highest achievable)

Neural networks

Computer neural networks are based on simulation of adaptive learning in networks of real neurons.

Neurons connect to each other via synaptic junctions which are either stimulatory or inhibitory.

Adaptive learning involves the formation or suppression of the right combinations of stimulatory and inhibitory synapses so that a set of inputs produces an appropriate output.

Neural Networks (cont. 1)

The computer version of the neural network involves identification of a set of inputs (amino acids in the sequence), which transmit through a network of connections.

At each layer, inputs are numerically weighted and the combined result is passed to the next layer.

Ultimately a final output, a decision (helix, sheet or coil), is produced.
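
The layer-by-layer weighting described above can be sketched as a tiny feed-forward pass. Everything numeric here is a hypothetical toy (4 inputs, 2 hidden units, made-up weights), and the sigmoid squashing function is an assumption; it only shows how weighted sums flow to a helix/sheet/coil decision.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One layer: each unit takes a weighted sum of all inputs, then a squashing function."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def predict(inputs, hidden_w, hidden_b, out_w, out_b):
    hidden = layer(inputs, hidden_w, hidden_b)   # first weighted layer
    scores = layer(hidden, out_w, out_b)         # one score per output class
    labels = ["helix", "sheet", "coil"]
    return labels[scores.index(max(scores))], scores

# Hypothetical toy weights: 4 inputs -> 2 hidden units -> 3 outputs.
hidden_w = [[0.5, -0.2, 0.1, 0.4], [-0.3, 0.8, -0.5, 0.2]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0], [-1.0, 1.0], [0.2, 0.2]]
out_b = [0.0, 0.0, -0.1]

label, scores = predict([1, 0, 0, 1], hidden_w, hidden_b, out_w, out_b)
print(label, [round(s, 3) for s in scores])
```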

Neural Networks (cont. 2)

90% of the set of known structures was used for training.

10% was used to evaluate the performance of the neural network after the training session.
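
A 90/10 split of this kind can be sketched as follows. The protein identifiers, the function name `split_dataset` and the fixed random seed are hypothetical; the slides do not say how the split was drawn.

```python
import random

def split_dataset(proteins, test_fraction=0.1, seed=0):
    """Shuffle once, then hold out a fraction for evaluation after training."""
    shuffled = list(proteins)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

proteins = [f"protein_{i:03d}" for i in range(100)]  # hypothetical identifiers
train_set, held_out = split_dataset(proteins)
print(len(train_set), len(held_out))  # 90 10
```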

Neural Networks (cont. 3)

During the training phase, selected sets of proteins of known
structure were scanned, and if the decisions were incorrect, the
input weightings were adjusted by the software to produce the
desired result.

Training runs were repeated until the success rate was
maximized.

Careful selection of the training set is an important aspect of
this technique. The set must contain as wide a range of
different fold types as possible without duplications of
structural types that may bias the decisions.

Neural Networks (cont. 4)

An additional component of the PSIPRED procedure involves
sequence alignment with similar proteins.

The rationale is that some amino acid positions in a sequence
contribute more to the final structure than others. (This has been
demonstrated by systematic mutation experiments in which each
consecutive position in a sequence is substituted by a spectrum of amino
acids. Some positions are remarkably tolerant of substitution, while
others have unique requirements.)

To predict secondary structure accurately, one should place less weight
on the tolerant positions, which clearly contribute little to the structure.

One must also put more weight on the intolerant positions.
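
One way to see which positions are tolerant is to measure how variable each alignment column is. The sketch below uses Shannon entropy per column, which is an assumption for illustration and not how PSIPRED itself weights positions (it works from PSI-BLAST scoring matrices); the four-sequence alignment is a hypothetical toy.

```python
import math
from collections import Counter

def column_entropy(column: str) -> float:
    """Shannon entropy in bits: 0 for a fully conserved column, higher when residues vary."""
    total = len(column)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(column).values())

def conservation_profile(alignment):
    # zip(*alignment) walks the alignment column by column.
    return [column_entropy("".join(col)) for col in zip(*alignment)]

alignment = ["ACDK", "ACEK", "ACDR", "ACEH"]  # hypothetical aligned sequences
for pos, entropy in enumerate(conservation_profile(alignment), start=1):
    print(pos, round(entropy, 2))
```

In this toy alignment, positions 1 and 2 are intolerant (entropy 0) and positions 3 and 4 are tolerant of substitution.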

Figure (Jones, 1999): PSIPRED network diagram. Input: 15 groups of 21 units (1 unit for each amino acid plus one specifying the end); each row specifies an amino acid position and provides information on tolerant or intolerant positions. The three outputs are helix, strand or coil. Second stage: filtering network.
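
The 15 x 21 input layout can be sketched as an encoding function. This is a simplification under stated assumptions: it uses one-hot units (a single 1 per residue), whereas PSIPRED feeds in position specific scores from PSI-BLAST; `encode_window` and the residue ordering in `AMINO_ACIDS` are hypothetical choices.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 15
UNITS = len(AMINO_ACIDS) + 1  # 20 amino acids + 1 unit marking "past the end"

def encode_window(sequence: str, center: int):
    """Return 15 groups of 21 units for the window centred on `center` (0-based)."""
    half = WINDOW // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        group = [0] * UNITS
        if 0 <= pos < len(sequence):
            group[AMINO_ACIDS.index(sequence[pos])] = 1
        else:
            group[-1] = 1  # window extends beyond the chain terminus
        vector.extend(group)
    return vector

seq = "MEETHAPYRGVCNNM"
vec = encode_window(seq, center=0)
print(len(vec))  # 315
```

For the first residue, the seven leading groups all have only the end-marker unit set, because the window hangs off the N-terminus.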

Example of Output from PSIPRED

PSIPRED PREDICTION RESULTS

Key

Conf: Confidence (0=low, 9=high)

Pred: Predicted secondary structure (H=helix, E=strand, C=coil)

AA: Target sequence

Conf: 923788850068899998538983213555268822788714786424388875156215

Pred: CCEEEEEEEHHHHHHHHHHCCCCCCHHHHHHCCCCCEEEEECCCCCCHHHHHHHCCCCCC

AA: KDIQLLNVSYDPTRELYEQYNKAFSAHWKQETGDNVVIDQSHGSQGKQATSSVINGIEAD

10 20 30 40 50 60
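
Output in this layout is straightforward to read programmatically. The sketch below is an assumption about how one might do it, not a PSIPRED tool: `parse_psipred` and `low_confidence` are hypothetical helpers, and the short record is a made-up example in the same three-row layout.

```python
def parse_psipred(text: str):
    """Collect the Conf, Pred and AA rows, skipping legend lines that reuse the same labels."""
    valid = {
        "Conf": str.isdigit,
        "Pred": lambda s: set(s) <= set("HEC"),
        "AA": lambda s: s.isalpha() and s.isupper(),
    }
    rows = {"Conf": "", "Pred": "", "AA": ""}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if sep and key in valid and value and valid[key](value):
            rows[key] += value  # long predictions wrap over several blocks
    return rows["Conf"], rows["Pred"], rows["AA"]

def low_confidence(conf: str, pred: str, aa: str, threshold: int = 3):
    """List (position, residue, predicted state, confidence) where confidence <= threshold."""
    return [(i + 1, aa[i], pred[i], int(c))
            for i, c in enumerate(conf) if int(c) <= threshold]

record = """\
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence
Conf: 92378885
Pred: CCEEEEHH
AA: KDIQLLNV
"""

conf, pred, aa = parse_psipred(record)
print(low_confidence(conf, pred, aa))
```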

How to calculate Q3?

Sequence:

MEETHAPYRGVCNNM

Actual Structure:

CCCCCHHHHHHEEEE

PSIPRED Prediction:

CCCCCHHHHHHEEEH

Q3 = 14/15 x 100 = 93%
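
The same calculation as a short function: Q3 is the percentage of residues whose predicted state matches the actual state. The function name `q3` is a hypothetical choice; the two strings are the ones from the example above.

```python
def q3(actual: str, predicted: str) -> float:
    """Percentage of positions where the predicted state equals the actual state."""
    if len(actual) != len(predicted):
        raise ValueError("structures must have the same length")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * matches / len(actual)

actual = "CCCCCHHHHHHEEEE"
predicted = "CCCCCHHHHHHEEEH"
print(round(q3(actual, predicted)))  # 93
```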

Recognizing motifs in proteins.

PROSITE is a database of protein families and
domains.

Most proteins can be grouped, on the basis of
similarities in their sequences, into a limited
number of families.

Proteins or protein domains belonging to a
particular family generally share functional
attributes and are derived from a common
ancestor.

PROSITE Database

Contains 1612 documentation entries.

Signatures are produced by scanning the
PROSITE database with your query. A
“signature” of a protein allows one to place a
protein within a specific function class based on
structure and/or function.

An example of a documentation entry in PROSITE is:

http://ca.expasy.org/cgi-bin/nicedoc.pl?PDOC50020

Signatures are produced from
profiles and patterns.

Profile - “a table of position-specific amino acid weights and gap costs. These numbers (also referred to as scores) are used to calculate a similarity score for any alignment between a profile and a sequence, or parts of a profile and a sequence. An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence.”

Sequences in one profile and the
PSSM associated with the profile

F K L L S H C L L V

F K A F G Q T M F Q

Y P I V G Q E L L G

F P V V K E A I L K

F K V L A A V I A D

L E F I S E C I I Q

F K L L G N V L V C

Rows are amino acids; columns are positions 1-10 of the aligned sequences.

     1   2   3   4   5   6   7   8   9  10
A  -18 -10  -1  -8   8  -3   3 -10  -2  -8
C  -22 -33 -18 -18 -22 -26  22 -24 -19  -7
D  -35   0 -32 -33  -7   6 -17 -34 -31   0
E  -27  15 -25 -26  -9  23  -9 -24 -23  -1
F   60 -30  12  14 -26 -29 -15   4  12 -29
G  -30 -20 -28 -32  28 -14 -23 -33 -27  -5
H  -13 -12 -25 -25 -16  14 -22 -22 -23 -10
I    3 -27  21  25 -29 -23  -8  33  19 -23
K  -26  25 -25 -27  -6   4 -15 -27 -26   0
L   14 -28  19  27 -27 -20  -9  33  26 -21
M    3 -15  10  14 -17 -10  -9  25  12 -11
N  -22  -6 -24 -27   1   8 -15 -24 -24  -4
P  -30  24 -26 -28 -14 -10 -22 -24 -26 -18
Q  -32   5 -25 -26  -9  24 -16 -17 -23   7
R  -18   9 -22 -22 -10   0 -18 -23 -22  -4
S  -22  -8 -16 -21  11   2  -1 -24 -19  -4
T  -10 -10  -6  -7  -5  -8   2 -10  -7 -11
V    0 -25  22  25 -19 -26   6  19  16 -16
W    9 -25 -18 -19 -25 -27 -34 -20 -17 -28
Y   34 -18  -1   1 -23 -12 -19   0   0 -18
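
Scoring a sequence against this PSSM amounts to looking up one value per position and summing. The sketch below reads the matrix above with the detached hyphens taken as minus signs; it ignores gap costs (ungapped windows only), and the `cutoff` value of 200 is a hypothetical threshold, not one taken from PROSITE.

```python
PSSM_TEXT = """\
A -18 -10  -1  -8   8  -3   3 -10  -2  -8
C -22 -33 -18 -18 -22 -26  22 -24 -19  -7
D -35   0 -32 -33  -7   6 -17 -34 -31   0
E -27  15 -25 -26  -9  23  -9 -24 -23  -1
F  60 -30  12  14 -26 -29 -15   4  12 -29
G -30 -20 -28 -32  28 -14 -23 -33 -27  -5
H -13 -12 -25 -25 -16  14 -22 -22 -23 -10
I   3 -27  21  25 -29 -23  -8  33  19 -23
K -26  25 -25 -27  -6   4 -15 -27 -26   0
L  14 -28  19  27 -27 -20  -9  33  26 -21
M   3 -15  10  14 -17 -10  -9  25  12 -11
N -22  -6 -24 -27   1   8 -15 -24 -24  -4
P -30  24 -26 -28 -14 -10 -22 -24 -26 -18
Q -32   5 -25 -26  -9  24 -16 -17 -23   7
R -18   9 -22 -22 -10   0 -18 -23 -22  -4
S -22  -8 -16 -21  11   2  -1 -24 -19  -4
T -10 -10  -6  -7  -5  -8   2 -10  -7 -11
V   0 -25  22  25 -19 -26   6  19  16 -16
W   9 -25 -18 -19 -25 -27 -34 -20 -17 -28
Y  34 -18  -1   1 -23 -12 -19   0   0 -18
"""

# One row per amino acid, one column per profile position.
PSSM = {fields[0]: [int(v) for v in fields[1:]]
        for fields in (line.split() for line in PSSM_TEXT.splitlines())}
WIDTH = len(PSSM["A"])

def score(window: str) -> int:
    """Sum the position-specific weight of each residue in a 10-residue window."""
    return sum(PSSM[aa][i] for i, aa in enumerate(window))

def scan(sequence: str, cutoff: int):
    """Return (1-based start, score) for every ungapped window scoring >= cutoff."""
    hits = []
    for start in range(len(sequence) - WIDTH + 1):
        s = score(sequence[start:start + WIDTH])
        if s >= cutoff:
            hits.append((start + 1, s))
    return hits

print(score("FKLLSHCLLV"))                   # 221
print(scan("GGFKLLSHCLLVGG", cutoff=200))    # [(3, 221)]
```

The first aligned sequence scores highly against its own profile, while windows shifted by even one residue fall far below the cut-off.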

How are the patterns constructed?

ALRDFATHDDVCGK..

SMTAEATHDSVACY..

ECDQAATHEAVTHR..

Sequences necessary for structure or function are aligned manually by experts in the field. Then a pattern is created.

A-T-H-[DE]-X-V-X(4)-{ED}

This pattern is translated as: Ala, Thr, His, [Asp or Glu], any, Val, any, any, any, any, any but Glu or Asp
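
The pattern syntax maps closely onto regular expressions, which makes the translation easy to check. The converter below is a minimal sketch, assuming only the elements used on this slide (single residues, `[...]` sets, `{...}` exclusions, `X` and repeat counts); it does not handle anchors or other PROSITE syntax, and the two test sequences are hypothetical.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split "X(4)" or "X(2,4)" into the residue part and an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core in ("X", "x"):
            core = "."                        # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any amino acid except these
        # [..] sets and single letters are already valid regex syntax.
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

regex = prosite_to_regex("A-T-H-[DE]-X-V-X(4)-{ED}")
print(regex)  # ATH[DE].V.{4}[^ED]
print(bool(re.search(regex, "MKATHDDVCGKLAQ")))  # True
print(bool(re.search(regex, "MKATHDDVCGKLEQ")))  # False
```

The second sequence fails only because the final position is Glu, which the `{ED}` element excludes.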

Example of a pattern in a
PROSITE record

ID ZINC_FINGER_C3HC4; PATTERN.

PA C-X-H-X-[LIVMFY]-C-X(2)-C-[LIVMYA]

Scanning the PROSITE database

“Scan a sequence against PROSITE patterns and profiles” allows the user to scan a query sequence against the PROSITE database for matching patterns and profiles. It uses dynamic programming to determine optimal alignments. If the alignment produces a high score (a hit), then the hit is shown to the user.
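
As an illustration of the dynamic programming idea, here is a minimal local alignment score in the Smith-Waterman style. It is a sketch under simplifying assumptions: the match, mismatch and gap values are hypothetical, and it compares two plain sequences rather than a sequence and a profile with position-specific weights and gap costs.

```python
def local_alignment_score(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Best local alignment score between a and b, computed row by row."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Each cell is the best of: start fresh (0), extend diagonally, or open a gap.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(local_alignment_score("GGATHDDVGG", "ATHDDV"))  # 12
```

Because the score can never drop below zero, the flanking `GG` residues do not penalise the match, which is what makes the alignment local.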

http://www.expasy.ch/prosite/

If a “hit” is generated, the program gives an output that shows the region of the query that contains the pattern and a reference to the 3-D structure database if available.

Example of output from Prosite
Scan

RPSBlast

Reverse PSI-BLAST, or rpsblast, is a program that searches a query protein sequence (or several protein sequences) against a database of position specific scoring matrices. The PSSMs are derived from conserved protein sequences that have known functions/structure.

3D structure data

The largest 3D structure database is the Protein Data Bank (PDB)

It contains over 20,000 records

Each record contains 3D coordinates for
macromolecules

80% of the records were obtained from X-ray diffraction studies, 20% from NMR.

ATOM 1 N ARG A 14 22.451 98.825 31.990 1.00 88.84 N

ATOM 2 CA ARG A 14 21.713 100.102 31.828 1.00 90.39 C

ATOM 3 C ARG A 14 22.583 101.018 30.979 1.00 89.86 C

ATOM 4 O ARG A 14 22.105 101.989 30.391 1.00 89.82 O

ATOM 5 CB ARG A 14 21.424 100.704 33.208 1.00 93.23 C

ATOM 6 CG ARG A 14 20.465 101.880 33.215 1.00 95.72 C

ATOM 7 CD ARG A 14 20.008 102.147 34.637 1.00 98.10 C

ATOM 8 NE ARG A 14 18.999 103.196 34.718 1.00100.30 N

ATOM 9 CZ ARG A 14 18.344 103.507 35.833 1.00100.29 C

ATOM 10 NH1 ARG A 14 18.580 102.835 36.952 1.00 99.51 N

ATOM 11 NH2 ARG A 14 17.441 104.479 35.827 1.00100.79 N

Part of a record from the PDB
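
ATOM records are fixed-width, so fields are read by column position rather than by splitting on spaces; adjacent fields such as `1.00100.30` (occupancy followed by temperature factor) can touch. The sketch below assumes the standard PDB column layout. `make_atom_line` and `parse_atom` are hypothetical helpers, and the sample line is rebuilt with explicit padding from atom 8 of the record above.

```python
def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z, occ, b, element):
    """Build an ATOM line with explicit padding so the fixed columns are visible."""
    return ("ATOM".ljust(6) + str(serial).rjust(5) + " " + name.ljust(4) + " "
            + res_name.rjust(3) + " " + chain + str(res_seq).rjust(4) + " " * 4
            + f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}"
            + " " * 10 + element.rjust(2))

def parse_atom(line: str) -> dict:
    """Slice an ATOM record by column (0-based slices of the 1-based PDB columns)."""
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66
        "element": line[76:78].strip(),   # columns 77-78
    }

line = make_atom_line(8, " NE", "ARG", "A", 14, 18.999, 103.196, 34.718, 1.00, 100.30, "N")
atom = parse_atom(line)
print(atom["name"], atom["res_name"], atom["res_seq"], atom["x"], atom["b_factor"])
```

Splitting this line on whitespace would merge occupancy and temperature factor into one token, which is why column slicing is the safer choice.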

Quiz #3 prep

BLAST

Three steps

Gapped BLAST

Heuristic program

Uses S-W algorithm for final scoring

CLUSTAL W

Pairwise alignments

Difference matrix

Guide tree

Importance of having
highly similar sequences

Secondary Structure
prediction

Chou-Fasman

PSIPRED

Good for secondary str

Protein analysis

ProScan

RPSBlast