Genetic Algorithms and Protein Folding
Based on lecture by Dr. Steffen Schulze-Kremer
http://www.techfak.uni-bielefeld.de/bcd/Curric/ProtEn/proten.html
Genetic Algorithm: a heuristic method that operates on pieces of information the way nature operates on genes in the course of evolution.
• Individuals are represented by a linear string of letters of an alphabet (in nature nucleotides, in genetic algorithms bits).
• Individuals are allowed to mutate, crossover and reproduce.
• A fitness function evaluates individuals.
• Depending on the generation replacement mode, a subset of parents and offspring enters the next reproduction cycle.
• After a number of iterations the population consists of individuals that are well adapted in terms of the fitness function.
• It cannot be proven that the individuals of a final generation contain an optimal solution for the objective encoded in the fitness function.
I. Initialise a population of individuals.
This can be done either randomly or with domain-specific background knowledge to start the search with promising seed individuals. (Where available, the latter is always recommended.)
• Individuals are represented as a string of bits.
• A fitness function must be defined that takes an individual as input and returns a number (or a vector) that can be used as a measure of the quality (fitness) of that individual.
The application should be formulated so that the desired solution to the problem coincides with the most successful individual according to the fitness function.
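A minimal sketch of step I, assuming individuals are Python lists of bits. The names `init_population` and `fitness` are hypothetical, and the toy fitness (number of 1-bits) only stands in for a real, problem-specific fitness function.

```python
import random

def init_population(size, length, seeds=None, rng=None):
    """Create `size` bit-string individuals of `length` bits.

    `seeds` is an optional list of promising individuals (domain knowledge);
    the remaining slots are filled with random bit strings.
    """
    rng = rng or random.Random()
    population = [list(s) for s in (seeds or [])][:size]
    while len(population) < size:
        population.append([rng.randint(0, 1) for _ in range(length)])
    return population

def fitness(individual):
    """Toy fitness (assumption): number of 1-bits, higher is better."""
    return sum(individual)

# One seed individual plus five random ones, then evaluate all of them.
pop = init_population(size=6, length=8, seeds=[[1] * 8], rng=random.Random(0))
scores = [fitness(ind) for ind in pop]
```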
II. Evaluate all individuals of the initial population.
III. Generate new individuals. The reproduction probability of an individual is proportional to its relative fitness within the current generation.
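Fitness-proportional reproduction can be sketched as follows. This assumes non-negative fitness values where higher is better; `select_parents` is a hypothetical helper built on the standard library's `random.choices`.

```python
import random

def select_parents(population, fitnesses, k, rng):
    """Draw k parents (with replacement); the chance of drawing an individual
    is its fitness divided by the sum of all fitness values."""
    if sum(fitnesses) == 0:              # degenerate case: fall back to uniform
        return [rng.choice(population) for _ in range(k)]
    return rng.choices(population, weights=fitnesses, k=k)

rng = random.Random(0)
population = ["A", "B"]
fitnesses = [1.0, 3.0]                   # relative fitness 25% and 75%
parents = select_parents(population, fitnesses, k=10_000, rng=rng)
share_b = parents.count("B") / len(parents)   # close to 0.75
```

If the fitness is an energy (lower is better), the values would first have to be transformed into non-negative weights; how to do that is a design choice the slides do not specify.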
Crossover (figure): two-point crossover and uniform crossover, each illustrated on the parent strings
0101001111000011010101011110111
1010101101011100101110001010101
Genetic Operators:
Mutation. Substitute one or more bits of an individual randomly by a new value (0 or 1).
Variation. Change the bits so that the number encoded by them is slightly incremented or decremented.
Crossover. Exchange parts (single bits or strings of bits) of one individual with the corresponding parts of another individual. Originally, only one-point crossover was performed, but theoretically one can process up to L-1 different crossover sites (with L as the length of the individual).
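The operators above can be sketched on lists of bits. All function names, the mutation `rate` parameter and the step size of 1 in `variation` are assumptions made for illustration; two-point crossover needs individuals of at least 3 bits.

```python
import random

def mutate(bits, rate, rng):
    """Replace each bit, with probability `rate`, by a new random value (0 or 1)."""
    return [rng.randint(0, 1) if rng.random() < rate else b for b in bits]

def variation(bits, rng):
    """Increment or decrement the integer encoded by `bits` by 1 (wrapping around)."""
    n = len(bits)
    value = int("".join(map(str, bits)), 2)
    value = (value + rng.choice((-1, 1))) % (1 << n)
    return [int(c) for c in format(value, f"0{n}b")]

def one_point_crossover(a, b, rng):
    """Swap the tails of two parents at one of the L - 1 possible sites."""
    site = rng.randint(1, len(a) - 1)
    return a[:site] + b[site:], b[:site] + a[site:]

def two_point_crossover(a, b, rng):
    """Swap the segment between two distinct crossover sites."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def uniform_crossover(a, b, rng):
    """Decide independently for every position which parent each child copies."""
    swap = [rng.random() < 0.5 for _ in a]
    child1 = [y if s else x for x, y, s in zip(a, b, swap)]
    child2 = [x if s else y for x, y, s in zip(a, b, swap)]
    return child1, child2

# The two parent strings shown on the crossover slide.
rng = random.Random(1)
p1 = [int(c) for c in "0101001111000011010101011110111"]
p2 = [int(c) for c in "1010101101011100101110001010101"]
c1, c2 = two_point_crossover(p1, p2, rng)
u1, u2 = uniform_crossover(p1, p2, rng)
```

In every crossover variant each position of the two children holds exactly the two bits the parents had at that position; only their assignment to the children changes.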
IV. Select individuals for the new parent generation.
Schemes:
1) The complete offspring is selected while all parents are discarded (original genetic algorithm). This is motivated by the biological model and is called total generation replacement.
2) The n best individuals (from old and new generation) are selected. This method is called elitist generation replacement.
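The two replacement schemes differ only in which pool the next generation is drawn from. A minimal sketch, assuming a scalar fitness where higher is better (function names are hypothetical):

```python
def total_replacement(parents, offspring, fitness, n):
    """Scheme 1: discard all parents and keep (the first n of) the offspring."""
    return offspring[:n]

def elitist_replacement(parents, offspring, fitness, n):
    """Scheme 2: keep the n fittest individuals from parents and offspring."""
    return sorted(parents + offspring, key=fitness, reverse=True)[:n]

parents = [[1, 1, 1], [0, 0, 0]]
offspring = [[1, 0, 0], [1, 1, 0]]
next_total = total_replacement(parents, offspring, sum, n=2)
next_elite = elitist_replacement(parents, offspring, sum, n=2)
# next_total keeps both children; next_elite keeps the best parent and the best child.
```

With elitist replacement the best fitness in the population can never get worse from one generation to the next; with total replacement it can.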
V. Go back to step II until either a desired fitness value has been reached or a predefined number of iterations has been performed.
Flowchart: Init the first generation -> Evaluate -> Apply Genetic Operations -> Select the next generation (loop back to Evaluate)
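Steps I to V can be put together in one loop. The sketch below is an illustration under assumptions, not the lecture's implementation: it uses one-point crossover, bit-flip mutation, elitist replacement and a toy fitness (number of 1-bits); `run_ga` and all of its parameters are hypothetical.

```python
import random

def run_ga(fitness, length, pop_size=20, generations=50, target=None,
           mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # I. Initialise a random population of bit strings.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # II. Evaluate all individuals.
        scores = [fitness(ind) for ind in population]
        if target is not None and max(scores) >= target:
            break                        # V. desired fitness reached
        # III. Generate offspring; parents are drawn in proportion to fitness.
        weights = [s + 1e-9 for s in scores]   # avoid an all-zero weight list
        offspring = []
        while len(offspring) < pop_size:
            a, b = rng.choices(population, weights=weights, k=2)
            site = rng.randint(1, length - 1)          # one-point crossover
            child = a[:site] + b[site:]
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]                 # bit-flip mutation
            offspring.append(child)
        # IV. Elitist generation replacement.
        population = sorted(population + offspring, key=fitness,
                            reverse=True)[:pop_size]
    return max(population, key=fitness)

best = run_ga(fitness=sum, length=16, target=16)
```

As stated on the overview slide, nothing guarantees that `best` is the optimum; the loop simply stops at the target fitness or after the iteration limit.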
Representation Formalism
• Hybrid approach: the genetic algorithm is configured to operate on numbers, not bit strings as in the original genetic algorithm.
Disadvantages:
– The mathematical foundation of genetic algorithms holds only for binary representations, although some of the mathematical properties are also valid for a floating point representation.
– Binary representations run faster in many applications.
– An additional encoding/decoding process may be required to map numbers onto bit strings.
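The encoding/decoding step mentioned in the last point could look like this. It is a sketch assuming a plain fixed-point mapping of a bounded real number onto `n_bits` bits; the slides do not say which encoding was used.

```python
def encode(value, lo, hi, n_bits):
    """Map a real number in [lo, hi] onto a bit string of length n_bits."""
    levels = (1 << n_bits) - 1
    step = round((value - lo) / (hi - lo) * levels)
    return format(step, f"0{n_bits}b")

def decode(bits, lo, hi):
    """Inverse mapping: bit string back to a real number in [lo, hi]."""
    levels = (1 << len(bits)) - 1
    return lo + int(bits, 2) / levels * (hi - lo)

# A torsion angle of 60 degrees stored in 8 bits over the range [-180, 180].
bits = encode(60.0, -180.0, 180.0, 8)
angle = decode(bits, -180.0, 180.0)
```

The round trip is only exact up to the resolution of the grid, here 360/255 degrees per step.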
Protein Structure Prediction
Individuals: protein conformations
Fitness function: force field
Representation: Cartesian 3D coordinates are not a good choice -> representation by torsion angles
• The frequency of each torsion angle in intervals of 10° was determined, and the ten most frequently occurring intervals are made available for substitution of individual torsion angles by the MUTATE operator.
• At the beginning of the run, individuals were initialized with either a completely extended conformation, where all torsion angles are 180°, or a random selection from the ten most frequently occurring intervals of each torsion angle.
• For the ω torsion angle the constant value of 180° was used because of the rigidity of the peptide bond between the atoms C_i and N_{i+1}.
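A sketch of the torsion-angle representation and the MUTATE operator. The interval table holds placeholder values, not real statistics. Using the midpoint of an interval, treating only phi and psi (side-chain angles are left out), and the `rate` parameter are all assumptions made for illustration.

```python
import random

# Placeholder table (NOT real statistics): left edges of ten 10-degree
# intervals per torsion angle, standing in for the ten most frequent ones.
FREQUENT_INTERVALS = {
    "phi": [-160, -150, -140, -130, -120, -110, -100, -90, -70, -60],
    "psi": [-60, -50, -40, -30, 120, 130, 140, 150, 160, 170],
}
OMEGA = 180.0  # kept constant: the peptide bond is rigid

def draw(name, rng):
    """Pick one of the frequent intervals and return its midpoint."""
    return rng.choice(FREQUENT_INTERVALS[name]) + 5.0

def extended_conformation(n_residues):
    """Fully extended start: all torsion angles are 180 degrees."""
    return [{"phi": 180.0, "psi": 180.0, "omega": OMEGA}
            for _ in range(n_residues)]

def random_conformation(n_residues, rng):
    """Random start: every angle is drawn from its frequent intervals."""
    return [{"phi": draw("phi", rng), "psi": draw("psi", rng), "omega": OMEGA}
            for _ in range(n_residues)]

def mutate(conformation, rate, rng):
    """MUTATE: with probability `rate`, replace a phi or psi angle by a value
    from its frequent intervals. Omega is never changed."""
    new = [dict(res) for res in conformation]
    for res in new:
        for name in FREQUENT_INTERVALS:
            if rng.random() < rate:
                res[name] = draw(name, rng)
    return new

start = extended_conformation(4)
mutated = mutate(start, rate=1.0, rng=random.Random(0))
```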
Search Space
Generally, molecules with n atoms have $3n - 6$ degrees of freedom
-> 100 residues * approximately 20 atoms per residue = 2000 atoms, i.e. $3 \cdot 2000 - 6 = 5994$ degrees of freedom
Systems of equations with this number of variables are analytically intractable today.
Discrete approximation: (5 torsion angles per residue * 5 likely values per torsion angle)$^{100}$ = $25^{100}$ conformations
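The two numbers on this slide can be checked with a few lines of integer arithmetic (variable names are illustrative):

```python
residues = 100
atoms_per_residue = 20                 # approximate figure from the slide
n_atoms = residues * atoms_per_residue
degrees_of_freedom = 3 * n_atoms - 6   # 3n - 6

states_per_residue = 5 * 5             # the slide's discrete approximation
conformations = states_per_residue ** residues
digits = len(str(conformations))       # number of decimal digits of 25**100

print(degrees_of_freedom)  # 5994
print(digits)              # 140
```

Even the discrete approximation therefore leaves on the order of 10^139 conformations, far too many to enumerate.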
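Fitness Function - Potential Energy
CHARMM energy function:
$E = E_{bond} + E_{angle} + E_{tor} + E_{impr} + E_{vdW} + E_{el} + E_{H} + E_{cr} + E_{c\phi}$
Simplified to:
$E = E_{tor} + E_{vdW} + E_{el}$
• $E_{bond}$: bond length potential (set to const)
• $E_{angle}$: bond angle potential (set to const)
• $E_{tor}$: torsion angle potential
• $E_{impr}$: improper torsion angle potential (set to const)
• $E_{vdW}$: van der Waals pair interactions
• $E_{el}$: electrostatic potential
• $E_{H}$: hydrogen bonds (set to const)
• interaction with the solvent (set to const -> in vacuum)
($E_{cr}$ and $E_{c\phi}$ are constraint terms.)
(Since there are no interactions with the solvent, there is not enough force to drive the protein to a compact folded state.)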
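Simplified Energy Function
$E = E_{tor} + E_{vdW} + E_{el} + E_{pe}$
(torsion angle, van der Waals and electrostatic terms, plus $E_{pe}$, a pseudo entropic term)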
Empirical relation between the number of residues and the diameter:
First Test
Protein: Crambin, 46 a.a.
Table 3. Steric Energies in the Last Generation
Table 2. R.m.s. Deviations to Native Crambin
Simple summation of different components has the disadvantage that components with larger numbers dominate the fitness function, whether or not they are important or of any significance for a particular conformation.
In other words -> bad fitness function.
The genetic algorithm favoured individuals with the lowest total energy, which in this case was most easily achieved by optimising electrostatic contributions.
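A toy illustration of why plain summation is a poor fitness function. The component energies below are made-up numbers for two hypothetical conformations, not values from the lecture's tables.

```python
# Made-up component energies (kcal/mol) for two hypothetical conformations.
conformations = {
    "A": {"torsion": 12.0, "vdw": -40.0, "electrostatic": -1000.0},
    "B": {"torsion": 3.0, "vdw": -95.0, "electrostatic": -850.0},
}

def total_energy(components):
    """Additive fitness: simply sum all components (lower is better)."""
    return sum(components.values())

totals = {name: total_energy(c) for name, c in conformations.items()}
best = min(totals, key=totals.get)
# A wins on the sum although B is better in two of the three components:
# the electrostatic term is an order of magnitude larger than the others.
```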
Improvements
• Instead of using separate phi and psi value distributions, apply a phi-psi (2D) clustering procedure.
• Use a secondary structure prediction algorithm (70% accuracy).
• Specialised genetic operators: LOCAL TWIST (local conformation changes by performing the ring closure algorithm for polymers).
The LOCAL TWIST operator led to significant improvements in prediction accuracy and also to a substantial decrease in overall computation time.
Improvements (2)
Fitness function -> vector
r.m.s. only for verification
Vector Fitness Function
Candidate selection for the next generation:
• If there is an individual that has better (i.e. lower) values in each fitness component, it is taken. This continues until no unambiguously better individuals are found.
• Then the worst individuals are removed, i.e. those with higher values in each fitness component than any other individual.
• The remaining set of individuals is heuristically reduced until the exact number of individuals for the next generation is reached. This is done by iteratively removing an individual with the worst fitness value in a randomly selected fitness component.
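The three selection steps can be sketched as follows. This is one possible reading of the slide: "better" is taken as strictly lower in every component than all other candidates, and "worst" as strictly higher in every component than all others. Individuals are represented only by their fitness vectors, and the function names are hypothetical.

```python
import random

def dominates(a, b):
    """True if fitness vector a is lower (better) than b in every component."""
    return all(x < y for x, y in zip(a, b))

def select_next_generation(candidates, n, rng):
    """Pick n fitness vectors (lower is better in each component)."""
    pool = list(candidates)
    chosen = []
    # 1. Repeatedly take an individual that beats all others in every component.
    while pool and len(chosen) < n:
        best = next((c for c in pool
                     if all(dominates(c, o) for o in pool if o is not c)), None)
        if best is None:
            break
        chosen.append(best)
        pool.remove(best)
    # 2. Drop individuals that are worse than all others in every component.
    while len(chosen) + len(pool) > n and len(pool) > 1:
        worst = next((c for c in pool
                      if all(dominates(o, c) for o in pool if o is not c)), None)
        if worst is None:
            break
        pool.remove(worst)
    # 3. Trim: remove the worst value in a randomly selected component.
    while len(chosen) + len(pool) > n:
        k = rng.randrange(len(pool[0]))
        pool.remove(max(pool, key=lambda c: c[k]))
    return chosen + pool

candidates = [(1, 1), (2, 3), (3, 2), (9, 9)]
survivors = select_next_generation(candidates, n=2, rng=random.Random(0))
```

Here (1, 1) is taken in step 1, (9, 9) is dropped in step 2, and step 3 decides at random between (2, 3) and (3, 2).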
Tests on other proteins (LOCAL TWIST and r.m.s. fitness) also gave close-to-native conformations (less than 3.0 Å).
Capability of Genetic Algorithm in General?
Conclusion: given an appropriate fitness function, the genetic algorithm achieves the desired results.
I. Fitness vector: polar, …, …, …, hydro, Crippen and solvent -> r.m.s. 6.27
…, hydro, Crippen, solvent decreased with r.m.s.
polar, …, … misled the algorithm to a non-native conformation
II. Fitness vector: Crippen, clash, hydro and scatter + constraints on the secondary structures -> r.m.s. 4.36
Test case – Crambin, 46 a.a.
Trypsin inhibitor -> 6.65
Conclusions
• Genetic algorithms proved to be an efficient search tool for 3-D representations of proteins. For a 3-D protein model with a simple, additive force field as fitness function and using a rather small population, the genetic algorithm produced several individuals (i.e. protein conformations) of dissimilar topology but each with highly optimized fitness values.
• Given an appropriate fitness function, the genetic algorithm application described here finds the desired solution within only small deviations.
• The major problem lies in the fitness function. If there were one or a set of indicators that return 1 for "the object is a native protein conformation" and 0 for "the object is not a native protein conformation", one could expect the genetic algorithm approach to deliver reasonably accurate ab initio predictions. However, neither mathematical models nor empirical, semi-empirical or statistical force fields are yet accurate enough to reliably discriminate native from non-native conformations without additional constraints. Thus, the genetic algorithm produces (sub-)optimal conformations in a different sense than that of nativeness.
Notice: the same problem (fitness - scoring function) exists in the Protein Docking problem. The correct transformation (within 3-5 Å) is found in realistic time (in almost all cases). However, assigning a high score to the native complex is a problematic task. We do not yet know a proper scoring function.
Side Chain Placement
rms 1.86