Genetic Algorithms and Protein Folding


23 Oct 2013


Based on lecture by Dr. Steffen Schulze-Kremer

http://www.techfak.uni-bielefeld.de/bcd/Curric/ProtEn/proten.html

Genetic Algorithm: a heuristic method that operates on pieces of information like nature does on genes in the course of evolution.


Individuals are represented by a linear string of letters of an alphabet
(in nature nucleotides, in genetic algorithms bits)



Individuals are allowed to mutate, crossover and reproduce.

A fitness function evaluates individuals.



Depending on the generation replacement mode a subset of parents and
offspring enters the next reproduction cycle.



After a number of iterations the population consists of individuals that
are well adapted in terms of the fitness function.



It cannot be proven that the individuals of a final generation contain an
optimal solution for the objective encoded in the fitness function.

I. Initialise a population of individuals.

This can be done either randomly or with domain-specific background knowledge to start the search with promising seed individuals. (Where available, the latter is always recommended.)



Individuals are represented as a string of bits.


A fitness function must be defined that takes an individual as input and returns a number (or a vector) that can be used as a measure of the quality (fitness) of that individual.



The application should be formulated in a way that the desired
solution to the problem coincides with the most successful individual
according to the fitness function.
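As a toy illustration (not taken from the lecture), the sketch below encodes an integer as a 5-bit string and defines a fitness that peaks at a hypothetical target value, so that the fittest individual coincides with the desired solution. The target 21, the bit length and the names `decode` and `fitness` are assumptions made for this example only.

```python
from itertools import product

TARGET = 21   # hypothetical desired solution
N_BITS = 5    # hypothetical length of an individual

def decode(individual):
    """Interpret a bit string such as '10101' as an unsigned integer."""
    return int(individual, 2)

def fitness(individual):
    """Higher is better; the maximum (0) is reached only at the target."""
    return -(decode(individual) - TARGET) ** 2

# Exhaustive check over all 2**5 individuals: the fittest one is the target.
all_individuals = ["".join(bits) for bits in product("01", repeat=N_BITS)]
best = max(all_individuals, key=fitness)
print(best, decode(best))  # 10101 21
```

For realistic problems the search space is far too large for such an exhaustive check, which is where the genetic algorithm comes in.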


II. Evaluate all individuals of the initial population.

III. Generate new individuals.

The reproduction probability for an individual is proportional to its relative fitness within the current generation.
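Fitness-proportional reproduction can be sketched as follows. This is a minimal example, assuming fitness values are non-negative and that higher means better (an energy, where lower is better, would first need to be transformed); the function names are hypothetical.

```python
import random

def reproduction_probabilities(fitnesses):
    """Relative fitness of each individual within the current generation."""
    total = sum(fitnesses)
    if total == 0:
        # Assumption: with no fitness signal at all, fall back to a uniform choice.
        return [1.0 / len(fitnesses)] * len(fitnesses)
    return [f / total for f in fitnesses]

def select_parents(population, fitnesses, k, rng=random):
    """Draw k parents, each with probability proportional to its relative fitness."""
    probs = reproduction_probabilities(fitnesses)
    return rng.choices(population, weights=probs, k=k)

print(reproduction_probabilities([1, 3]))  # [0.25, 0.75]
```

An individual with three times the fitness of another is thus expected to be picked three times as often.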

Crossover

two-point crossover

0101001111000011010101011110111
1010101101011100101110001010101

uniform crossover

0101001111000011010101011110111
1010101101011100101110001010101

Genetic Operators:

Mutation. Substitute one or more bits of an individual randomly by a new value (0 or 1).

Variation. Change the bits in a way that the number encoded by them is slightly incremented or decremented.

Crossover. Exchange parts (single bits or strings of bits) of one individual with the corresponding parts of another individual. Originally, only one-point crossover was performed, but theoretically one can process up to L - 1 different crossover sites (with L as the length of the individual).
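The three operators can be sketched on lists of bits as below. This is an illustrative sketch, not the lecture's implementation: the function names, the per-bit mutation rate and the plain unsigned-binary encoding assumed by `vary` are all assumptions.

```python
import random

def mutate(bits, rate, rng=random):
    """Mutation: with probability `rate`, substitute a bit by a random value (0 or 1)."""
    return [rng.randint(0, 1) if rng.random() < rate else b for b in bits]

def vary(bits, step):
    """Variation: shift the unsigned integer encoded by `bits` by `step` (clamped)."""
    value = int("".join(map(str, bits)), 2) + step
    value = max(0, min(value, 2 ** len(bits) - 1))
    return [int(c) for c in format(value, f"0{len(bits)}b")]

def n_point_crossover(a, b, points):
    """Crossover: exchange alternating segments of a and b at the given cut sites."""
    child1, child2 = [], []
    swap, prev = False, 0
    for cut in sorted(points) + [len(a)]:
        seg_a, seg_b = a[prev:cut], b[prev:cut]
        child1 += seg_b if swap else seg_a
        child2 += seg_a if swap else seg_b
        swap, prev = not swap, cut
    return child1, child2

def uniform_crossover(a, b, rng=random):
    """Crossover: decide independently at every position whether to swap the bits."""
    child1, child2 = [], []
    for x, y in zip(a, b):
        if rng.random() < 0.5:
            x, y = y, x
        child1.append(x)
        child2.append(y)
    return child1, child2

print(n_point_crossover([0] * 6, [1] * 6, [2, 4]))
# ([0, 0, 1, 1, 0, 0], [1, 1, 0, 0, 1, 1])
```

Passing a single cut site gives one-point crossover, two sites give two-point crossover, and up to L - 1 sites are possible for individuals of length L.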





IV. Select individuals for the new parent generation.

Schemes:

1) The complete offspring is selected while all parents are discarded (original genetic algorithm). This is motivated by the biological model and is called total generation replacement.

2) The n best individuals (from old and new generation) are selected. This method is called elitist generation replacement.
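The two replacement schemes differ only in which pool the next generation is drawn from. A minimal sketch, assuming a scalar fitness where higher is better (function names are hypothetical):

```python
def total_generation_replacement(parents, offspring):
    """Scheme 1: all parents are discarded, the complete offspring survives."""
    return list(offspring)

def elitist_generation_replacement(parents, offspring, fitness, n):
    """Scheme 2: keep the n best individuals from old and new generation together."""
    pool = list(parents) + list(offspring)
    return sorted(pool, key=fitness, reverse=True)[:n]

parents, offspring = [5, 1], [3, 4]
identity = lambda x: x  # toy fitness: the individual is its own score
print(total_generation_replacement(parents, offspring))                 # [3, 4]
print(elitist_generation_replacement(parents, offspring, identity, 2))  # [5, 4]
```

With elitist replacement the best fitness in the population can never decrease from one generation to the next; with total replacement it can.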


V. Go back to step II until either a desired fitness value has been reached or a predefined number of iterations has been performed.
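Steps I to V can be put together in a short loop. The sketch below is one possible arrangement under stated assumptions, not the lecture's code: it uses one-point crossover, bit-flip mutation, fitness-proportional parent choice and elitist replacement, and all parameter names and default values are hypothetical.

```python
import random

def run_ga(fitness, length, pop_size=20, generations=50, target=None,
           mutation_rate=0.02, seed=0):
    """Return the best bit list found; fitness must be non-negative, higher is better."""
    rng = random.Random(seed)
    # I. Initialise a random population of bit strings.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # II. Evaluate all individuals.
        scores = [fitness(ind) for ind in population]
        # V. Stop early once the desired fitness value has been reached.
        if target is not None and max(scores) >= target:
            break
        # III. Generate new individuals (parents drawn proportionally to fitness).
        weights = [s + 1e-9 for s in scores]  # small offset avoids all-zero weights
        offspring = []
        while len(offspring) < pop_size:
            mum, dad = rng.choices(population, weights=weights, k=2)
            cut = rng.randrange(1, length)           # one-point crossover site
            child = mum[:cut] + dad[cut:]
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            offspring.append(child)
        # IV. Elitist generation replacement: best pop_size of parents + offspring.
        population = sorted(population + offspring, key=fitness, reverse=True)[:pop_size]
    return max(population, key=fitness)

# Toy usage: maximise the number of 1-bits in a 16-bit string.
best = run_ga(sum, length=16, target=16)
print(best, sum(best))
```

As the overview notes, nothing guarantees that the returned individual is the global optimum; the loop only ensures (through elitism) that the best fitness never gets worse.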

Init the first generation -> Evaluate -> Apply Genetic Operations -> Select the next generation

Representation Formalism

Hybrid approach: the genetic algorithm is configured to operate on numbers, not bit strings as in the original genetic algorithm.

Disadvantages:

The mathematical foundation of genetic algorithms holds only for binary representations, although some of the mathematical properties are also valid for a floating point representation.

Binary representations run faster in many applications.

An additional encoding/decoding process may be required to map numbers onto bit strings.
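The encoding/decoding step mentioned above could look like the following sketch, which quantises a real number from a known range onto a fixed-length bit string and back. The linear quantisation scheme, the 8-bit width and the function names are assumptions for illustration.

```python
def encode(value, lo, hi, n_bits):
    """Map a real number in [lo, hi] onto a bit string of length n_bits."""
    levels = 2 ** n_bits - 1
    index = round((value - lo) / (hi - lo) * levels)
    return format(index, f"0{n_bits}b")

def decode(bits, lo, hi):
    """Map a bit string back to a real number in [lo, hi]."""
    levels = 2 ** len(bits) - 1
    return lo + int(bits, 2) / levels * (hi - lo)

# Example: a torsion angle in degrees stored in 8 bits.
code = encode(60.0, -180.0, 180.0, 8)
print(code, decode(code, -180.0, 180.0))
```

The round trip is only exact up to the quantisation step (here 360/255 degrees), which is one reason a hybrid approach operating directly on numbers can be attractive.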


Protein Structure Prediction

Individuals - Protein Conformations

Fitness Function - Force Field

Representation

Cartesian 3D coordinates are not a good choice -> representation by torsion angles.

The frequency of each torsion angle in intervals of 10° was determined, and the ten most frequently occurring intervals are made available for substitution of individual torsion angles by the MUTATE operator.

At the beginning of the run, individuals were initialized with either a completely extended conformation, where all torsion angles are 180°, or by a random selection from the ten most frequently occurring intervals of each torsion angle.

For the ω torsion angle the constant value of 180° was used because of the rigidity of the peptide bond between the atoms C_i and N_{i+1}.
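The torsion-angle representation, the two initialisation modes and the MUTATE operator can be sketched as below. This is a simplified illustration restricted to phi and psi: the interval table holds placeholder numbers (not statistics from any structure database), and using the midpoint of a 10° interval as the angle value is an assumption.

```python
import random

ANGLES = ("phi", "psi")   # omega is not searched; it stays fixed
OMEGA = 180.0             # rigid peptide bond

# Hypothetical left edges (degrees) of the ten most frequent 10-degree
# intervals per torsion angle. Placeholder values for illustration only.
FREQUENT_INTERVALS = {
    "phi": [-70, -60, -80, -90, -100, -110, -120, -130, -140, 50],
    "psi": [-50, -40, -30, 120, 130, 140, 150, 160, -60, -20],
}

def pick_angle(name, rng):
    """Choose one frequent interval and return its midpoint (assumption)."""
    return rng.choice(FREQUENT_INTERVALS[name]) + 5.0

def extended_conformation(n_residues):
    """Completely extended start: every torsion angle is 180 degrees."""
    return [{"phi": 180.0, "psi": 180.0, "omega": OMEGA} for _ in range(n_residues)]

def random_conformation(n_residues, rng=random):
    """Random start drawn from the most frequent intervals of each angle."""
    return [{**{a: pick_angle(a, rng) for a in ANGLES}, "omega": OMEGA}
            for _ in range(n_residues)]

def mutate(conformation, rate, rng=random):
    """MUTATE: with probability `rate`, substitute an angle by a frequent value."""
    mutated = []
    for residue in conformation:
        new = dict(residue)
        for a in ANGLES:
            if rng.random() < rate:
                new[a] = pick_angle(a, rng)
        mutated.append(new)
    return mutated

start = extended_conformation(46)   # e.g. a 46-residue chain
print(mutate(start, 0.1, random.Random(0))[:2])
```

Restricting substitutions to frequently observed intervals keeps the search inside regions of torsion-angle space that real proteins actually populate.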

Search Space

Generally, molecules with n atoms have 3n - 6 degrees of freedom -> 100 residues * approximately 20 atoms per residue = 2000 atoms, i.e. 3 * 2000 - 6 = 5994 degrees of freedom.

Systems of equations with this number of variables are analytically intractable today.

Discrete approximation: (5 torsion angles per residue * 5 likely values per torsion angle)^100 = 25^100 conformations
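The arithmetic on this slide is easy to check with exact integers. The snippet below simply recomputes the quoted figures; the variable names are chosen for the example.

```python
n_residues = 100
atoms_per_residue = 20            # approximate value used on the slide

n_atoms = n_residues * atoms_per_residue
degrees_of_freedom = 3 * n_atoms - 6
print(degrees_of_freedom)         # 5994

# Discrete approximation as stated on the slide: 25 states per residue.
states_per_residue = 5 * 5
conformations = states_per_residue ** n_residues
print(len(str(conformations)))    # 140 decimal digits, i.e. roughly 10**139
```

Even the discretised space is far beyond exhaustive enumeration. If the five angles of a residue were treated as independent, each with five values, the count per residue would be 5^5 = 3125 rather than 25, so 25^100 may be read as a lower figure.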


Fitness Function - Potential Energy

Charmm energy function:

E = [sum of nine energy terms; symbols lost]

Simplified to:

$E = E_{tor} + E_{vdW} + E_{el}$

Energy terms:

bond length potential (set to const)
bond angle potential (set to const)
torsion angle potential
improper torsion angle potential (set to const)
van der Waals pair interactions
electrostatic potential
hydrogen bonds (set to const)
interaction with the solvent (set to const -> in vacuum)

(since there are no interactions with the solvent, there is not enough
force to drive the protein to a compact folded state)

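Simplified Energy Function

$E = E_{tor} + E_{vdW} + E_{el} + E_{pe}$

with the torsion angle potential $E_{tor}$, the van der Waals pair interactions $E_{vdW}$, the electrostatic potential $E_{el}$ and the pseudo entropic term $E_{pe}$.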

Empirical relation between the number of residues and the diameter:

First Test

Protein: crambin, 46 amino acids

Table 3. Steric Energies in the Last Generation





Table 2. R.m.s. Deviations to Native Crambin






Simple summation of different components has the disadvantage
that components with larger numbers would dominate the fitness
function whether or not they are important or of any significance
at all for a particular conformation.

In other words -> bad fitness function

The genetic algorithm favoured individuals with lowest total energy, which in this case was most easily achieved by optimising electrostatic contributions.
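A small numeric example makes the dominance effect concrete. The energy values below are invented for illustration and do not come from the crambin runs; only the pattern matters.

```python
def total_energy(components):
    """Simple additive fitness: the plain sum of all energy components."""
    return sum(components.values())

# Hypothetical conformations (arbitrary energy units, lower is better).
a = {"torsion": 10.0, "vdw": 20.0, "electrostatic": -500.0}
b = {"torsion": 2.0, "vdw": 5.0, "electrostatic": -450.0}

print(total_energy(a), total_energy(b))  # -470.0 -443.0
# a wins on the sum although b is better in two of the three components:
print(total_energy(a) < total_energy(b))  # True
```

Because the electrostatic term is an order of magnitude larger than the others, it alone decides the ranking, which is the behaviour described above.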


Improvements

Instead of using separate phi and psi value distributions, apply a phi-psi (2D) clustering procedure.

Use a secondary structure prediction algorithm (70% accuracy).

Specialised Genetic Operators

LOCAL TWIST (local conformation changes by performing the ring closure algorithm for polymers)

The LOCAL TWIST operator led to significant improvements in prediction accuracy and also to a substantial decrease in overall computation time.

Improvements (2)

Fitness Function -> vector

r.m.s. only for verification

Vector Fitness Function

Candidate selection for the next generation:



If there is an individual that has better (i.e. lower) values in each fitness
component, then we take it. Continue until no unambiguously better
individuals are found.



Then remove the worst individuals, i.e. those with higher values in each
fitness component than any other individual.



The remaining set of individuals is heuristically reduced until the exact
number of individuals for the next generation is reached. This is done by
iteratively removing an individual with the worst fitness value in a randomly
selected fitness component.
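The three selection steps can be sketched on plain fitness vectors (tuples, lower is better in every component). This is one reading of the procedure, with assumptions stated: "unambiguously better" is taken to mean better than every other remaining individual in all components, and "worst" to mean worse than every other remaining individual in all components. Function names are hypothetical.

```python
import random

def dominates(a, b):
    """True if vector a has a lower value than b in each fitness component."""
    return all(x < y for x, y in zip(a, b))

def select_next_generation(candidates, n, rng=random):
    """Pick n fitness vectors from candidates following the three steps."""
    remaining = list(candidates)
    selected = []
    # 1. Take individuals that beat all others in every component.
    while remaining and len(selected) < n:
        best = next((c for c in remaining
                     if all(dominates(c, o) for o in remaining if o is not c)), None)
        if best is None:
            break
        selected.append(best)
        remaining.remove(best)
    # 2. Remove individuals that are worse than all others in every component.
    while len(selected) + len(remaining) > n:
        worst = next((c for c in remaining
                      if all(dominates(o, c) for o in remaining if o is not c)), None)
        if worst is None:
            break
        remaining.remove(worst)
    # 3. Heuristic reduction: drop the worst in a randomly chosen component.
    while len(selected) + len(remaining) > n:
        k = rng.randrange(len(remaining[0]))
        remaining.remove(max(remaining, key=lambda c: c[k]))
    return selected + remaining

vectors = [(1, 1), (2, 3), (3, 2), (9, 9)]
print(select_next_generation(vectors, 2, random.Random(0)))
```

In the toy call, (1, 1) is taken in step 1 and (9, 9) is dropped in step 2; which of the two incomparable vectors (2, 3) and (3, 2) survives depends on the randomly chosen component in step 3.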

Tests on other proteins (LOCAL TWIST and r.m.s. fitness) also gave close-to-native conformations (less than 3.0 Å).

Capability of Genetic Algorithm in General?

Conclusion: given an appropriate fitness function, the genetic algorithm achieves the desired results.

Test case: crambin, 46 a.a.

I. Fitness vector: polar, [...], [...], [...], hydro, Crippen and solvent -> r.m.s. 6.27

[...], hydro, Crippen and solvent decreased with r.m.s.

polar, [...], [...] misled the algorithm to non-native conformations

II. Fitness vector: Crippen, clash, hydro and scatter + constraints on the secondary structures -> r.m.s. 4.36

trypsin inhibitor -> 6.65

Conclusions

Genetic algorithms proved to be an efficient search tool for 3-D representations of proteins. For a 3-D protein model with a simple, additive force field as fitness function and using a rather small population, the genetic algorithm produced several individuals (i.e. protein conformations) of dissimilar topology but each with highly optimized fitness values.

Given an appropriate fitness function, the genetic algorithm application described here finds the desired solution within only small deviations.

The major problem lies in the fitness function. If there were one or a set of indicators that return 1 for "the object is a native protein conformation" and 0 for "the object is not a native protein conformation", one could expect the genetic algorithm approach to deliver reasonably accurate ab initio predictions. However, neither mathematical models nor empirical, semi-empirical or statistical force fields are yet accurate enough to reliably discriminate native from non-native conformations without additional constraints. Thus, the genetic algorithm produces (sub-)optimal conformations in a different sense than that of nativeness.

Notice:

The same problem (fitness/scoring function) exists in the protein docking problem. The correct transformation (within 3-5 Å) is found in realistic time (in almost all cases). However, assigning a high score to the native complex is a problematic task. A proper scoring function is not yet known.

Side Chain Placement

rms 1.86