Oct 24, 2013

Contents

ABSTRACT

CHAPTER 1 GENERAL INTRODUCTION
  1 Introduction
  2 Literature Survey
  3 Research Methodology
  4 Outline of Thesis

CHAPTER 2 GENETIC ALGORITHM
  1 Introduction
  2 Darwin's Theory of Evolution - Natural Selection
    2-1 Evolution
    2-2 Natural selection
  3 Phenotype and Genotype in the Nature
  4 Genetic Algorithms
  5 Genetic Algorithm Analogy
  6 The Structures of Genetic Algorithm
  7 Genetic Algorithm Steps
  8 Elements of Genetic Algorithm
  9 Genetic Algorithm Operations
    9-1 Crossover
    9-2 Crossover rate
    9-3 Types of crossover
    9-4 Mutation
    9-5 Mutation rate
    9-6 Types of Mutations
  10 Conclusion

CHAPTER 3 MOLECULES AND DRUGS
  1 Introduction
  2 Molecule
  3 Small Molecules
  4 Drugs
  5 Quantitative Structure-Activity Relationship
  6 Quality of QSAR Models
  7 CoMFA and CoMSIA
  8 Conclusion

CHAPTER 4 RESEARCH WORK
  1 Introduction
  2 Molecular Similarity
  3 Quantum Molecular Similarity Measures
  4 The Grid Points
  5 Alignment Algorithm
  7 The Mechanism of the Program
  8 Progress of Fitness Value
  9 Results and Discussion
  10 Conclusion

CHAPTER 5 CONCLUSION

CHAPTER 6 FUTURE RESEARCH (PHARMACOPHORE)

REFERENCES

List of Figures

Figure 1 - Genotype and Phenotype (Michalewicz 2010)
Figure 2 - Mechanism of Genetic Algorithm (Michalewicz 2010)
Figure 3 - Single Point Crossover (Michalewicz 2010)
Figure 4 - N Points Crossover (Michalewicz 2010)
Figure 5 - Uniform Crossover (Michalewicz 2010)
Figure 6 - Mutation (Michalewicz 2010)
Figure 7 - Mutation Factor 2m (Michalewicz 2010)
Figure 8 - One Molecule in a Grid (Lock 2007)
Figure 9 - Distance between the Atoms of the Molecule and One Point on the Grid (Lock 2007)
Figure 10 - Represent a Molecule in a List (Lock 2007)
Figure 11 - Fitness Function
Figure 12 - Taking Points Values from a Grid into a List
Figure 13 - The Steps of our Software Algorithm
Figure 14 - Progress of Fitness Function
Figure 15 - Two Molecules Aligned by Hand
Figure 16 - Two Molecules Aligned by the Software
Figure 17 - Progress of Align Two Molecules (Step 1)
Figure 18 - Progress of Align Two Molecules (Step 2)
Figure 19 - Progress of Align Two Molecules (Step 3)
Figure 20 - Progress of Align Two Molecules (Step 4)
Figure 21 - Progress of Align Two Molecules (Step 5)
Figure 22 - Progress of Align Two Molecules (Step 6)
Figure 23 - Progress of Align Two Molecules (Step 7)



List of Tables

Table 1




Abstract

The genetic algorithm is one of the most common modern heuristic methods for solving computational problems. It mimics the characteristics of Darwinian evolution and has been applied successfully in many fields. This research uses a genetic algorithm to align molecules so that they match a target molecule as closely as possible. In particular, the genetic algorithm serves as a mechanism for positioning molecules in space and comparing them with the best position of a known structure in order to find the optimal solution, which is the optimal alignment. The optimal alignment is prepared data that serves as input to a subsequent application such as Comparative Molecular Similarity Indices Analysis (CoMSIA), a 3D method for predicting and correlating the biological activity of molecules.

The thesis describes how a transformation (translation and rotation) is performed on each molecule of the database. I used transformation matrices, which are well suited to translation and rotation, and represented each chromosome by the coordinates of each atom of the molecule together with its rotation angles. To find the best transformation, I apply genetic operations to the chromosomes to introduce diversity in a random way. In addition, the thesis summarizes related projects that are close to my work, such as genetic algorithms in molecular recognition and design, protein structure alignment using a genetic algorithm, and a genetic algorithm for protein threading. The research question is "how well does genetic algorithm optimisation perform the alignment of similar molecules".

Genetic Algorithm and Molecules Alignment

The School of Computer & Information Science

Chapter 1

General Introduction

1 Introduction

Good results have been obtained with a genetic algorithm developed to calculate the similarity between the X-ray powers of molecules, where one of the molecules is rigid. The genetic algorithm mimics Darwin's theory of evolution and natural selection. Evolution presumes that the development of life is a slow, gradual process that began from non-life or simple life (a simple solution in a genetic algorithm) and moves toward more developed forms (an optimal solution). In other words, complex creatures evolve naturally from simpler ancestors over time. Problems whose structure is not compatible with genetic algorithms are very difficult to solve in this way. The structure of molecules, however, is well defined, so optimizing it with a genetic algorithm is feasible. Similarity measurements based on molecular X-ray powers have been used to quantify the degree of resemblance between pairs of rigid three-dimensional molecules.

This thesis discusses the effect of including molecular flexibility on the similarities calculated with such measurements when searching large three-dimensional databases. The biological activities of molecules can be predicted by knowing how similar they are in shape. The research focuses on taking a molecule and aligning it to a target molecule by rotation and translation, using the steps of a genetic algorithm. I used grid points to represent the molecule for the computer, constraining the X-ray power on the molecule, and measured the distance between each atom of the molecule and each point on the grid using Pythagoras' theorem. The difference between two molecules is measured with the Euclidean distance.
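
The grid representation and the Euclidean comparison described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program developed in the thesis: the function names, the cubic grid centred on the origin, the two-atom example molecule, and in particular the choice of storing the distance to the nearest atom at each grid point are all hypothetical.

```python
import math


def distance(p, q):
    """Straight-line distance between two points (Pythagoras in n dimensions)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def grid_points(size, spacing):
    """Points of a size x size x size cubic grid centred on the origin."""
    offset = (size - 1) * spacing / 2.0
    return [
        (i * spacing - offset, j * spacing - offset, k * spacing - offset)
        for i in range(size)
        for j in range(size)
        for k in range(size)
    ]


def molecule_to_list(atoms, points):
    """One value per grid point: the distance from that point to the nearest atom.

    ASSUMPTION: how the atom-to-point distances are combined into a single
    grid value is not specified here; "nearest atom" is an illustrative choice.
    """
    return [min(distance(p, atom) for atom in atoms) for p in points]


def difference(values_a, values_b):
    """Euclidean distance between the grid lists of two molecules."""
    return distance(values_a, values_b)


points = grid_points(size=3, spacing=1.0)
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]        # hypothetical two-atom molecule
shifted = [(x + 0.5, y, z) for x, y, z in target]  # the same molecule, translated
print(difference(molecule_to_list(target, points), molecule_to_list(shifted, points)))
```

With this representation, a perfectly aligned copy of the target gives a difference of zero, and the value grows as the molecule is moved away, which is the property a fitness function for alignment needs.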

2 Literature Survey

Several studies are similar to my work, and it is useful to review how their authors approached molecules and genetic algorithms.

According to Thorner et al. (1996), the molecular electrostatic potential (MEP) has been used to measure the similarity between pairs of rigid three-dimensional (3-D) molecules. They report that better results were obtained with a genetic algorithm (GA) developed to calculate the resemblance between the MEPs of two molecules. The authors state that the development of effective and efficient programs for generating 3-D structures from two-dimensional (2-D) structures has led to a range of sophisticated systems for 3-D substructure searching.

The molecular electrostatic potential around a molecule is represented by a 3-D grid in which the ijk-th element is the real-number value of the MEP at location (i, j, k). The similarity between the target structure and a database structure is obtained in two stages: the corresponding grids are aligned to maximize their degree of overlap, and a measure such as the cosine coefficient is then used to calculate the similarity corresponding to this alignment. The authors did not rely on genetic algorithms alone to obtain the similarity. They also used a graph-theoretic algorithm to match a target structure against each of the structures in a database, applying a graph-generation procedure to all of the constituent structures. The similarity search is then effected by comparing the field-graph representing the target structure with the field-graph of each molecule in the database. The comparison uses the maximal common subgraph (MCS), which identifies the largest subgraph common to the pair of field-graphs. The resulting MCS specifies an alignment of the corresponding MEPs, and this alignment enables the calculation of the intermolecular similarity, for which a Gaussian approximation procedure is used.

In the genetic algorithm, the chromosome encodes a set of translations and rotations that are applied to the 3-D coordinates of one molecule to align its MEP with the MEP of another molecule that is fixed in space. The similarity value resulting from the Gaussian similarity calculation is used as the fitness function for the GA, which identifies the alignment by maximizing this fitness. The authors note that most organic molecules contain one or more rotatable bonds, which allows a molecule to exist in many different conformations, and that this is useful for MEP-based similarity searching. The genetic algorithm is designed to identify a set of geometric transformations (rotations, translations and torsional rotations) that gives the maximal overlap of a database structure's MEP with that of the target structure. The chromosome that represents the transformation contains one-byte components plus an extra one-byte component for each rotatable bond in the database structure; a single byte encodes 256 possible rotations. To avoid wasting time bringing the two molecules into the same general area of 3-D space, they start the algorithm by placing both the database structure and the target structure at the origin (0, 0, 0).

They used crossover and mutation as genetic operators. They tested one-point crossover, two-point crossover and uniform crossover, and obtained the best results with two-point crossover. The mutation operator visits each bit of the chromosome in turn and flips it (changing zero to one and vice versa). To choose between crossover and mutation, a number in the range 0-100 is generated: if the number is less than the crossover rate, crossover is performed; otherwise, mutation is performed.
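
The operators summarized above (two-point crossover, bit-flip mutation, and the choice between them by drawing a number in the range 0-100) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, chromosomes are plain Python lists of 0/1 values, and the per-bit flip probability `bit_rate` is an assumption introduced so that mutation does not invert every bit.

```python
import random


def two_point_crossover(parent_a, parent_b, rng):
    """Swap the segment between two distinct random cut points."""
    i, j = sorted(rng.sample(range(1, len(parent_a)), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b


def bit_flip_mutation(chromosome, bit_rate, rng):
    """Visit each bit in turn and flip it with probability bit_rate.

    ASSUMPTION: bit_rate is a hypothetical parameter, not taken from the paper.
    """
    return [1 - bit if rng.random() < bit_rate else bit for bit in chromosome]


def apply_operator(parent_a, parent_b, crossover_rate, bit_rate, rng):
    """Draw a number in 0-100: below crossover_rate do crossover, otherwise mutate."""
    if rng.uniform(0, 100) < crossover_rate:
        return two_point_crossover(parent_a, parent_b, rng)
    return (
        bit_flip_mutation(parent_a, bit_rate, rng),
        bit_flip_mutation(parent_b, bit_rate, rng),
    )


rng = random.Random(42)
parent_a = [0] * 16   # two one-byte components, all bits clear
parent_b = [1] * 16   # two one-byte components, all bits set
print(apply_operator(parent_a, parent_b, crossover_rate=60, bit_rate=0.05, rng=rng))
```

Passing the random generator explicitly keeps runs reproducible, which makes it easier to compare crossover variants on the same population.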

There are some problems with the field-graph approach. First, experiments have shown that this algorithm is not very robust. Second, generating each graph requires a single, fixed MEP as input, so the generation mechanism would have to be repeated many times to create a database for flexible searching, with consequent storage and processing costs. For these reasons the authors prefer genetic algorithms to the field-graph approach, especially since genetic algorithms had previously been shown to be well suited to the processing of flexible molecules.


According to Willett (2006), similarity searching using 2D fingerprints is one of the simplest virtual screening tools and is widely used in the early stages of lead-discovery programmes. In this paper the author summarizes the results of studies that sought to increase the effectiveness of current systems for similarity-based virtual screening. He found that, if there is no specific information about the sizes of the molecules required for testing, the Tanimoto coefficient is the coefficient of choice for computing molecular similarities.

Willett states that there are two main types of virtual screening systems. The first is the popular structure-based approach, for example docking and de novo design, which can be used when the 3D structure of the biological target is available. The second comprises the ligand-based approaches, which are applicable in the absence of such structural information. These include pharmacophore methods, which involve identifying the pharmacophoric pattern common to a set of known actives and using that pattern in a subsequent 3D substructure search; similarity methods, on which the author focuses; and machine learning methods, in which a classification rule is developed from a training set containing known active and known inactive molecules.

The basic idea underlying similarity-based virtual screening is that molecules that are structurally similar are likely to have similar properties. The strategy therefore involves computing the similarity between each molecule in a database and a known reference structure, ranking the database molecules in decreasing order of the computed similarities, and then carrying out real screening on just the top-ranked database molecules.

He notes that the measure used to quantify the degree of resemblance between the reference structure and each of the structures in the database is at the heart of any system for similarity-based virtual screening. A similarity measure involves three components: a method of representing the molecule so that it can be compared with others (the 2D fingerprint is the structural representation on which the author focuses), a weighting scheme that assigns differing degrees of importance to the various components of these representations, and a function that computes the degree of resemblance between two structural representations.

The similarity coefficient used for comparing fingerprints is the Tanimoto coefficient. Suppose that two molecules have a and b bits set in their fragment bit-strings, with c of these bits set in both fingerprints. The Tanimoto coefficient is then defined as $c / (a + b - c)$.
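
The definition above, together with the ranking step of similarity-based virtual screening, can be illustrated with a short Python sketch. The representation of a fingerprint as a set of set-bit positions, the function names, the toy database, and the convention of returning 0.0 for two empty fingerprints are assumptions made for the example, not details taken from Willett's paper.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient c / (a + b - c) for two fingerprints.

    Each fingerprint is given as the set of positions whose bit is set.
    """
    a = len(fp_a)
    b = len(fp_b)
    c = len(fp_a & fp_b)
    if a + b - c == 0:
        return 0.0  # ASSUMPTION: convention chosen here for two empty fingerprints
    return c / (a + b - c)


def rank_database(reference, database, top_n):
    """Names of the top_n database molecules, most similar to the reference first."""
    ranked = sorted(
        database,
        key=lambda name: tanimoto(reference, database[name]),
        reverse=True,
    )
    return ranked[:top_n]


reference = {1, 2, 3}
database = {            # hypothetical fingerprints
    "mol_a": {1, 2, 3},
    "mol_b": {2, 3, 4},
    "mol_c": {7},
}
print(tanimoto(reference, database["mol_b"]))
print(rank_database(reference, database, top_n=2))
```

For `mol_b`, a = 3, b = 3 and c = 2, so the coefficient is 2 / (3 + 3 - 2) = 0.5.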


According to Yadgary, Amir and Unger (1998), computing the three-dimensional structure of a protein from its amino acid sequence is a way to obtain the physical and chemical properties of the protein molecule, because these properties depend on its three-dimensional structure, and the structure of a protein is the key to gaining insight into its function. Today, protein structures are commonly determined by X-ray crystallography and NMR spectroscopy. Calculating the structure of a protein directly from its sequence is not possible, since it requires minimizing a function of thousands of variables with constants that have not been accurately determined. The authors therefore discuss another approach, threading. Threading predicts the three-dimensional fold of a protein sequence by recognizing a known structure with which the sequence might be compatible. A given sequence is threaded through a given target structure by searching for a sequence-structure alignment that places the sequence residues in preferred structural positions. The authors suggest using genetic algorithms to obtain the optimal sequence-structure alignment. The method predicts protein structure by threading the sequence of one protein through the known structure of another, and it has proved itself in recognizing the similarity of a sequence to a protein of known structure even in the absence of detectable sequence similarity.

Designing a threading procedure requires an algorithm to align the residues of the sequence with a structure and a fitness function to evaluate the quality of the alignment. Knowledge-based potentials and energy functions are obtained from a database of known protein structures and depend on the analysis of known three-dimensional protein structures using statistical physics. According to the authors, the first step in using a genetic algorithm is to represent the solutions as strings, which are maintained as a population and allowed to interact. The interaction takes place through genetic operators such as mutation, crossover and replication. They represented each individual in the population as a string based on the alphabet {0, 1}: a "1" represents a residue of the sequence aligned to that position of the structure, and a "0" represents no residue. In addition, a number N greater than 1 represents residues that are not aligned to the structure position, with N-1 residues skipped. After operators such as crossover, mutation and replication have been applied, the sum of the numbers in each string must equal the length of the threaded sequence, and the length of the string must equal the length of the structure. A string with a lower normalized energy value has a higher fitness value and therefore a greater chance of taking part in the genetic operators and contributing to the next generation.

They performed mutation by randomly increasing the value of one number and offsetting the increase by decreasing other positions by the same amount. Crossover was performed by choosing a position at random and building two new offspring by concatenating the prefix of one parent, up to the chosen position, with the suffix of the other. One problem the authors encountered is early convergence of the population to a single high-fitness individual. This is common with genetic algorithms and makes the genetic process meaningless if it continues, because generating new populations only reproduces the same population. The common solution is to maintain high diversity in the population, either by temporarily using a high mutation rate for a number of generations and then decreasing it again, or by preventing the creation of solutions that already appear frequently. To avoid early convergence, the authors used tree techniques: a tree data structure makes the string comparisons needed to prevent redundant solutions. In conclusion, they found that a higher mutation rate helps to achieve good results, but that too high a rate does not provide enough stability in the population to promote good solutions. Moreover, since rigid limitations are a reason for failure to find good alignments, the representation used by the genetic algorithm threader was designed to allow full freedom in choosing positions for insertions and deletions.
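
The constraints on the threading strings and the sum-preserving mutation described above can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function names are hypothetical, and taking the whole offset from a single other position (rather than spreading it over several) is a simplifying assumption.

```python
import random


def is_valid(alignment, sequence_length, structure_length):
    """Check the two constraints: one number per structure position,
    and the numbers sum to the length of the threaded sequence."""
    return (
        len(alignment) == structure_length
        and all(n >= 0 for n in alignment)
        and sum(alignment) == sequence_length
    )


def mutate(alignment, rng):
    """Increase one position by a random amount and decrease another by the same amount.

    ASSUMPTION: a single donor position supplies the whole offset.
    """
    result = list(alignment)
    donors = [i for i, n in enumerate(result) if n > 0]
    if not donors or len(result) < 2:
        return result
    donor = rng.choice(donors)
    receiver = rng.choice([i for i in range(len(result)) if i != donor])
    amount = rng.randint(1, result[donor])
    result[donor] -= amount
    result[receiver] += amount
    return result


rng = random.Random(7)
alignment = [1, 1, 0, 2, 1]   # hypothetical string: structure length 5, sequence length 5
mutated = mutate(alignment, rng)
print(mutated, is_valid(mutated, sequence_length=5, structure_length=5))
```

Because the mutation moves residues between positions instead of creating or deleting them, every mutated string still satisfies both constraints without a separate repair step.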


In his research on the docking of flexible ligands into protein active sites, Willett (ND) encoded the conformation of a molecule as a real-valued or integer-valued chromosome in which the i-th element represents the torsion angle of the i-th rotatable bond. The fitness function is the energy of the specified conformation, calculated by one of several standard molecular-modelling packages, and the algorithm identifies the torsion angles that minimize the calculated energy. The author describes a study of 72 molecules with different structures chosen from the Cambridge Structural Database, each containing between one and twelve rotatable bonds. The number of individuals in the population was ten times the number of torsion angles in the molecule, and six bits were used to represent each torsion angle. Low-energy conformations play a key role in determining the physical and biological properties of a molecule, and there is much interest in ascertaining the stable conformations that flexible molecules can adopt.

For docking, each individual consists of four strings: two for mapping and two for the torsion angles of the rotatable bonds (one for the ligand and one for the protein active site). He used a routine that determines the hydrogen-bonding energy. The inputs to the genetic algorithm are the size and location of the ligand that is docked into the receptor site and the size and location of the receptor site itself. The outputs are the protein and ligand conformations associated with the fittest individual in the last population.

The author found that systematic search is the most common approach to conformational analysis. Each torsion angle is rotated systematically by some fixed increment, but the problem with this approach is the sheer number of conformations that need to be examined. For instance, a systematic search with a 30° torsion increment for a molecule containing twelve rotatable bonds would require about $9 \times 10^{12}$ energy calculations. The approach is therefore feasible only if a molecule has very few rotatable bonds. Willett ran the genetic algorithm with a population of randomly generated chromosomes as input, for a maximum of 10000 energy evaluations. Improvement was observed after about 5000 evaluations. He also used another approach, a SYBYL routine, and found that it was faster than the genetic algorithm for molecules containing small numbers of rotatable bonds, whereas the genetic algorithm was faster for molecules containing more than 7 or 8 bonds, and the difference increased as the number of rotatable bonds increased. The genetic algorithm therefore provides an effective way of exploring the conformational space of flexible molecules, and it works at sufficient speed to allow the conformational analysis of highly flexible molecules that are too time-consuming to investigate with alternative conformational-searching algorithms.

He notes that rational approaches to drug design, which aim to find the molecule that is complementary to the receptor site, make use of NMR and X-ray information about the binding-site geometry of a protein. These approaches assume that the ligand molecules are completely rigid and that a molecule's suitability as a ligand depends on its steric complementarity with the site. They do not take into account the ability of the ligand to displace water and form hydrogen bonds with the active site. The genetic algorithm seeks to overcome these two limitations.


According to Wild and Willett (1995), molecular electrostatic potentials are a very good basis for calculating intermolecular similarity in databases of three-dimensional chemical structures; they used the electron densities to measure the similarity between two molecules. They used an equation called the Carbo index, which is based on the cosine coefficient and is a good measure to rely on when using a genetic algorithm approach. For example, given an initial lead in a drug or pesticide discovery program, similarity searching involves matching a target molecule of interest against all of the molecules in a database to find those that are most similar to the target. The authors believe that genetic algorithms provide both an effective and an efficient mechanism for investigating a range of complex chemical matching problems, such as the generation of near-maximal common subgraphs from pairs of large two-dimensional structures, the docking of flexible ligands into protein active sites, and flexible three-dimensional substructures. They used the genetic algorithm as follows:

The genetic algorithm acts as a search mechanism to identify a combination of
translations and rotations which will align one molecule with another. Every chromosome
has five components, three for translation and two for rotation. For rotation they use
two planes, each encoded as an eight-bit binary number, which allows 256 possible
rotation values. For translation they also use binary numbers, limited to the maximum
permitted range. The chromosomes are initialized randomly and then decoded by applying
the indicated translation and rotation to the three-dimensional coordinates of the
molecule being aligned. The fitness function depends on a Gaussian similarity
calculation: the resulting coordinates are passed to this function to be evaluated.
They found that the best results were obtained with uniform crossover, which maintains
diversity in the population, and that a crossover rate of 20% gave the best result for
this problem. They achieved diversity in the initial population by ensuring that all of
the individuals had a large Hamming distance between them, where the Hamming distance
between two binary individuals is the number of corresponding bits that differ between
the two strings. This technique was found to prevent early convergence. In each
iteration the least fit individuals are discarded in favour of fitter ones. They used
crossover and mutation to produce the new generation; they tried single-point,
two-point and uniform crossover, and found uniform crossover to be the best. They also
used a simple bit-flip mutation with some probability (1/i). In this research they also
used roulette-wheel selection to select fit individuals.
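
Two of the ingredients described above, the Hamming distance and roulette-wheel
selection, can be sketched in a few lines. This is only an illustrative sketch and not
the code of Wild and Willett: the function names are hypothetical, individuals are
assumed to be bit strings, and fitness values are assumed to be non-negative.

```python
import random


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def roulette_wheel_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness.

    Assumes every fitness is non-negative and at least one is positive.
    """
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("at least one fitness must be positive")
    pick = rng.random() * total  # a point on the wheel, in [0, total)
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if pick < running:
            return individual
    return population[-1]  # guard against floating-point round-off


rng = random.Random(42)  # fixed seed so the sketch is repeatable
parents = ["00001111", "11110000", "10101010"]
chosen = roulette_wheel_select(parents, [1.0, 3.0, 6.0], rng)
```

An individual with fitness 6 is six times as likely to be chosen as one with fitness 1,
so selection favours fit individuals while remaining stochastic.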

They did not use Gray coding, which is a way of representing binary strings in which
incrementing or decrementing a number always requires a change of only one bit. For
example, in the standard binary representation the number 3 is 011 and the number 4 is
100, so for a random mutation to go from 3 to 4 all 3 bits must be flipped. In Gray
code, 3 is represented by 010 and 4 by 110, so changing from 3 to 4 requires only the
first bit to be flipped.




We can convert between binary and Gray code. Given a binary number

$b = (b_1, b_2, \ldots, b_m)$

we can convert it to a Gray-coded representation

$g = (g_1, g_2, \ldots, g_m)$

Changing from binary to Gray coding affects the distance between different solutions
and the fitness landscape explored by operators such as mutation.
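
The conversion can be illustrated with the standard reflected Gray code. The sketch
below is an assumption about which Gray code is meant (the text does not give the
conversion rule), but it reproduces the 3 and 4 example above; the function names are
illustrative.

```python
def binary_to_gray(n: int) -> int:
    """Reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def gray_to_binary(g: int) -> int:
    """Inverse of binary_to_gray."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


# 3 and 4 differ in three bits in plain binary but in one bit in Gray code.
gray_three = format(binary_to_gray(3), "03b")
gray_four = format(binary_to_gray(4), "03b")
```

With this coding any two consecutive integers differ in exactly one bit, so a single
bit-flip mutation can always move a value to its neighbour.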

During the experiments in this paper, they demonstrated that the genetic algorithm
leads to similarities that are comparable in effectiveness for database searching to
those resulting from an approach based on field graphs and superior to those resulting
from a bit-climber. Moreover, the genetic algorithm leads to more robust alignments
than a simplex optimization procedure does. The authors found some weaknesses in the
field-graph approach, which is far more complex and time consuming owing to the need to
generate the graphs from the electrostatic potential grid before the search can be
carried out. Also, for some molecules the field graph does not contain sufficient nodes
to enable those molecules to be aligned with a target molecule.

They used four strings to represent the chromosome. Firstly, two strings use a binary
representation and two strings use an integer representation. The first binary string
represents the ligand and the second the protein, where the angle of rotation of each
rotatable bond occupies one byte of the string. Secondly, hydrogen bonds may form
between the protein active site and the ligand, and these possible bonds and their
mapping are encoded in the integer strings. For example, the first integer string
encodes a mapping from hydrogen atoms of the protein to lone pairs of the ligand and
the second string encodes the inverse mapping.


Payne and Glen (1993) used a genetic algorithm as a way to optimize the fit of flexible
molecules to a set of restrictions. The restrictions may be shape similarity, charge
distribution or intermolecular distance constraints. The problem, when X-ray
crystallographic analysis is used to determine the structure of an active site, is how
to dock the ligand into this active site. Molecular modelling techniques are used to
compare dissimilar molecules and to generate conformations. In addition, there are
numerical measures such as atom charge distribution, comparison of electron densities,
dipole moments, volume overlaps, and electrostatic and lipophilicity potentials. The
algorithm here receives the current molecule and converts it from phenotype into
genotype. In particular, it takes the coordinates of all atoms belonging to the
molecule and converts them to a string of bits, and this string represents one
individual in the first population. It then applies the genetic algorithm operators,
which are crossover, selection and mutation, to obtain a new generation.

Several steps have to be followed to carry out the algorithm:

Firstly, they have to find a good way to represent the problem. Secondly, they have to
use a distance method to do the comparison. Thirdly, when they use X-ray
crystallographic analysis to determine the structure of an active site, it is important
to know how to dock the ligand into the active site of the protein. Finally, they have
to define a set of restrictions against which to compare and fit the molecule.

The authors used binary strings to represent each individual. They broke the string
down into four segments. The first segment represents the translation of the molecule
along the three axes x, y and z. The second segment represents the rotation of the
whole molecule around all the axes. The third segment represents rotations around each
rotatable stem (or bond). The fourth represents the conformation of rings.


The methodology of Richmond's research is to find an alignment algorithm that
superimposes atoms in one molecule onto similar atoms belonging to a different
molecule. To do so, Richmond et al (2004) followed several steps:



Step 1 - identify a set of candidate equivalent atoms which belong to different
molecules and are similar in terms of local geometry.

Step 2 - filter the set of candidate equivalent atoms by discarding the pairs which
cannot be overlaid by any alignment transformation.

Step 3 - calculate the alignment transformation which places one molecule over the
other so as to overlay the pairs in the filtered set of atom equivalences.

Step 4 - repeat the alignment and calculate a new set of atom equivalences, comparing
the atoms to find those whose distance is less than a user-defined threshold and using
them for the next alignment.

The procedure to match two 2D shapes:

Firstly, identify the corresponding points which belong to shape A and shape B.
Secondly, calculate the morphing transformation that maps the points on the first shape
to their corresponding points on the second shape. Finally, determine the similarity
between the two shapes by calculating the sum of the matching errors of the
corresponding points.

Each shape is represented by a discrete set of points sampled from the external or
internal contours of the shape using an edge detector. In fact, the more points there
are, the more accurate the description of the shape.
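
The final step, summing the matching errors, can be sketched as follows. The text does
not say how the error of a single pair of points is measured, so this sketch assumes
the Euclidean distance and assumes that the morphing transformation has already been
applied to the first shape; the function name is illustrative.

```python
import math


def matching_error(points_a, points_b):
    """Sum of Euclidean distances between corresponding points of two shapes.

    points_a[i] is assumed to correspond to points_b[i].
    """
    if len(points_a) != len(points_b):
        raise ValueError("both shapes need the same number of points")
    return sum(math.dist(p, q) for p, q in zip(points_a, points_b))


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
shifted = [(x + 3.0, y + 4.0) for x, y in square]  # same square moved by (3, 4)
error = matching_error(square, shifted)  # each of the 4 points is 5 units away
```

A smaller sum means a better match; two identical, perfectly aligned shapes give zero.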

Over recent years the folding problem has become one of the most challenging problems
in computational chemistry, especially the mechanism of folding. Genetic algorithms
have become a common way to search the space in this field. Each possible solution is
represented by an encoded individual or string, which changes it from phenotype into
genotype. For instance, to represent the conformation of a molecule, an individual is
constructed from a number of real numbers, where each real number represents the angle
of rotation around a flexible bond in the molecule. Here the genetic algorithm begins
with the population, which is a number of individuals that have been created randomly.
While the algorithm runs, a fitness function evaluates each individual to see whether
it has high or low fitness and so decide whether it will participate in the next
generation. The individuals in the population which have high fitness will participate
in the next generation and the individuals with low fitness will not (Jones ND).


Daeyaert et al (2005) describe how to use genetic algorithms to find the similarities
between two molecules in space. They mention that structure-based drug design
methodologies require a proper alignment of the molecules. The methodologies they
mention are Comparative Molecular Field Analysis (CoMFA) and Comparative Molecular
Similarity Indices Analysis (CoMSIA). The authors used multi-objective function
optimization to superimpose flexible source molecules onto rigid target molecules. They
depend on two things: the similarity score between the source and target molecules and
the conformational strain of the source molecule; the first has to be maximized and the
latter has to be minimized. The aim of this function is to optimize the least-squares
distance between the target and source molecules. To rank the final individuals, they
used a fast non-dominated sorting algorithm. They used elitism to ensure the survival
of solutions which have high fitness, and several operators to provide diversity in the
population. Each individual or vector in this search represents the alleles by real
numbers: the first three positions represent a translation of the source molecule along
the x, y and z axes, positions 4 to 6 represent the Euler angles determining the
orientation of the source molecule, and the rest of the individual represents the
values of the torsion angle of each rotatable stem (bond) in the source molecule. They
mention that before beginning the genetic algorithm, the coordinates of the target
molecules have to be centred.
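
The idea of ranking by non-domination with these two objectives can be sketched as
below. This is a naive illustration that only extracts the first non-dominated front;
it is not the fast non-dominated sorting algorithm used by the authors, and the
(similarity, strain) pairs and function names are hypothetical.

```python
def dominates(a, b):
    """True if solution a is at least as good as b in both objectives and
    strictly better in one. Each solution is a (similarity, strain) pair:
    similarity is maximized, strain is minimized."""
    sim_a, strain_a = a
    sim_b, strain_b = b
    no_worse = sim_a >= sim_b and strain_a <= strain_b
    strictly_better = sim_a > sim_b or strain_a < strain_b
    return no_worse and strictly_better


def non_dominated(solutions):
    """Solutions that are not dominated by any other solution."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]


# Hypothetical (similarity score, conformational strain) values.
candidates = [(0.9, 5.0), (0.8, 2.0), (0.7, 3.0), (0.9, 6.0)]
front = non_dominated(candidates)
```

Here (0.7, 3.0) is dominated by (0.8, 2.0) and (0.9, 6.0) by (0.9, 5.0), so only the
first two candidates remain as trade-offs between similarity and strain.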


Xu et al (2003) mentioned that the physical properties and biological behaviour of a
molecule usually depend on its accessible, low-energy conformations; therefore, fast
and reliable computational methods for producing conformations are extremely valuable.
They used an algorithm which produces molecular conformations that are compatible with
a set of geometric restrictions. These restrictions include interatomic distance bounds
which are derived from the molecular connectivity table. In this work they used the
Merck Molecular Force Field to calculate the potential energies. They mentioned that
the main advantage of this work is that it yields a greater diversity of conformations.
The authors focused on several enhancements to generate better initial geometries and
to detect and eliminate conformations which are likely to lead to the same local
minima, as well as on the use of this technique for protein structure prediction,
pharmacophore modelling and ligand docking.

Many problems have been solved successfully using distance geometry, such as NMR
structure determination, conformational analysis, ligand docking and protein structure
prediction. In this work the volume and distance constraints reduce the number of
conformations accessible to the molecule and hence the search space. The general
distance geometry method is a self-organizing algorithm that works like a fitness
function: it tries to minimize an error function that measures the violation of the
geometric restrictions.


According to Nicholas E. Jewell (2001), 3D QSAR methods such as COMPASS, CoMSIA, CoMFA
and HASL are essential to the design of bioactive molecules. It is very important for
3D QSAR methods to obtain an alignment of the molecules in the dataset as an input for
the calculation of the structural variables. The authors also described a method for
finding the optimal way to bring two molecules into alignment. They described the main
features of FBSS (field-based similarity searching) and reported a simple validation
experiment that supports the use of FBSS-based alignments in 3D QSAR analyses. They
used FBSS as the prerequisite to the 3D QSAR procedure and compared the results with
those obtained from conventional manual alignments. Their aim was to provide an
approach which is complementary to, and not a replacement for, manual alignment. This
program is essential for implementing 3D QSAR, especially the CoMSIA and CoMFA methods.


For calculating intermolecular structural similarity, many different measures have been
described. Carbo et al (1980) describe one approach which involves the use of molecular
field descriptors, and this approach was developed further by Good et al (1992). The
approach is to put the molecule at the centre of a 3D grid and calculate the value of a
molecular field, for instance the electrostatic potential of the molecule, at each
point of the 3D grid. To find the degree of similarity and difference between two
molecules, they align the corresponding grids to find the best possible fit, and they
use one of the distance methods to do that.
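
When the field of each molecule is sampled on the same grid, the Carbo index mentioned
earlier reduces to the cosine coefficient of the two lists of grid values. The sketch
below assumes the grids are already aligned and flattened into equal-length lists; the
function name and the sample values are illustrative only.

```python
import math


def carbo_index(field_a, field_b):
    """Cosine coefficient between two fields sampled on the same grid points."""
    if len(field_a) != len(field_b):
        raise ValueError("both fields must be sampled on the same grid")
    dot = sum(x * y for x, y in zip(field_a, field_b))
    norm_a = math.sqrt(sum(x * x for x in field_a))
    norm_b = math.sqrt(sum(y * y for y in field_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("a field that is zero everywhere has no defined index")
    return dot / (norm_a * norm_b)


# Hypothetical potential values at four grid points for two molecules.
similarity = carbo_index([0.2, -0.1, 0.4, 0.0], [0.4, -0.2, 0.8, 0.0])
```

The index ranges from -1 to 1 and is insensitive to the overall magnitude of the
fields, so the two fields above, which differ only by a factor of two, score about 1.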

FBSS is software which uses genetic algorithms to align molecular fields, depending on
field-based similarity measures, for similarity searching in chemical structure
databases. For each individual or chromosome the FBSS genetic algorithm encodes the
translations and rotations which are applied to a structure to align it with a target
one, and the value of the similarity coefficient obtained from the encoded alignment is
the fitness function.



3 Research Methodology

In this research I developed a new algorithm to find the optimal alignment for a group
of molecules, and I used the Java language to write the program that performs this
algorithm. This alignment is used as an input for a method called Comparative Molecular
Similarity Indices Analysis (CoMSIA), which is a 3D method to forecast and correlate
molecules' biological activities. I obtained the optimal alignment by using the genetic
algorithm mechanism to apply a transformation (translation + rotation) to each
molecule. For each transformation I compared the new figure with other databases to
know whether it is a good solution or not. For the comparison I needed a good method to
find the distance between the sample and the target; therefore, I tried to find the
best distance function for this comparison. This research will be useful in the future
for dealing with any shape, not only molecules.



4 Outline of Thesis

Chapter Two: “Principle of Genetic Algorithm and how John Holland mimicked Darwin's
theory (Evolution and Natural Selection) to invent Genetic Algorithm. Genotype and
Phenotype”.

Chapter Three: “Molecules and Drugs and how to use QSAR and its methods (Comparative
Molecular Field Analysis CoMFA and Comparative Molecular Similarity Indices Analysis
CoMSIA)”.

Chapter Four: “Description of Implementing Genetic Algorithm to align some molecules
and how to use the fitness function to obtain the optimal alignment”.

Chapter Five: “The Deduction and Future Work (Pharmacophore)”.







Chapter 2 Genetic Algorithm

1 Introduction

In this chapter I discuss Darwin's theory, which involves the mechanism of evolution
and natural selection (Darwin 1859). I then describe how John Holland used this theory
to invent the idea of the genetic algorithm. Phenotype and genotype are the hardware
and software of organisms. I state the principles, steps, elements, operations and
process of the genetic algorithm and how to use them to provide the diversity that
offers more solutions for the problem. The operations of genetic algorithms are
crossover (single point, double point and uniform crossover) and mutation (mutation
factor 1m and mutation factor 2m).

2 Darwin's Theory of Evolution - Natural Selection

There are two important things in Darwin's theory which have been mimicked in genetic
algorithms: the mechanism of evolution and natural selection. I discuss both of them
below:

2-1 Evolution

Darwin's Theory of Evolution presumes that the development of life is a slow, gradual
process that began from non-life or simple life (a simple solution in genetic
algorithms) and progresses towards more complex life (an optimal solution). In other
words, complex creatures evolve naturally over time from simpler ancestors.

The process began in the sea about three billion years ago, where complex chemical
molecules started to clump together to form microscopic blobs (cells). These cells were
the seeds of the tree of life. They had the ability to split and replicate themselves
as bacteria do, and over time they diversified into different groups. Some of these
groups remained connected together and formed chains, which are called algae. Others
collapsed in upon themselves and formed hollow balls, creating a body with an internal
cavity; these we call multi-celled organisms, and sponges are their direct descendants.
The tree of life became more complicated and diverse over time as more variation
appeared. Some of these organisms had the ability to move and developed a mouth that
opened into a gut. Meanwhile, other organisms had a rod inside their bodies which made
them stronger, and sense organs then developed around their front end. Some groups had
bodies which were divided into segments equipped with little projections on either side
which helped them move on the sea floor, and they then acquired hard, protective skins
that gave their bodies some rigidity. These creatures filled the sea with life. Roughly
450 million years ago, some of these armoured creatures came out of the water onto the
land, and here the tree of life branched into a multitude of different species that
exploited this new environment in all kinds of ways. Some of these groups developed
elongated flaps on their backs, and over many generations these eventually developed
into wings; we now call these creatures insects. Life took to the air and diversified
into many forms. At the same time, some organisms in the sea changed: the stiffening
rod in their bodies became bone, and a skull developed in front of it with a hinged jaw
that could grab and hold onto prey. These creatures grew bigger and gained the ability
to swim with power and speed, because they developed fins equipped with muscles. We
call these creatures fish, and they dominate the waters of the world. Some of these
creatures gained the ability to gulp air from the water surface, and their fleshy fins
became weight-supporting legs. About 375 million years ago, a few of these backboned
creatures followed the insects onto the land; they had wet skin and had to return to
water to lay their eggs. We call these creatures amphibians. Some of them evolved dry,
scaly skins and broke their link with water by laying eggs with watertight shells.

These creatures, the reptiles, were the ancestors of today's tortoises, lizards,
crocodiles and snakes. They grew bigger and formed the dinosaurs, which dominated the
land, but 65 million years ago a great disaster killed all of them except one branch
whose scales had developed into feathers; we now call these birds. At the same time, an
insignificant group of survivors began to increase in numbers on the ground beneath.
They differed from their competitors in that their bodies were warm and insulated with
coats of fur. These were the first mammals. They had a good chance of surviving and
spreading in the absence of other creatures, and they were lucky to have warm and
insulated bodies enabling them to be active in all places, from the tropics to the
Arctic, on land as well as in water, on grassy plains and up in the trees, and at all
times, at night as well as during the day (Information from DVD about Charles Darwin
and the Tree of Life, produced by Sacha Mirzoeff, released 2009).



2-2 Natural selection


Natural selection acts to keep and accumulate minor advantageous genetic mutations.
Suppose a member of a species developed a trait; for example, it grew wings and learned
to fly. Its offspring would inherit that feature and pass it on to their offspring. The
members of the same species with inferior traits would gradually die out, leaving only
the members with superior traits. Natural selection is the preservation of features
that enable a species to compete better in the wild. It is also similar to domestic
breeding. Over the centuries, human breeders have produced dramatic changes in domestic
animal populations by selecting individuals to breed. Breeders eliminate undesirable
features gradually over time. Similarly, natural selection eliminates inferior species
gradually over time. To explain further, here is one example. In the wild we have a
population of rabbits, some of them smart and some of them dumb, some of them fast and
some of them slow. The slower and dumber rabbits are more likely to be eaten by foxes,
whereas the smart and fast ones have more chance to survive and breed a new generation
of rabbits. Of course, some of the slower and dumber rabbits will survive, maybe
because they are lucky, but their population will be smaller than that of the smart and
fast ones. Generation by generation we will find that the smart and fast rabbits far
outnumber the other types in the wild, because there are more parents of their type.
This is what we call natural selection, of which the foxes are a part (Michalewicz
1999, p.14).

3 Phenotype and genotype in nature

Phenotypes refer to the physical parts of a living organism, such as the sum of atoms,
molecules, macromolecules, cells, structures, metabolism, energy utilization, tissues,
organs, reflexes and behaviours. They include anything that is part of the observable
structure, function or behaviour of a living organism. The phenotype of an organism is
the physical expression of the organism's genotype.

Genotype is the "internally coded, inheritable information" carried by almost all cells
of all living organisms. It is used as a “blueprint” or set of instructions for
building and maintaining a living organism. This information is written in a coded
language (the genetic code) and is encoded in the genes of an organism. These genes are
connected together into long strings called chromosomes. The genes and their settings
are referred to as the organism's genotype. Each gene and its settings represent a
specific trait of an organism, like eye colour or hair colour. For example, a hair
colour gene and its settings determine whether hair is blonde, black or auburn.
Occasionally a mutation can occur in a gene, which can result in a completely new trait
being expressed in an organism. This is rare, as a mutated gene doesn't normally affect
the development of the phenotype of an organism.

Genetic information is copied at the time of cell division or reproduction. This copied
information is passed from generation to generation and for this reason is said to be
“inheritable”. When two organisms mate to reproduce, the resulting offspring will get a
share of each organism's genes. The process is called recombination and involves the
offspring getting half its genes from one parent and half from the other. These
instructions are very important in all aspects of the life of a cell or organism. They
contain the information for many vital functions such as the formation of protein
macromolecules, and the regulation of metabolism and synthesis.

Genotype and phenotype in genetic algorithms are explained in this diagram below:


Figure 1 - Genotype and Phenotype (Michalewicz 2010)


4 Genetic Algorithms

In 1975 John Holland and his students developed genetic algorithms at the University of
Michigan. The goals of their research were to explain the adaptive processes of natural
systems and to design artificial software that retains the important mechanisms of
natural systems. This approach led to new and important discoveries in both artificial
and natural systems science.

A genetic algorithm is a computational model based on accepted theories of biological
evolution and natural selection. It is useful as a research method for solving problems
and for modelling evolutionary systems. It relies on stochastic search and diversity to
find the optimal solution to the problem, and it usually uses binary or real numbers to
encode its solutions. The mechanism of a genetic algorithm is to create an initial
population, which is a number of individuals (chromosomes), each representing one
possible solution to the problem, and then to perform a loop in which pairs of parents
are selected to undergo crossover or mutation and produce new offspring. The fitness
function evaluates each new offspring and the selection method decides whether it will
take part in the next generation. Problems whose structure is not compatible with
genetic algorithms are very difficult to solve in this way. However, the structure of
the molecule alignment problem is clear, so it is feasible to optimize it with genetic
algorithms (Michalewicz 1999).

When you look at genetic algorithms you will find that some vocabulary has been
borrowed from natural genetics, for example individuals, genotypes or structures in a
population; sometimes these individuals are called strings or chromosomes. If you
compare genetic algorithms with nature you will find that each organism in nature
carries a certain number of chromosomes; for instance, a human has 46 chromosomes.
However, in genetic algorithms each candidate solution is one chromosome (individual,
string or structure). Each chromosome in nature has a number of units which are called
genes; these units in genetic algorithms are called features.




5 Genetic algorithm analogy

The idea of constructing GAs based on the analogy to evolutionary biology requires
making a considerable mental transition, because the encoding mechanism is so different
in the two cases. The way in which genes are manipulated, combined and expressed is
very different in the biological and the genetic algorithm cases.

With GAs, there is a much greater distance between the mathematically encoded
optimization and the field of evolutionary biology from which the inspiration for the
method is derived. Consequently, the language and concepts transferred are much more
subject to reinterpretation. For example, a gene and a numerical encoding called a gene
are not the same. Reaping the benefit of the genetic analogy first requires
reinterpretation before the surprising possibilities of the analogy can be exploited
(Michalewicz 1999).


6 The structures of genetic algorithm

To perform genetic algorithms we require these components:



A way of encoding solutions to the problem as a chromosome (phenotype to genotype).

An evaluation function, which returns a rating for each chromosome given to it.

A way to initialize the population of chromosomes.

Operators that may be applied to parents when they reproduce to alter their genetic
composition; for example, the standard operators are mutation and crossover.






7 Genetic algorithm steps:




Initialize a population by a certain procedure and evaluate each individual in the
initial population.

Choose one of the genetic algorithm operators and apply it to the parents as a way to
get more diversity.

Produce new offspring by choosing one or two parents to reproduce. Although the
individuals with high fitness are favoured, the selection is stochastic.

Keep reproducing new generations until the stopping criterion is reached.
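
The steps above can be put together in a minimal loop. The sketch below is only an
illustration under stated assumptions: chromosomes are lists of bits, selection is
fitness-proportionate, the stopping criterion is a fixed number of generations, and the
fitness function must return non-negative values. The function name and default
parameter values are hypothetical.

```python
import random


def run_ga(fitness, length, pop_size=20, generations=50,
           crossover_rate=0.7, mutation_rate=0.01, seed=0):
    """Evolve bit-list chromosomes and return the best individual seen."""
    rng = random.Random(seed)
    # Step 1: initialize the population randomly and evaluate it.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):  # Step 4: fixed number of generations
        scores = [fitness(ind) for ind in population]
        weights = [s + 1e-9 for s in scores]  # avoid an all-zero wheel
        offspring = []
        while len(offspring) < pop_size:
            # Step 3: stochastic selection that favours fit parents.
            mother, father = rng.choices(population, weights=weights, k=2)
            child = mother[:]
            # Step 2: apply the operators (single-point crossover, bit flip).
            if rng.random() < crossover_rate:
                point = rng.randrange(1, length)
                child = mother[:point] + father[point:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            offspring.append(child)
        population = offspring
        best = max([best] + population, key=fitness)
    return best


# Toy problem: maximize the number of ones in a 16-bit chromosome.
solution = run_ga(sum, 16)
```

For a real problem only the fitness function and the encoding would change; the loop
itself stays the same.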




Figure 2 - Mechanism of Genetic Algorithm (Michalewicz 2010)




8 Elements of Genetic Algorithm



1- Encoding

    Binary Encoding

    Integer Encoding

    Real Encoding

    Complex Encoding

2- Initial population

3- Evaluation

4- Genetic Algorithms Operations


9 Genetic Algorithm Operations

9-1 Crossover

Crossover is a way to get more diversity by exchanging information among individuals,
creating the possibility of the right combination for better solutions (individuals).
It takes two parents (two individuals) chosen by the selection method, which itself
depends on the fitness function. It is performed by selecting a random position along
the length of the individual and swapping all the genes after this position. As a
result we get two new individuals which can participate in the next generation.

9-2 Crossover rate

The crossover rate is the probability with which the method swaps information between
two chromosomes (individuals). A good value for the crossover rate is roughly 0.7.

9-3 Types of crossover

Single Point Crossover

It is the easiest type of crossover. It is very fast, but it gives less diversity than
other types, especially when the population has similar individuals. It works by
choosing a random position on the chromosome and swapping all the genes after this
position between the two chromosomes.
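
A minimal sketch of single point crossover follows. It assumes both parents are
sequences of the same length (strings or lists) with at least two genes; the function
name is illustrative.

```python
import random


def single_point_crossover(parent_a, parent_b, rng):
    """Swap everything after one random cut point between two parents."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents need equal length of at least two genes")
    point = rng.randrange(1, len(parent_a))  # cut strictly inside the string
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


rng = random.Random(1)
first, second = single_point_crossover("00000000", "11111111", rng)
```

Because the cut point lies strictly inside the chromosome, each child always receives
genes from both parents.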



Figure 3 - Single Point Crossover (Michalewicz 2010)

N Points Crossover

In this type of crossover more than one point is chosen at random, and all elements
between these points are swapped to get two new chromosomes. It is fast and it leads to
more diversity in the next generation.
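
A sketch of crossover with several cut points is given below, assuming list
chromosomes of equal length and distinct cut points; the function name and the
`n_points` parameter are illustrative. Segments are copied alternately from the two
parents, which is equivalent to swapping the genes between successive points.

```python
import random


def n_point_crossover(parent_a, parent_b, n_points, rng):
    """Alternate segments from the two parents at n distinct cut points."""
    length = len(parent_a)
    if len(parent_b) != length or not 1 <= n_points < length:
        raise ValueError("need equal lengths and 1 <= n_points < length")
    points = sorted(rng.sample(range(1, length), n_points))
    child_a, child_b = [], []
    swap = False
    previous = 0
    for point in points + [length]:
        source_a, source_b = (parent_b, parent_a) if swap else (parent_a, parent_b)
        child_a.extend(source_a[previous:point])
        child_b.extend(source_b[previous:point])
        swap = not swap  # switch parents after every cut point
        previous = point
    return child_a, child_b


rng = random.Random(3)
first, second = n_point_crossover([0] * 10, [1] * 10, 2, rng)
```

With two cut points the first child keeps both ends of the first parent and takes the
middle segment from the second parent.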



Figure 4 - N Points Crossover (Michalewicz 2010)


Uniform Crossover

Another type of crossover is uniform crossover, where a coin toss is performed at each
position and the result of the coin toss determines whether or not an exchange of
genes takes place at that position. It works by assigning 'heads' to one parent and
'tails' to the other, flipping a coin for each gene of the first child, and giving the
second child the gene from the other parent; therefore, the inheritance is independent
of position.
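The coin-toss procedure can be sketched in Python as below. The sketch assumes a fair coin (probability 0.5) and list chromosomes of equal length; the function name `uniform_crossover` is illustrative.

```python
import random

def uniform_crossover(parent_a, parent_b, rng=random):
    """Decide gene by gene, with a fair coin, which parent each child inherits from."""
    child_a, child_b = [], []
    for gene_a, gene_b in zip(parent_a, parent_b):
        if rng.random() < 0.5:
            # Heads: the first child takes parent_a's gene, the second takes parent_b's.
            child_a.append(gene_a)
            child_b.append(gene_b)
        else:
            # Tails: the genes at this position are exchanged.
            child_a.append(gene_b)
            child_b.append(gene_a)
    return child_a, child_b

rng = random.Random(5)  # fixed seed so the demo is repeatable
print(uniform_crossover([0] * 8, [1] * 8, rng))
```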




Figure 5- Uniform Crossover (Michalewicz 2010)

9-4 Mutation

Mutation randomly changes one or more components of a chromosome. With binary
representation, this usually means flipping bits, that is, changing bits from zero to
one or vice versa. With other representations the details differ, but the principle of
mutation remains unchanged.

9-5 Mutation rate

Genes in a chromosome are randomly selected with a certain probability (Pm); this is
the chance that a bit in a chromosome will be flipped (zero becomes one, one becomes
zero). The value of Pm is usually close to 0.001.
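For a binary chromosome, mutation with rate Pm can be sketched as follows. This is a minimal illustration; the function name `bit_flip_mutation` and the parameter name `pm` are chosen here and are not taken from the source.

```python
import random

def bit_flip_mutation(chromosome, pm=0.001, rng=random):
    """Flip each bit independently with probability pm and return a new chromosome."""
    return [1 - bit if rng.random() < pm else bit for bit in chromosome]

rng = random.Random(1)  # fixed seed so the demo is repeatable
chromosome = [0, 1] * 500  # 1000 bits
mutated = bit_flip_mutation(chromosome, pm=0.001, rng=rng)
print(sum(1 for old, new in zip(chromosome, mutated) if old != new))  # number of flipped bits
```

With Pm = 0.001, about one bit in a thousand is flipped on average, so most chromosomes pass through mutation unchanged.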




Figure 6- Mutation (Michalewicz 2010)

9-6 Types of Mutations

Mutation factor (1m)

In this mechanism the mutation affects one gene only, and the value of that gene is
changed to an entirely new value; this factor therefore allows new values to enter the
chromosome.

Mutation factor (2m)

In this method an existing value is swapped with another existing value. The
characteristic of this factor is that it does not allow a new value to enter the
chromosome; therefore, it preserves the gene values in the chromosome.
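The two mutation factors can be contrasted in a short Python sketch. The names `mutate_1m` and `mutate_2m` and the `allowed_values` parameter are hypothetical. As an assumption, "new value" is interpreted here as any allowed value different from the gene's current value, and `allowed_values` must contain at least two distinct values.

```python
import random

def mutate_1m(chromosome, allowed_values, rng=random):
    """1m: replace one randomly chosen gene with a different value from allowed_values."""
    mutated = list(chromosome)
    index = rng.randrange(len(mutated))
    # Exclude the current value so the gene really changes.
    choices = [value for value in allowed_values if value != mutated[index]]
    mutated[index] = rng.choice(choices)
    return mutated

def mutate_2m(chromosome, rng=random):
    """2m: swap the genes at two randomly chosen positions (no new value is introduced)."""
    mutated = list(chromosome)
    i, j = rng.sample(range(len(mutated)), 2)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

rng = random.Random(0)  # fixed seed so the demo is repeatable
genes = [3, 1, 4, 1, 5]
print(mutate_1m(genes, range(10), rng))  # one gene replaced by another allowed value
print(mutate_2m(genes, rng))             # same multiset of values, two positions swapped
```

The 2m variant only permutes the chromosome, which is why it preserves the gene values, whereas 1m is the only one of the two that can bring a value into the chromosome that was not there before.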


Figure 7- Mutation Factor 2m (Michalewicz 2010)

10 Conclusion

The genetic algorithm is a method invented by John Holland. It mimics Darwin's theory,
that is, the mechanism of evolution and natural selection. In this chapter we
summarised the meaning of genotype and phenotype and how the features of an organism
are inherited from generation to generation. The operations of the genetic algorithm,
namely crossover and mutation, have also been discussed.



Chapter 3

Molecules and Drugs

1 Introduction

Most drugs used nowadays in human therapy interact with certain macromolecular
targets. They block or activate the activity of these molecules by binding to them. A
molecule is an electrically neutral group of at least two atoms held together by
covalent chemical bonds, where an atom is the basic unit of a molecule, consisting of
a central nucleus surrounded by a cloud of negatively charged electrons.

2 Molecule

A molecule may consist of atoms of different elements, as with water (H2O), or of a
single chemical element, as with oxygen (O2). Generally, atoms which are connected by
non-covalent bonds such as hydrogen bonds or ionic bonds are not considered single
molecules. The science of molecules is called molecular chemistry or molecular
physics, depending on the focus. Molecular physics deals with the laws governing their
structure and properties, while molecular chemistry deals with the laws governing the
interactions between molecules that result in the formation and breakage of chemical
bonds. Very reactive species of molecules are called unstable molecules (Brown 2003).

3 Small molecules

In the field of pharmacology, the term small molecule is usually restricted to a
molecule that binds with high affinity to a biopolymer such as a protein, nucleic
acid, or polysaccharide and in addition alters the activity or function of the
biopolymer. In pharmacology and biochemistry, a small molecule is a low molecular
weight organic compound which is, by definition, not a polymer. Small molecules can
have a variety of biological functions, serving as cell signalling molecules, as drugs
in medicine, as tools in molecular biology, as pesticides in farming, and in many
other roles. These compounds can be artificial (such as antiviral drugs) or natural
(such as secondary metabolites); they may have a beneficial effect against a disease
(such as drugs) or may be detrimental (such as teratogens and carcinogens) (Barnum
1991).


4 Drugs

Drugs are usually small molecules with roughly 50 atoms. When a drug binds to a
protein in the proper way, it increases the activity of the protein. When the drug
binds to the active site of the molecule, it inhibits the key molecule; some
approaches therefore cause a key molecule to stop functioning in an attempt to reduce
the functioning of the pathway in the diseased state. However, to avoid side effects,
drugs should be designed so that they do not affect other molecules that may be
similar in appearance to the target molecule. In the most basic sense, a drug is an
organic small molecule which inhibits or activates the function of a biomolecule such
as a protein, and as a result provides useful therapy to the patient (Lock 2007, p.1).

5 Quantitative structure-activity relationship

According to Patani and Lavoie (1996), a quantitative structure-activity relationship
(QSAR) is a process by which chemical structure is quantitatively correlated with a
process such as biological activity or chemical reactivity. It is sometimes called a
quantitative structure-property relationship (QSPR). Biological activity can be
expressed quantitatively, for example as the concentration of a substance required to
give a certain biological response. In addition, when physicochemical properties or
structures can be expressed by numbers, a mathematical relationship, or quantitative
structure-activity relationship, can be formed between the two. The mathematical
expression can then be used to predict the biological response of other chemical
structures.
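The idea of fitting a mathematical relationship and then using it for prediction can be illustrated with the simplest possible case: one numerical descriptor and a straight-line fit. The sketch below is only an illustration. The numbers are made up and are not measured data, the function name `fit_line` is hypothetical, and real QSAR models normally use many descriptors and more elaborate statistics.

```python
def fit_line(xs, ys):
    """Ordinary least squares for activity = slope * descriptor + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up values for illustration only (not experimental data).
descriptor = [1.0, 2.0, 3.0, 4.0]  # a hypothetical physicochemical descriptor
activity = [2.5, 3.0, 3.5, 4.0]    # a hypothetical numerical biological activity

slope, intercept = fit_line(descriptor, activity)
# Use the fitted relationship to predict the activity of an unseen structure.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```

Once the relationship has been fitted on compounds with known activity, the activity of a new compound is predicted from its descriptor value alone.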

3D-QSAR is an application of force field calculations that requires three-dimensional
structures, for example structures based on protein crystallography or molecule
superimposition. It relies on computed potentials instead of experimental constants.
It uses the shape of the molecule and the electrostatic fields, based on the energy
function which is applied (Leach and Andrew 2001).



6 Quality of QSAR models

QSAR is a predictive model derived by applying statistical tools that correlate the
biological activity of chemicals (such as desirable therapeutic effects and
undesirable side effects) with descriptors representative of molecular structure. It
is applied in many disciplines, for instance toxicity prediction, regulatory
decisions, and risk assessment (Tong et al 2005), as well as lead optimization and
drug discovery (Dearden 2003).

The quality of a QSAR model depends on the choice of descriptors, the statistical
methods and the quality of the biological data. The model must be capable of making
accurate and reliable predictions of the biological activities of new compounds (Wold
and Eriksson 1995). Proper validation and evaluation of the predictive power is an
important component of all quantitative structure-activity relationship (QSAR) models
(Radhika, Kanth and Vijjulatha 2010, p. S76).


Obtaining a successful QSAR model depends on the accuracy of the input data, the
selection of appropriate descriptors and statistical tools, and the validation of the
developed model (Roy 2007). According to Lionard (2006), validation is the procedure
by which the reliability and relevance of a process are established for a precise
purpose.

According to Doytchinova and Flower (2002, p.536) 3D QSAR
methods are attractive
because of their combination of an understandable molecular description, rigorous
statistical analysis, and an unambiguous graphical display of the results.

7 CoMFA and CoMSIA

The methodologies of Comparative Molecular Field Analysis (CoMFA) and Comparative
Molecular Similarity Indices Analysis (CoMSIA) provide all the information that is
necessary for understanding the biological properties of aligned molecules by
obtaining a suitable sampling of the steric, electrostatic and hydrogen-bond donor
fields around them (Radhika, Kanth and Vijjulatha 2010, p. S76).

According to Fabian and Timofei (1996, p. 155), the CoMFA method has become a
powerful tool for obtaining QSAR models. The CoMFA methodology assumes that
differences in molecular biological activity are often related to differences in the
magnitudes of the molecular fields surrounding the receptor ligands investigated
(Shagufta et al 2006, p.106).


According to Doytchinova and Flower (2002, p.536), CoMSIA methods use fields based on
similarity indices that describe similarities and differences between ligands and
correlate them with changes in binding affinity. They also mention that the CoMSIA
fields cover the most important contributions responsible for binding affinity:
steric, electrostatic, hydrophobic, and hydrogen-bond donor and acceptor properties.

CoMSIA is an alternative to CoMFA for performing 3D QSAR. Molecular similarity is
compared in terms of similarity indices. In addition to the steric and electrostatic
fields used in CoMFA, the CoMSIA method defines explicit hydrophobic and hydrogen-bond
donor and acceptor descriptors. The main purpose of CoMSIA is to partition the
different properties into the various locations where they play an important role in
determining the biological activity. The most important parameter in optimizing
CoMSIA performance is how the five properties are combined in a CoMSIA model
(Shagufta et al 2006, p.106).
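For orientation, the similarity index at a grid point is commonly written in the CoMSIA literature in the following form. It is reproduced here as background under that assumption; it is not taken from this thesis, and the symbols are introduced only for this illustration.

```latex
% Similarity index of molecule j at grid point q for physicochemical property k
A_{F,k}^{q}(j) = -\sum_{i=1}^{n} w_{\mathrm{probe},k} \, w_{ik} \, e^{-\alpha r_{iq}^{2}}
% i                    : index over the n atoms of molecule j
% w_{ik}               : value of property k for atom i
% w_{\mathrm{probe},k} : value of property k for the probe atom
% r_{iq}               : distance between atom i and the probe at grid point q
% \alpha               : attenuation factor controlling how quickly the contribution decays
```

The Gaussian term means that every atom contributes to every grid point, with a contribution that decays smoothly with distance.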


8 Conclusion

This chapter has summarised molecular structure and some types of molecules. We have
described how drugs work to block or activate the function of other molecules and how
they are used as therapy for human beings. We have explained the mechanism of the
quantitative structure-activity relationship (QSAR) and two of its methods:
Comparative Molecular Field Analysis (CoMFA) and Comparative Molecular Similarity
Indices Analysis (CoMSIA).







Chapter 4


Research Work

1 Introduction

In this chapter I summarise my work, which utilizes a genetic algorithm to find the
optimal alignment for a group of molecules by using grid points, which are the way the
molecule is represented for the computer. I used the Euclidean distance as a tool to
find the difference or the similarity between two molecules.

2 Molecular similarity

Proposing a new method to improve drugs is an extremely challenging but highly
rewarding task, which explains the current plethora of approaches. Molecular
similarity measures are very important in the development of new medicines and
agrochemicals. I used a new similarity measure to calculate the similarity between
molecules from the same family, the steroid family. I utilized this measure to
optimize the alignment of these molecules, with one molecule as the target molecule
and the others as sample molecules. The method that I used depends on the Euclidean
distance to quantify the difference between the sample and target molecules. I used a
grid with a large number of points; each point is affected by the power (field) which
comes from each atom in a molecule. This approach will enable predictions in medically
related QSAR. In a chemical environment, the chemical behaviour of a molecule (for
example reactivity, ligand docking, and acidity) can be predicted from its structure,
without the need to understand the often extremely complex details of the molecule's
action in that environment. This means that we can use the data on one molecule's
action to predict the action of another, closely related molecule merely by comparing
how similar they are. This is the basis of molecular similarity in a chemical
environment.
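The grid-and-distance idea can be sketched as follows. This is a minimal illustration under explicit assumptions, not the implementation used in this work: the Gaussian decay with factor `alpha` is borrowed from CoMSIA-style similarity indices, each atom carries a single hypothetical weight, and the names `make_grid`, `field_on_grid` and `euclidean_distance` are chosen for the example.

```python
import math
from itertools import product

def make_grid(lower, upper, step):
    """Regular 3D lattice of points covering the cube [lower, upper]^3."""
    count = int(round((upper - lower) / step)) + 1
    axis = [lower + i * step for i in range(count)]
    return list(product(axis, repeat=3))

def field_on_grid(atoms, grid, alpha=0.3):
    """Value at each grid point: sum of atom weights attenuated by squared distance.

    The Gaussian form exp(-alpha * r^2) is an assumption made for this sketch.
    """
    values = []
    for gx, gy, gz in grid:
        total = 0.0
        for (ax, ay, az), weight in atoms:
            r2 = (gx - ax) ** 2 + (gy - ay) ** 2 + (gz - az) ** 2
            total += weight * math.exp(-alpha * r2)
        values.append(total)
    return values

def euclidean_distance(field_a, field_b):
    """Euclidean distance between two fields sampled on the same grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(field_a, field_b)))

grid = make_grid(-2.0, 2.0, 1.0)
# Hypothetical two-atom molecules given as ((x, y, z), weight) pairs.
target = [((0.0, 0.0, 0.0), 1.0), ((1.0, 0.0, 0.0), 1.0)]
sample_aligned = list(target)
sample_shifted = [((0.0, 1.0, 0.0), 1.0), ((1.0, 1.0, 0.0), 1.0)]

target_field = field_on_grid(target, grid)
print(euclidean_distance(target_field, field_on_grid(sample_aligned, grid)))  # identical fields
print(euclidean_distance(target_field, field_on_grid(sample_shifted, grid)))  # misaligned sample
```

A genetic algorithm could then treat the position and orientation of the sample molecule as the chromosome and use this distance as the quantity to minimise, so that a smaller distance means a better alignment with the target.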

3 Quantum Molecular Similarity Measures


The development of effective and efficient techniques for three-dimensional
substructure searching has supported the development of analogous techniques for
three-dimensional similarity searching, where the aim is to identify those molecules
in a database that are most similar to a user-defined target structure, using some
quantitative measure of intermolecular structural similarity. There are many ways to
measure the quantum similarity; these encompass algorithms for clustering 2D
structures, molecular surface matching, similarity searching through 3D databases,
shape-group methods to describe the topology of molecular shape, CoMFA (Comparative
Molecular Field Analysis), and shape-graph descriptions (Thorner et al 1996, p. 900).
In this approach I used Comparative Molecular Similarity Indices Analysis (CoMSIA) as
a method to measure the similarity between two molecules.

In this research, similarity measures based on the molecular X-ray powers have been
used to quantify the degree of resemblance between pairs of rigid three-dimensional
molecules. This research discussed the effect of including molecular flexibility on
the similarities that are calculated using such measures in searches of large