Genetic Algorithm in DNA Computing: A

cottonseedbotanistΤεχνίτη Νοημοσύνη και Ρομποτική

24 Οκτ 2013 (πριν από 3 χρόνια και 7 μήνες)

172 εμφανίσεις



Series of Sel
ected Papers from Chun
-
Tsung Scholars,Peking University(2003)


23

Genetic Algorithm in DNA Computing: A
Solution to the Maximal Clique Problem


Department of Physics

〰⁧0慤eⰠ Li Yu慮ⰠF慮朠Chen

Tut潲: Ou祡y朠Qi


Abstract

Genet ic algorit hm is one of t he possible ways to break t he limit of brute
-
force
met hod in DNA compu
t ing. Using t he idea of Darwinian evolut ion, we int roduce a
genet ic DNA
comput ing

algorit hm to solve t he maximal clique problem. All t he
operat ions in t he algorit hm are
accessible

wit h today

s molecular biot echnology. Our
computer simulat ions show t hat wit
h t his new comput ing algorit hm, it is possible to
get a solut ion from a very small init ial data pool, avoiding
enumerat ing

all candidate
solut ions. For randomly generated problems, genet ic algorit hm can give correct
solut ion wit hin a few cycles at high pro
babilit y. Alt hough t he current speed of a DNA
computer is slow compared wit h silicon comput ers, our simulat ion indicates t hat t he
t ime requirement of t his genet ic algorit hm is approximately a linear funct ion of t he
number of vert ices in t he net work. This m
ay make DNA computers more powerful
attacking some hard computational problems.


Keywords:
DNA computer, genetic algorithm, NP
-
complete problem.



















Series of Sel
ected Papers from Chun
-
Tsung Scholars,Peking University(2003)


24


Introduction

In recent years, biomolecular computing has been documented in various
literature
s (see Ruben & Landweber, 2000
[1]

and references therein). Since Adleman

s
solution to the Hamiltonian path problem
[2]
, DNA and RNA solutions of some other
famous NP
-
complete problems, such as the maximal clique problem
[3]
, the 3
-
SAT
problem
[4]

and the kni
ght problem
[5]
, have been given. Some important progresses
have been made in the technologies in computing with biomolecules
[6
-
11]
. The power
of parallel, high
-
density computation by molecules in solution allows DNA computers
to solve hard computational pr
oblems such as NP
-
complete problems in
polynomial

increasing time, while a conventional Turing machine require exponentially
increasing time
[12]
. However, all the current DNA computing strategies are based on
enumerating

all candidate solutions, then, usin
g selection processes to eliminate
unwanted DNA. This algorithm requires that size of the initial data pool increases
exponentially with the number of variables in the calculation, so that the capacity of
the DNA computer is limited. For example, to calcul
ate a DNA solution of a maximal
clique problem, the number of molecules in the solution must be at least 2
N
, while N
is the number of nodes. With

M scale DNA concentration, a 35
-
node problem can be
solution in a typical test tube, while a 75
-
node
problem

requires a swimming pool!
Obviously, it is inaccessible in practice. In order to break the barrier of this
brute
-
force method, we need to develop new algorithms for DNA computer.

One of the strategies to overcome the volume size problem is to apply the ide
a of
Darwinian evolution. Most organisms evolve by means of two primary processes:
natural selection and sexual reproduction. The first determines which members of a
population survive to reproduce, and the second ensures mixing and recombination
among the

genes of their offspring. It was demonstrated that the generic algorithm
based on these basic processes could solve complex problems
[13]
. The algorithm
exploits the higher
-
payoff, or target regions of the solution space, because successive
generations of
reproduction and crossover produce increasing numbers of strings in
those regions. It favors the fittest strings as parents, and so above
-
average strings will
have more offspring in the next generation. We notice that the parallelism of
molecular computati
on makes it extremely convenient to apply genetic algorithm (GA)

[14]

in DNA computing. In this paper, we present our simulation results of using the
genetic algorithm to solve the maximal clique (MC) problem.
Our results show that it
is possible to get a
solution from a very small initial data pool, avoiding
enumerating

all candidate solutions. For randomly generated problems, genetic algorithm can give
correct solution within a few cycles at high probability. Although the current speed of
a DNA computer i
s slow compared with silicon computers, our simulation indicates
that the time requirement of this genetic algorithm is approximately a linear function
of the number of vertices in a network. This may make DNA computers more
powerful attacking some hard co
mputational problems.




Series of Sel
ected Papers from Chun
-
Tsung Scholars,Peking University(2003)


25

Methods


Mathematically, a clique is defined as a set of vertices in which every vertex is
connected to every other vertex by an edge. The maximal clique problem asks: Given
a network containing N vertices and M edges, how many ver
tices are in the largest
clique? In
Fig.1a
,

a problem of
five

vertices and
five

edges is defined. Obviously, the
vertices (2,3,4) make the largest clique, so the size of the largest clique in this network
is three. The maximal clique problem has been prove
n
an

NP
-
complete problem.
Besides the conventional algorithm on electronic computer and DNA computer, some
interesting
attempts

have been made
[15]
.




Fig. 1
: The original graph describing a maximal clique problem of five vertices and fi
ve edges (a)
and its complementary graph (b).

The data structure of the computation is the same of the previous work
[3]
. For a
graph of N vertices, we use an N
-
digit binary number to represent each possible
clique. In the N
-
digit binary number, a bit set t
o 1 represents the corresponding vertex
in the clique, a bit set to 0 represents the corresponding vertex out of the clique. For
example, 5
-
digit binary number (01110) represents the largest clique (2,3,4) in
Fig.1a
.
To solve the maximal clique problem is
to find an N
-
digit binary number that contains
most 1s (called
condition A
). For each graph of N vertices, we can build a
complementary graph, which contains all the missing edges in the original graph, as
shown in
Fig.1b
. Obviously, any two vertices in a
clique in the original graph are
not

connected in the complementary graph. It should be noticed that if the complementary
graph can be divided into several connected graphs, we can divide the problem into
several sub
-
problems, find each sub
-
problem

s maxim
al clique, and then the union of
these cliques makes the whole problem

s maximal clique. From this point of view, we
only need to solve problems whose complementary graph is a connected graph.

With this data structure, we design the following genetic algor
ithm to solve the
maximal clique problem in a DNA computer, and pay special attention to make sure
that all the processes in our algorithm are
readily

accessible in current
biotechnologies.



Series of Sel
ected Papers from Chun
-
Tsung Scholars,Peking University(2003)


26

We start with a small data pool containing many N
-
digit binary num
bers. In the
initial data pool, each digit of each binary number is set to 0, meaning that no vertex
is in the clique. Then we let the data pool
evol
ve. In each cycle of evolution, we first
randomly replace some digits in the data pool by 1s with certain p
robability, no matter
whether the original digit is 0 or 1. We call this operation
mutation
. Then we

eliminate

from the data pool all numbers that do not satisfy
condition A
. For example, for the
problem shown in
Fig.1a

and
Fig.1b
, such numbers as (11xxx),

(1xx1x), (1x1xx),
(xx1x1) and (xxx11) will be removed from the data pool, while x representing either 0
or 1. If the size of the data pool is smaller after the elimination, we
enlarge

it by
making copies of the binary numbers until the size of the pool re
aches that of the
initial pool. Through these operations of mutation and elimination, the cliques in the
data pool will become bigger as the increase of the number of evolution cycle.

In order to find out if there is some progress in each cycle, after the
operation of
elimination, we will find the maximum number of 1s in a string. This number
represents the current biggest clique

s size. If this size tends to increase with the
increase of number of cycles, we know that the maximal clique is not found, so th
e
computation should go on. If it stops growing for many cycles, we regard it as a sign
that the evolution has reached it
s end, so the computation
should

stop and the size
of
the clique we get at
last
is considered as

the
size of the
maximal clique.

All ab
ove
-
mentioned calculating processes can be
readily

mapped into
corresponding biological operations using current biotechnologies. In the following,
we discuss the related biotechnologies that are needed in order to carry out the
algorithm.

First, we need t
o encode the binary numbers into DNA strands, one strand
representing one binary number. We noticed that the parallel overlap assembly
technique of building the data pool to be very helpful to our algorithm
[16]
. To encode
an N
-
digit binary number, we need
N+1 DNA sequences representing the positions
and N sequences representing the values of each digit. The sequences representing the
positions are all of the same length, but the sequences representing the different
values are of different length, for exampl
e, 1s are shorter than 0s. It should be noticed
that the sequences must be of some different characters
[17]
.

There are at least two techniques available to help carry out the operation of
elimination. If we make the length of the value sequences of 1s be z
ero, then
wherever

in a binary number 1 occurs, the two position sequences next to that digit
will be directly linked. If we design the position sequences in such a way that a
restriction enzyme can break the linking point, we will eliminate value 1 in tha
t
specific position. If we want to break all the strands with 1 at position a and position b,
we only need to divide the data pool into two parts, pool A and pool B, breaking the 1
at position a in pool A and 1 at position b in pool B. The union of pool A
and pool B
will not contain DNA strands with 1 at both positions but still can contain strands with
1 at only one of the two positions (
Fig. 2a
)

[3]
. For the other technology, we make the
length of the value sequences of 0s be zero, then wherever in a bina
ry number 0
occurs, the two position sequences next to it will join. In order to eliminate the
unqualified DNA strands, we only have to pick up the qualified ones by using


Series of Sel
ected Papers from Chun
-
Tsung Scholars,Peking University(2003)


27

polyacrylamide gel containing bound probes to capture the strands with 0 at any of t
he
two positions (
Fig. 2b
)
[4]
.


Fig. 2
: Different method of eliminating the data pool. (a) use restricted enzymes; (b) pick up
qualified strands with probes in electrophoresis.


The operation of mutation would be a little difficult to

carry out. In our algorithm
the value sequences representing 0s and 1s are of different length, so we need to make
DNA sequences mutate to a different length. If the length of value 0

s sequence is
zero, we can use the restriction enzyme to cut at the joi
nt position the strands each
into two, add some DNA subsequence representing the
corresponding

value 1 into
data pool,
then

let the strands recombine. The sequence of value 1 has a chance to be
inserted into the joint of the two pieces. As our simulation i
ndicates, the mutation rate
is no more than 5%, so this technique may be enough.
In addition
, the method of
bubble PCR and ligation can also be a candidate for carrying out mutation.
Cloning or
PCR can
help carry out

the operation of enlarging
. Because the

value sequences of 1s
and 0s are of different length, an electrophoresis with marker will give us the
information of the sizes of the cliques
[3]
.


Results

In order to evaluate the algorithm, we randomly generated some problems of
different sizes, wishing

to study the time requirements of our algorithm for problems
of different difficulty levels. The problems are generated with a connecting rate of
approximate 80%, that is, when generating the problem, each pair of vertices
has

a
probability of 80% to be c
onnected by an edge. In our simulation, the size of data pool
is

10
6
. The probability at which we put 1s into digits (the mutation rate) is
2/N

in the


Series of Sel
ected Papers from Chun
-
Tsung Scholars,Peking University(2003)


28

sense that the cliques will grow averagely at the rate of two vertices per cycle. The
computation will st
op if the size of the biggest clique remains unchanged for 30
cycles. We also use a conventional recursive algorithm (RA) to solve the problems in
an exhaustive way in order to check the result of the genetic algorithm and to compare
the time requirement o
f the two algorithms. Some results of our computation are
shown in
Table 1
. As we can see from the results, the genetic algorithm can solve the
problem correctly; the cycle needed to get the correct answers is
almost
a
linear

function of the number of node
s (See Fig. 3(a)). If the connecting rate of problems is
fixed, then the number of edges in the complementary graph is order of N
2
. In each
cycle, each edge in the complementary graph brings in one operation of elimination, if
the number of cycles required

increases linearly to N, then the total time requirement
of our genetic algorithm is order of N
3
. On the other hand, the total time requirement
of the recursive algorithm is exponential to N, as shown in Fig. 3(b). More
importantly, using the genetic algo
rithm, we only have to use a very small data pool,
compared to the number of all possible solutions. The number of all possible solutions
to a 40
-
vertices problem is 2
40
, or 10
12
, to a 100
-
vertices problem is 2
100
, or 10
30
,
while the size of our data pool
is only about 10
6
, yet we can still give correct answer
in most of the
occasions

(see table 1). We thus believe that this DNA computing
algorithm overcomes the volume size barrier of brute
-
force method used in DNA
computing literature
[2
-
5]
.


Vertices

(N)

E
dges

Size
of
MC

Time (
Arbitrary

unit, by RA)

Prob. to get the
correct answer

(by GA)

Average number of
cycles used before
reaching an answer

40

624

14

1

100% (6 of 6)

14

50

980

14

5.7

100% (6 of 6)

15

60

1420

16

34.2

100% (6 of 6)

25

70

1932

18

108

100
% (6 of 6)

38

80

2530

19

703

100% (6 of 6)

41

90

3200

19

1130

100% (6 of 6)

44

100

4000

20

7440

100% (6 of 6)

57

110

4800

21

15800

100% (6 of 6)

81

130

6700

21

80800

100% (6 of 6)

85

160

10000

22

281000

83% (5 of 6)

113

Table 1



Series of Sel
ected Papers from Chun
-
Tsung Scholars,Peking University(2003)


29


Fig. 3
: Comparison

of related time requirement between genetic algorithm and recursive algorithm
in solving MC problems with different sizes. The time requirement of the genetic algorithm
increases linearly (a), while that of recursive algorithm increases exponentially (b).



The difficulty level of a problem is related not only to the number of vertices in
the graph, but also to the connecting rate. From
Table 1

we can see that at the
connecting rate of 80%, the size of maximal clique increases very slowly with the
increa
se of the number of vertices. This implies that for randomly generated problems,
a connecting rate of 80% is still not high enough to give large cliques. Therefore, we
also studied MC problems with fixed number of vertices (40 and 70) but with
increasing n
umber of edges. The main results of the computation are in
Table 2
. We
observe that the size of maximal clique increases rapidly as the connecting rate
approaches 100%. As the number of edges increases, the recursive algorithm requires
exponentially increa
sing time, as shown in Fig. 4(c), 4(d). To our surprise, the number
of cycles needed for the genetic algorithm to solve the problem
does not

increase
much, as shown in Fig 4(a), 4(b). The number of edges in complementary graph
decreases as the increase of
number of edges in the original graph, which reduces the
number of operations of elimination in each cycle, thus the increase of the total
number of operations would be even smaller.




Vertices

(N)

Edges

Size of
MC

Time (Arbitrary unit, by
RA)

Cycles need
ed

(by GA)

40

(Full
Edges:
780)

700

19

38

15

710

20

80

15

720

21

173

16

730

22

440

16

740

25

1570

16

750

26

4250

17



Series of Sel
ected Papers from Chun
-
Tsung Scholars,Peking University(2003)


30

760

27

15700

17

70

(Full
Edges:
2415)

1950

18

268

38

2000

20

661

39

2050

20

1430

40

2100

23

11300

42

2150

24

26800

44

2200

27

283000

47

Table 2



Fig. 4
: Comparison of related time requirement between genetic algorithm and recursive algorithm
in solving MC problems with different number of edges. The time requirement of the genetic
algorithm is almost unchanged (a) (b
), while that of recursive algorithm increases exponentially (c)
(d).


Discussion


Let us see into some reasons behind

genetic algorithm

s
powerfulness

in solving
this problem. Because the maximal cliques have most vertices, they are easier to be


Series of Sel
ected Papers from Chun
-
Tsung Scholars,Peking University(2003)


31

finally

arrived at than any other smaller cliques are. Although there are much more
smaller

cliques than the maximal cliques, most of them are on the right track to the
maximal cliques. To make this point more clear, see Fig. 5, which shows the process
of solving

an 8
-
vertex problem. Figure (a) describes the problem, whose linking rate
is 0.8. Figure (b) shows all the cliques, each as a point, divided into five levels, and all
the mutations that give larger cliques, each as a line. (For the reason of space, the
ex
act clique that each point represents is not shown.) From bottom to top, the size of
the cliques in each level is from 0 to 4. They grow bigger along the lines. The
mutations that fail to give a clique are eliminated by selection and so are not
contained i
n the graph. Notice that in level K, each point has K lines connecting to the
points in level K
-
1, which means there are K ways to get a clique of size K from
cliques of size K
-
1. The faded lines and points represent those mutations and cliques
that surely

cannot

lead to or end up as a maximal clique. As mentioned above, in
randomly generated problems, they are only a small part of all mutations and cliques,
the rest are on the right track. As long as percentage of cliques on the right track in
each level i
s much bigger than 1/D, where D is the size of the data pool, the genetic
algorithm can almost surely give the correct solution after some cycles. Since the data
pool of DNA molecules is very large, the capacity of the algorithm can be satisfying.




Fig. 5

(a)



Series of Sel
ected Papers from Chun
-
Tsung Scholars,Peking University(2003)


32


Fig. 5(b)

Fig. 5: (a
), a problem of
eight

vertices, (b), A network describing the relation between all the
cliques. Each vertex represents a different clique in (a). They are arranged in such a way that the
size
s of the cliques are from 0 to 4 from bottom to top. Mutation can make cliques grow larger
along the lines. The faded vertices and lines mean that the corresponding cliques and mutations
cannot lead to the maximal cliques, which are at the top level.



Ack
nowledgements


We owe many thanks to Ms. Minping Qian from the school of mathematics for her
inspiring discussion and helpful comment to our work. We acknowledge Mr. Luping
Xu

s helpful detailed information on the biotechnologies related to the algorithm
.
Other members of the lab of nonlinear science and biotechnology also lent many
helping hands to our work. This work is supported by
the Chun
-
Tsung Foundation,


863


program of National Science and Technology Department of China.




References:

1.

Ruben, A.

J. and Landweber, L. F., The past, present and future of molecular computing.
Nature Reviews Molecular Cell Biology
, 2000,

1
:

69
-
72
.

2.

Adleman, L., Molecular computation of solutions to combinatorial problems. Science, 1994,
266: 1021
-
1024.

3.

Ouyang, Q., Kapl
an, P. D., Liu S. and Libchaber, A., DNA solution of the maximal clique


Series of Sel
ected Papers from Chun
-
Tsung Scholars,Peking University(2003)


33

problem. Science, 1997, 278: 446
-
449.

4.

Braich, R. S., Chelyapov, N., Johnson, C., Rothermund, P. W. K. and Adleman, L., Solution
of a 20
-
variable 3
-
SAT problem on a DNA computer. Scienc
e, 2002, 296: 499
-
502.

5.

Faulhammer, D., Cukras, A. R., Lipton, R. J. and Landweber, L. F., Molecular computation:
RNA solutions to chess problems.
P
roc.

N
atl.

A
cad.

S
ci., 2000,

U
.
S
.
A
. 97: 1385
-
1389.

6.

Benenson, Y., Paz
-
Elizur, T., Adar, R., Keinan, E., Livneh
, Z. and Shapiro, E.,
Programmable and autonomous computing machine made of biomolecules. Nature, 2001,
414: 430
-
434.

7.

Liu, Q., Wang, L., Frutos, A. G., Condon, A. E., Corn, R. M. and Smith, L. M., DNA
computing on surfaces. Nature, 2000, 403: 175
-
179.

8.

Ogih
ara, M. and Ray, A., DNA computing on a chip. Nature, 2000, 403: 143
-
144.

9.

Sakamoto, K., Gouzu, H., Komiya, K., Kiga, D., Yokoyama, S., Yokomori, T. and Hagiya,
M., Molecular computation by DNA hairpin formation. Science, 2000, 288: 1223
-
1226.

10.

Wang, L., Hal
l, J. G., Lu, M., Liu, Q. and Smith, L. M., A DNA computing readout operation
based on structure
-
specific cleavage. Nat. Biotechnol., 2001, 19: 1053
-
1059.

11.

Zimmermann, Karl
-
Heinz, On applying molecular computation to binary linear codes. IEEE
Trans. Inform.

Theory, 2002, 48: 505
-
510.

12.

Impagliazzo, R., Paturi, R. and Zane, F., Which problems have strongly exponential
complexity?
J
.

C
omput.

S
yst.

S
ci., 2001, 23: 512
-
530, doi
: 10.1006
/jcss.2001.1774.

13.

Holland, J. H., Generic algorithm, Scientific
American
, 1992,
July
, 66
-
72.

14.

Foster. J. A., Evolutionary computation.
N
at.

R
ev.

G
enet., 2001,

2
:

428
-
436
.

15.

Chiu, D. T., Pezzoli, E., Wu, H., Stroock, A. D. and Whitesides, G. M., Using
three
-
dimensional
microfluidic

networks for solving computationally hard problems.
P
roc.

N
atl.

A
cad.

S
ci.

U
.
S
.
A
., 2001,

98
:

2961
-
2966
.

16.

Kaplan, P. D., Ouyang, Q., Thaler, D. S. and Libchaber, A., Parallel overlap assembly for the
construction of computational DNA libraries. J. Theor. Biol., 1997, 188: 333
-
341, doi:
10.1006/jtbi.1997.0475.

17.

Arit
a, M. and Kobayashi, S., The power of sequence design in DNA computing.
ICCIMA
2001: F
ourth

I
nternational

C
onference

on

C
omputational

I
ntelligence

and

M
ultimedia

A
pplications
, P
roceedings, 163
-
167.






作者简介:

李源,男,
1982

4
月生于湖南,
2000

9
月因获全国物理竞赛二等奖从
广东华南师大附中保送进北大物理系物理学专业。

方辰,男,
1982

7
月生于北京,
2000

9
月从北京三十五中考入北大物
理系物理学专业。

两人在校期间主动学习,积极思考,成绩优良,从大三下学期开始攻读数学
科学学院的双学位,曾组队参加全国大学生数学建模大赛并获北京市一等奖;在
日常生活中待人真诚,热心帮助同学,跟周围人关系融洽。




Series of Sel
ected Papers from Chun
-
Tsung Scholars,Peking University(2003)


34

感悟与寄语:

在这次科研活动中,我们磨练了意志,提高了能力:既体验到了最初尝试各
种方法屡遭失败的痛苦
,又享受到了终于得到预期结果时的喜悦;既在实际的工
作中自学了很多关于计算机和编程的实用知识,又在互相之间对算法的基本思想
的讨论中加深了对相应的数学、物理概念的理解。更重要的,我们通过这次的科
研,认识到了实际工作中和其他人合作的重要性
——
这是和个人学习之间的最大
不同,如果没有和老师、同学之间的讨论以及他们给我们的启发和帮助,完成这
篇论文是根本不可能的。


指导教师简介


欧阳颀,男,教授、
博士生导师。
1982
年毕业于清华大学化学化工系,同年
留校工作。
1983
年赴法国波尔多第一大学留学,
1989
年获法国博士学位

1989
年在美国德州大学奥斯丁分校工作,

1996
年转入美国
NEC
研究所。
1998

6
月回国到北京大学物理系从事非线性科学与生物芯片技术开发工作,同年被评为
首届“长江学者”特聘教授。

欧阳颀博士
1983
年以来一直从事非线性科学的基础理论与实验研究,并在
该领域取得了一系列重大成果,被国际同行公认为斑图动力学领域的实验科学带
头人之一。
1985
年至今他已经在国际著名科学杂志上发表论文
50
余篇。其中包
括英国《自然》杂志
3
篇,美国《科学》杂志
2
篇,美国《物理通讯快报》(
PRL

7
篇。