APPLIED GENETIC ALGORITHMS IN INFORMATION RETRIEVAL

BANGORN KLABBANKOH
Faculty of Information Technology
King Mongkuts Institute of Technology Ladkrabang
Ladkrabang Bangkok 10520
Tel. (02) 7372551-4(EXT:802) Fax. 3269074 E-Mail:S0067034@kmitl.ac.th
OUEN PINNGERN PH.D.
Department of Computer Engineering, Faculty of Engineering
King Mongkuts Institute of Technology Ladkrabang
Ladkrabang Bangkok 10520
Tel. (02) 3269969 E-Mail:kpouen@kmitl.ac.th
Abstract: This article presents an online information retrieval system that uses genetic algorithms
to increase retrieval efficiency. Under the vector space model, information retrieval is based on
the similarity measured between the query and the documents. Documents with high similarity to the
query are judged more relevant to the query and should be retrieved first. Under genetic
algorithms, each query is represented by a chromosome. These chromosomes are fed into the genetic
operator process (selection, crossover, and mutation) until we obtain an optimized query
chromosome for document retrieval. Our test results show that retrieval with a crossover
probability of 0.8 and a mutation probability of 0.01 gives the highest precision, while a
crossover probability of 0.8 and a mutation probability of 0.3 gives the highest recall.
1. INTRODUCTION
Genetic Algorithms (GAs) are probabilistic search methods that have been developed by
John Holland in 1975. [1][2] GAs applied natural selection and natural genetics in artificial
intelligence to find the globally optimal solution to the optimization problem from the feasible
solutions.
Nowadays GAs have been applied to various domains, including timetable, scheduling,
robot control, signature verification, image processing, packing, routing, pipeline control systems,
machine learning, and information retrieval [ 3][5].
2. PRINCIPLE OF GENETIC ALGORITHMS
2.1 BASIC PRINCIPLES
GAs are characterized by 5 basic components as follow:
1) Chromosome representation for the feasible solutions to the optimization problem.
2) Initial population of the feasible solutions.
3) A fitness function that evaluates each solution.
4) Genetic operators that generate a new population from the existing population.
5) Control parameters such as population size, probability of genetic operators, number
of generation etc.

2.2 PROCESS OF GENETIC ALGORITHMS
GAs is an iterative procedure which maintains a constant size population of feasible
solutions. During each iteration step, called a generation, the fitness of current population are
evaluated, and population are selected based on the fitness values. The higher fitness
chromosomes are selected for reproduction under the action of crossover and mutation to form
new population. The lower fitness chromosomes are eliminated. These new population are
evaluated, selected and fed into genetic operator process again until we get an optimal solution
(see Fig. 1)
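The iteration described above can be sketched in Python as follows. This is only a minimal outline under stated assumptions: the function name `run_ga` and the operator callables `select`, `crossover`, and `mutate` are hypothetical placeholders, not part of the original system, and termination is assumed to be a fixed number of generations.

```python
def run_ga(population, fitness, select, crossover, mutate, max_generations):
    """Generic GA loop with a constant population size (illustrative sketch).

    select(population, scores, k) -> list of k parent chromosomes
    crossover(a, b)               -> pair of offspring chromosomes
    mutate(chromosome)            -> new chromosome
    """
    size = len(population)
    for _ in range(max_generations):
        # Assess the current population.
        scores = [fitness(c) for c in population]
        # Select parents based on fitness.
        parents = select(population, scores, size)
        # Cross over consecutive pairs of parents.
        offspring = []
        for i in range(0, size - 1, 2):
            offspring.extend(crossover(parents[i], parents[i + 1]))
        # With an odd population size, copy the last parent unchanged.
        if len(offspring) < size:
            offspring.append(list(parents[-1]))
        # Mutate the offspring to form the next generation.
        population = [mutate(c) for c in offspring]
    # Return the fittest chromosome of the final generation.
    return max(population, key=fitness)
```

Passing the operators in as arguments keeps the loop independent of any particular selection, crossover, or mutation scheme.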
3. ONLINE INFORMATION RETRIEVAL USING GENETIC ALGORITHMS
3.1 CHROMOSOME REPRESENTATION
Online information retrieval using genetic algorithms is based on the vector space model.
Within this model, both documents and queries are represented by vectors. A particular document
is represented by a vector of terms and a particular query is represented by a vector of query terms.
[Figure 1: flowchart. Generate Initial Population, Assess Initial Population, Select Population,
Crossover New Population, Mutate New Population, Assess New Population, Terminate Search?
(No: repeat the cycle; Yes: Stop)]
FIGURE 1 THE PROCESS OF GENETIC ALGORITHMS
A document vector (Doc) with n keywords and a query vector (Query) with m query terms can be
represented as
$$\mathrm{Doc} = (\mathit{term}_1, \mathit{term}_2, \mathit{term}_3, \ldots, \mathit{term}_n)$$
$$\mathrm{Query} = (\mathit{qterm}_1, \mathit{qterm}_2, \mathit{qterm}_3, \ldots, \mathit{qterm}_m)$$
We use binary term vectors, so each $\mathit{term}_i$ (or $\mathit{qterm}_j$) is either 0 or 1.
$\mathit{term}_i$ is set to zero when term i is not present in the document and set to one when
term i is present in the document.
For example, suppose a user enters a query into our system that retrieves 5 documents. These
documents are
Doc_1 = {Relational Databases, Query, Data Retrieval, Computer Networks, DBMS}
Doc_2 = {Artificial Intelligence, Internet, Indexing, Natural Language Processing}
Doc_3 = {Databases, Expert System, Information Retrieval System, Multimedia}
Doc_4 = {Fuzzy Logic, Neural Network, Computer Networks}
Doc_5 = {Object-Oriented, DBMS, Query, Indexing}
All keywords of these documents can be arranged in ascending (alphabetical) order as
Artificial Intelligence, Computer Networks, Data Retrieval, Databases, DBMS, Expert
System, Fuzzy Logic, Indexing, Information Retrieval System, Internet, Multimedia, Natural
Language Processing, Neural Network, Object-Oriented, Query, Relational Databases
The documents are then encoded in the chromosome representation as
Doc_1 = 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1
Doc_2 = 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0
Doc_3 = 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0
Doc_4 = 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0
Doc_5 = 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0
These chromosomes are called the initial population and are fed into the genetic operator process.
The length of a chromosome depends on the number of keywords in the documents retrieved by the
user query. In our example the length of each chromosome is 16 bits.
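The encoding of the five example documents can be sketched in Python as follows. The helper names `build_vocabulary` and `encode` are illustrative, and the case-insensitive sort is an assumption chosen so that the keyword order matches the ascending list given above; with these assumptions the sketch should reproduce the 16-bit chromosomes of the example.

```python
# Keyword sets of the five example documents.
docs = {
    "Doc1": {"Relational Databases", "Query", "Data Retrieval",
             "Computer Networks", "DBMS"},
    "Doc2": {"Artificial Intelligence", "Internet", "Indexing",
             "Natural Language Processing"},
    "Doc3": {"Databases", "Expert System", "Information Retrieval System",
             "Multimedia"},
    "Doc4": {"Fuzzy Logic", "Neural Network", "Computer Networks"},
    "Doc5": {"Object-Oriented", "DBMS", "Query", "Indexing"},
}


def build_vocabulary(documents):
    """Collect all keywords and sort them in ascending order, ignoring case."""
    return sorted(set().union(*documents.values()), key=str.lower)


def encode(keywords, vocabulary):
    """Binary chromosome: 1 if the keyword is present in the document, else 0."""
    return [1 if term in keywords else 0 for term in vocabulary]


vocabulary = build_vocabulary(docs)
population = {name: encode(terms, vocabulary) for name, terms in docs.items()}
```

The chromosome length equals `len(vocabulary)`, so it grows with the number of distinct keywords in the retrieved documents.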
3.2 FITNESS EVALUATION
A fitness function is a performance measure or reward function that evaluates how good
each solution is. The information retrieval problem is how to retrieve the documents the user
requires. We can therefore use the fitness functions in Table 1 to calculate the similarity
between a document and a query. Table 1 gives 2 types of fitness function: for weighted term
vectors and for binary term vectors.
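We define $X = (x_1, x_2, x_3, \ldots, x_t)$ and $Y = (y_1, y_2, y_3, \ldots, y_t)$, where $t$ is
the number of terms, $|X|$ = number of terms occurring in $X$, and
$|X \cap Y|$ = number of terms occurring in both $X$ and $Y$ [6].
TABLE 1 FITNESS FUNCTIONS
Similarity Measure Sim(X,Y) | Binary Term Vectors | Weighted Term Vectors
Dice coefficient | $\dfrac{2\,|X \cap Y|}{|X| + |Y|}$ | $\dfrac{2 \sum_{i=1}^{t} x_i y_i}{\sum_{i=1}^{t} x_i^2 + \sum_{i=1}^{t} y_i^2}$
Cosine coefficient | $\dfrac{|X \cap Y|}{|X|^{1/2} \cdot |Y|^{1/2}}$ | $\dfrac{\sum_{i=1}^{t} x_i y_i}{\sqrt{\sum_{i=1}^{t} x_i^2 \cdot \sum_{i=1}^{t} y_i^2}}$
Jaccard coefficient | $\dfrac{|X \cap Y|}{|X| + |Y| - |X \cap Y|}$ | $\dfrac{\sum_{i=1}^{t} x_i y_i}{\sum_{i=1}^{t} x_i^2 + \sum_{i=1}^{t} y_i^2 - \sum_{i=1}^{t} x_i y_i}$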
The results of these fitness functions lie in the interval 0 to 1. A value of 1.0 means the
document and the query are identical. Values near 1.0 mean the document and query are more
relevant to each other, and values near 0.0 mean they are less relevant. The values computed by
the fitness functions are called fitness.
3.3 SELECTION
After we evaluate the population's fitness, the next step is chromosome selection. Selection
embodies the principle of survival of the fittest. Chromosomes with satisfactory fitness are
selected for reproduction. Poor chromosomes, that is, those with lower fitness, may be selected
only a few times or not at all.
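The text does not name a specific selection scheme. One common choice that matches the description is fitness-proportionate (roulette-wheel) selection, sketched below in Python as an assumption rather than as the method actually used; the function name `roulette_select` and its parameters are hypothetical.

```python
import random


def roulette_select(population, scores, k, rng=random):
    """Pick k chromosomes with probability proportional to their fitness."""
    total = sum(scores)
    if total == 0:
        # All fitness values are zero: fall back to uniform random choice.
        return [rng.choice(population) for _ in range(k)]
    chosen = []
    for _ in range(k):
        spin = rng.uniform(0, total)
        running = 0.0
        for chromosome, score in zip(population, scores):
            running += score
            if spin < running:
                chosen.append(chromosome)
                break
        else:
            # Guard against floating-point rounding at the upper end.
            chosen.append(population[-1])
    return chosen
```

A chromosome with high fitness can be drawn several times, while one with fitness near zero is rarely or never drawn.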
3.4 CROSSOVER
Crossover is the genetic operator that mixes two chromosomes together to form new
offspring. Crossover occurs only with some probability (the crossover probability). Chromosomes
that are not subjected to crossover remain unmodified. The intuition behind crossover is
exploration of new solutions and exploitation of old solutions. GAs construct a better solution by
mixing the good characteristics of chromosomes together. Higher-fitness chromosomes have a greater
opportunity to be selected than lower-fitness ones, so good solutions tend to survive into the
next generation.
Crossover techniques include one-point crossover, two-point crossover, and multiple-point
crossover. If the structures are represented as binary strings, one-point crossover can be
implemented by choosing a point at random, called the crossover point, and exchanging the segments
to the right of this point. Two-point crossover exchanges the segment between two such points. For
example, two chromosomes are crossed over between positions 5 and 11, exchanging positions 6 to 11.
1 0 1 1 1 1 1 1 0 0 1 1 1 0 1
1 0 0 1 1 0 0 1 1 1 1 0 0 0 0
The resulting crossover yields two new chromosomes.
1 0 1 1 1 0 0 1 1 1 1 1 1 0 1
1 0 0 1 1 1 1 1 0 0 1 0 0 0 0
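The example above can be reproduced with a short Python sketch of two-point crossover. It assumes chromosomes are lists of bits and that the two cut points are given as positions counted from 1, with the segment after the first cut up to and including the second cut being exchanged. The names `two_point_crossover` and `maybe_crossover` and the parameters `pc` and `rng` are illustrative.

```python
import random


def two_point_crossover(p1, p2, left, right):
    """Exchange positions left+1 .. right (1-based) between two parents."""
    c1 = p1[:left] + p2[left:right] + p1[right:]
    c2 = p2[:left] + p1[left:right] + p2[right:]
    return c1, c2


def maybe_crossover(p1, p2, pc, rng):
    """Apply two-point crossover with probability pc, else copy the parents."""
    if rng.random() >= pc:
        return p1[:], p2[:]
    # Two distinct cut points strictly inside the chromosome.
    left, right = sorted(rng.sample(range(1, len(p1)), 2))
    return two_point_crossover(p1, p2, left, right)


parent1 = [1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1]
parent2 = [1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
child1, child2 = two_point_crossover(parent1, parent2, 5, 11)
```

With cut points 5 and 11, `child1` and `child2` are the two new chromosomes shown in the example.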
3.5 MUTATION
Mutation modifies the value of each gene of a solution with some probability (the mutation
probability). Changing some bit values of a chromosome produces a different individual. The
mutated chromosomes may be better or poorer than the old chromosomes. If they are poorer than the
old chromosomes, they are eliminated in the selection step. The objective of mutation is to
restore lost bit values and to explore a wider variety of data. For example, randomly mutating a
chromosome at position 10:
Before 1 0 1 1 1 1 1 1 0 0 1 1 1 0 1
Result 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1
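A minimal Python sketch of bit-flip mutation follows, assuming chromosomes are lists of bits. `flip_bit` reproduces the single-position example above (positions counted from 1), while `mutate` applies the mutation probability to every gene; both names and the parameters `pm` and `rng` are illustrative.

```python
import random


def flip_bit(chromosome, position):
    """Return a copy with the bit at the given 1-based position inverted."""
    mutated = chromosome[:]
    mutated[position - 1] ^= 1
    return mutated


def mutate(chromosome, pm, rng):
    """Invert each gene independently with probability pm."""
    return [gene ^ 1 if rng.random() < pm else gene for gene in chromosome]


before = [1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1]
after = flip_bit(before, 10)
```

With a small `pm` such as 0.01 most chromosomes pass through unchanged, whereas a larger value such as 0.30 changes several bits of a typical chromosome per generation.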
3.6 PROCESS OF OUR SYSTEM
1. The user enters a query into our system.
2. Match keywords from the user query against the list of keywords.
3. Encode the documents retrieved by the user query as chromosomes (the initial population).
4. Feed the population into the genetic operator process: selection, crossover, and
mutation.
5. Repeat step 4 until the maximum generation is reached. This gives an optimized query
chromosome for document retrieval.
6. Decode the optimized query chromosome into a query and retrieve documents from the database.
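The decoding in step 6 is the inverse of the chromosome encoding: each bit set to 1 selects the keyword at the same position in the ordered keyword list. A minimal Python sketch, assuming the 16-keyword list from the example in Section 3.1 and a hypothetical helper name `decode`:

```python
# Ordered keyword list from the example in Section 3.1.
VOCABULARY = [
    "Artificial Intelligence", "Computer Networks", "Data Retrieval",
    "Databases", "DBMS", "Expert System", "Fuzzy Logic", "Indexing",
    "Information Retrieval System", "Internet", "Multimedia",
    "Natural Language Processing", "Neural Network", "Object-Oriented",
    "Query", "Relational Databases",
]


def decode(chromosome, vocabulary):
    """Return the keywords whose bits are set to 1 in the chromosome."""
    if len(chromosome) != len(vocabulary):
        raise ValueError("chromosome and vocabulary lengths differ")
    return [term for bit, term in zip(chromosome, vocabulary) if bit == 1]


query_terms = decode([0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0], VOCABULARY)
```

The resulting keyword list can then be used as the query submitted to the document database.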
4. EXPERIMENTATION
4.1 TEST CASE FORMULATION
This experiment tests 21 queries with 3 different fitness functions: Jaccard
coefficient (F1), cosine coefficient (F2), and Dice coefficient (F3). Each fitness function
is tested with a set of parameters, the probability of crossover (Pc = 0.8) and the probability of
mutation (Pm = 0.01, 0.10, 0.30), to compare the efficiency of the retrieval system. Information
retrieval efficiency is measured by recall and precision.
Recall is defined as the proportion of relevant documents that are retrieved (see equation 1) [4][6]
$$\text{Recall} = \frac{\text{Number of documents retrieved and relevant}}{\text{Total relevant in collection}} \qquad (1)$$
Precision is defined as the proportion of retrieved documents that are relevant (see equation
2) [4][6]
$$\text{Precision} = \frac{\text{Number of documents retrieved and relevant}}{\text{Total retrieved}} \qquad (2)$$
The test database consisted of 343 documents taken from student projects of the
Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang.
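Equations (1) and (2) can be computed from sets of document identifiers, as in the following Python sketch. The function name `precision_recall` and the sample identifiers are illustrative, and returning 0.0 when a denominator is zero is an assumption made here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # documents retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: 4 documents retrieved, 3 relevant in the collection,
# 2 of which appear among the retrieved ones.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 5})
```

Here precision is 2/4 and recall is 2/3, which illustrates that the two measures share a numerator but differ in their denominators.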
TABLE 2. INFORMATION RETRIEVAL BY 3 FITNESS FUNCTIONS
WITH Pc = 0.8 AND Pm = 0.01
Keywords     Query Chromosome                          F1    F2    F3    RetRel  RetNRel
application  00100000000000000000000000001100          0.84  0.91  0.90  30      1
database     0001000000000000000000000000000010000100  0.59  0.65  0.65  34      8
DNS          0011001001                                1.00  1.00  1.00  6       2
internet     00000000000010000000000001                0.76  0.86  0.84  41      -
marketing    0110110                                   1.00  1.00  1.00  11      8
recognition  11000                                     0.71  0.75  0.74  7       -
security     000100100                                 1.00  1.00  1.00  17      57
network      0000100000000010000000                    1.00  1.00  1.00  78      21
4.2 EXPERIMENT RESULTS
Preliminary testing indicated the following:
1. Testing with the 3 fitness functions shows that the optimized queries from these
fitness functions are all the same, but with different fitness values (F1, F2, and F3), as
shown in Table 2. In Table 2, RetRel is the number of retrieved relevant documents
and RetNRel is the number of retrieved but not relevant documents.
[Figure 2: plot of efficiency (precision and recall) against mutation probability Pm]
FIG. 2 PRECISION AND RECALL
2. Information retrieval with Pc = 0.8 and Pm = 0.01 yields the highest precision, 0.746,
while Pm = 0.10 yields a moderate precision of 0.560 and Pm = 0.30 yields the lowest
precision, 0.417, as shown in Figure 2.
3. Information retrieval with Pc = 0.8 and Pm = 0.30 yields the highest recall, 0.976,
while Pm = 0.01 yields a moderate recall and Pm = 0.10 yields the lowest recall, 0.786,
as shown in Figure 2.
5. CONCLUSIONS
The preliminary experiment indicates that precision and recall are inversely related. Which
parameters to use depends on what the user wants to retrieve. If high-precision results are
preferred, the parameters should be a high crossover probability and a low mutation probability.
If more of the relevant documents (high recall) are preferred, the parameters should be a high
mutation probability and a lower crossover probability. The preliminary experiment also indicates
that GAs can be used in information retrieval. Further study will test with larger databases and
present the retrieved documents ordered by fitness values, which represent the user's desire.
REFERENCES
[1] Davis, L. Handbook of Genetic Algorithms. New York: Van Nostrand Reinhold. 1991.
[2] Goldberg, D.E. Genetic Algorithms in Search, Optimization, and Machine Learning. New
York: Addison-Wesley Publishing Co. Inc. 1989.
[3] Kraft, D.H. et al. The Use of Genetic Programming to Build Queries for Information
Retrieval. In Proceedings of the First IEEE Conference on Evolutionary Computation. New
York: IEEE Press. 1994. pp. 468-473.
[4] Korfhage, R.R. Information Storage and Retrieval. New York: Wiley Computer Publishing.
1997.
[5] Martin-Bautista, M.J. et al. An Approach to An Adaptive Information Retrieval Agent using
Genetic Algorithms with Fuzzy Set Genes. In Proceedings of the Sixth International
Conference on Fuzzy Systems. New York: IEEE Press. 1997. pp. 1227-1232.
[6] Salton, G. Automatic Text Processing: The Transformation, Analysis, and Retrieval of
Information by Computer. New York: Addison-Wesley Publishing Co. Inc. 1989.