A Data Mining Algorithm based on Genetic Algorithm




Yun Bai^1, Deyu Qi^2, Qiangguo Pu^1, Nikos Mastorakis^3

1 Computer Center, University of Science and Technology of Suzhou, Jiangsu 215011, China

2 College of Computer Science, South China University of Technology, Guangzhou 510640, China

3 Hellenic Naval Academy, Terma Hatzikyriakou, 18539 Piraeus, Greece

Also: WSEAS, Ag.I.Theologou 17-23, 15773, Zografou, Athens, Greece


Abstract: - Data Mining (DM) is an active research topic in the field of database technology. In this paper, we discuss the application of the genetic algorithm (GA) to DM and present the basic idea and the key design questions of a new DM algorithm based on GA, namely knowledge rule expression, knowledge rule coding and fitness function definition. We illustrate the validity of the new DM algorithm with a worked example.


Keywords: - Genetic Algorithm, Data Mining, Knowledge Expression, Knowledge Rule


1. Data Mining and Knowledge Expression

1.1 Data Mining


Data Mining (DM) is the process of extracting previously unknown and useful information and knowledge from large amounts of incomplete, uncertain, fuzzy and random data. Similar processes include Knowledge Discovery, Data Analysis, Data Fusion and Decision-Making Support [1]. The types of knowledge discovered by DM include generalized, characteristic, contrast, correlative, predictive and deviation knowledge, while classification, clustering, dimension reduction, pattern recognition, visualization, decision trees, genetic algorithms and uncertainty handling are the tools and methods normally used for the discovery [2].



1.2 Knowledge Expression


The correct selection and use of a knowledge expression is very important to DM and can greatly improve the efficiency of problem solving [3]. Knowledge may be expressed as predicate logic, production rules, semantic networks, frames and neural networks. The knowledge extracted in DM is normally expressed as concepts, rules, regularities and patterns [1]. Most DM can be regarded as search, with the database as the search space. The DM algorithm described in this paper may be considered a search policy that uses the genetic algorithm to search the database: a group of randomly generated rules is evolved until the rule group covers the database, so that the useful rules hidden in the database are mined.

Because the production rule is simple and clear to express, it is the form most widely applied in current expert systems. A production rule is composed of two parts: premise and conclusion. The premise is the precondition that must be met for the rule to apply, while the conclusion is the action or result of applying the rule. A production rule may be described as:

$$\text{IF } E_1(A_1, A_2, \ldots, A_m) \wedge E_2(A_1, A_2, \ldots, A_m) \wedge \cdots \wedge E_n(A_1, A_2, \ldots, A_m) \text{ THEN } H \ (\text{Conclusion})$$

Here, $E_i(A_1, A_2, \ldots, A_m)$ ($1 \le i \le n$) is a premise over the attributes $A_j$ ($1 \le j \le m$). If the attributes are related by OR ($\vee$), $E_i(A_1, A_2, \ldots, A_m)$ is expressed in disjunctive form, that is, $E_i = A_1$ OR $E_i = A_2$ OR … OR $E_i = A_m$. If the attributes are related by AND ($\wedge$), $E_i(A_1, A_2, \ldots, A_m)$ is expressed in conjunctive form, that is, $E_i = A_1$ AND $E_i = A_2$ AND … AND $E_i = A_m$. The following example shows a rule expression and its meaning.

A knowledge classification rule:

IF PRICE(moderate, low) ∧ QUALITY(high, normal) ∧ SERVICE(good) THEN CLASS=acc

is equivalent to

IF (PRICE=moderate ∨ PRICE=low) ∧ (QUALITY=high ∨ QUALITY=normal) ∧ (SERVICE=good) THEN CLASS=acc.

The rule means that if the price of a commodity is moderate or low, its quality is high or normal, and its service is good, then the commodity is accepted.
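
The rule form above can be illustrated with a small Python sketch. It is only one possible representation, not the paper's implementation: the dictionary layout and the helper names `matches_premise` and `classify` are assumptions introduced here. Values listed for one attribute are OR-ed, and the attributes themselves are AND-ed.

```python
from typing import Dict, Optional, Set

# Hypothetical representation: attribute name -> set of accepted values.
Premise = Dict[str, Set[str]]


def matches_premise(premise: Premise, instance: Dict[str, str]) -> bool:
    """True if every attribute in the premise accepts the instance's value."""
    return all(instance[attr] in allowed for attr, allowed in premise.items())


def classify(premise: Premise, conclusion: str,
             instance: Dict[str, str]) -> Optional[str]:
    """Return the conclusion if the premise holds, otherwise None."""
    return conclusion if matches_premise(premise, instance) else None


# IF PRICE(moderate, low) AND QUALITY(high, normal) AND SERVICE(good) THEN CLASS=acc
rule_premise = {
    "PRICE": {"moderate", "low"},
    "QUALITY": {"high", "normal"},
    "SERVICE": {"good"},
}

accepted = classify(rule_premise, "acc",
                    {"PRICE": "low", "QUALITY": "normal", "SERVICE": "good"})
rejected = classify(rule_premise, "acc",
                    {"PRICE": "high", "QUALITY": "normal", "SERVICE": "good"})
```

In this sketch a commodity with a high price does not satisfy the premise, so the rule says nothing about its class.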


2. Genetic algorithm and realization of genetic operation

2.1 Genetic algorithm


The genetic algorithm (GA) simulates the process of biological evolution to carry out an optimizing search; it is essentially a random search method based on the simulation of biological evolution [6]. Its basic principle can be summarized as follows: the objective function of the optimization is mapped to the fitness of a biological population in its environment, the variables being optimized correspond to the individuals of the population, and the procedure for finding the optimized solution is treated as analogous to the evolution of the population [7].

Compared with traditional optimization algorithms, the genetic algorithm has three distinctive features: high robustness, global search capability and inherent parallelism.


2.2 Genome coding


An important step in GA is coding, that is, the transformation of the parameters of a solution to the optimization problem into a genome code string.

Taking the coding of a knowledge classification rule as an example, the genetic code string of a rule (individual) is realized as a binary code according to the features of the knowledge classification rule [8]. A rule is divided into two parts, the rule characteristic (premise) and the rule sort (conclusion), for example:

IF PRICE(moderate, low) ∧ QUALITY(high, normal) ∧ SERVICE(good) THEN CLASS=acc

Suppose that the characteristic attributes are discrete (any numerical attribute is discretized first). If a discrete attribute has k possible values, then k bits are allocated to it in the binary string, one bit per value: 0 means that the value does not appear in the disjunctive form, and 1 means that it does. To simplify the algorithm, the sort (class) of an individual is expressed as a contiguous binary string for the sort attribute. For the above rule, if the value field of the characteristic attribute PRICE is {high, moderate, low}, the value field of QUALITY is {high, normal, poor}, the value field of SERVICE is {good, bad}, and the value field of the sort attribute is {uacc, acc, good, vgood}, encoded as 00, 01, 10 and 11 respectively, then the rule is expressed as the genome (binary string) 0111101001, and rules and genomes correspond one to one. Similarly, the binary string 0011001011 corresponds to the rule:

IF PRICE(low) ∧ QUALITY(high) ∧ SERVICE(good) THEN CLASS=vgood

If the code of a characteristic attribute consists entirely of 1s, then the attribute does not affect the validity of the rule, whatever its value. For example, if PRICE is coded as 111 and the code of the rule is 1111001010, the rule means that if the quality of the commodity is high and the service is good, then the commodity is rated good (class code 10), regardless of its price. Conversely, the rule "if the service of the commodity is bad, then the commodity is unaccepted", that is, IF SERVICE(bad) THEN CLASS=uacc, corresponds to the binary code string 1111110100.
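
As an illustration of this coding scheme, the following Python sketch encodes and decodes rules using the value fields given above. The function names `encode_rule` and `decode_rule` and the dictionary-based rule representation are assumptions made for this example, not part of the original algorithm description.

```python
# Value fields in the order used by the encoding in the text.
DOMAINS = [
    ("PRICE", ["high", "moderate", "low"]),
    ("QUALITY", ["high", "normal", "poor"]),
    ("SERVICE", ["good", "bad"]),
]
CLASSES = ["uacc", "acc", "good", "vgood"]  # coded as 00, 01, 10, 11


def encode_rule(premise, conclusion):
    """Build the 10-bit genome: one bit per attribute value, then 2 class bits."""
    bits = ""
    for attr, values in DOMAINS:
        # An attribute missing from the premise is unconstrained: all 1s.
        allowed = premise.get(attr, set(values))
        bits += "".join("1" if v in allowed else "0" for v in values)
    return bits + format(CLASSES.index(conclusion), "02b")


def decode_rule(genome):
    """Inverse of encode_rule; all-1s attributes are left out of the premise."""
    premise, pos = {}, 0
    for attr, values in DOMAINS:
        chunk = genome[pos:pos + len(values)]
        pos += len(values)
        if "0" in chunk:
            premise[attr] = {v for v, b in zip(values, chunk) if b == "1"}
    return premise, CLASSES[int(genome[pos:pos + 2], 2)]


genome_acc = encode_rule(
    {"PRICE": {"moderate", "low"}, "QUALITY": {"high", "normal"}, "SERVICE": {"good"}},
    "acc",
)
genome_uacc = encode_rule({"SERVICE": {"bad"}}, "uacc")
decoded = decode_rule("1111001010")
```

Under these assumptions `genome_acc` is `0111101001` and `genome_uacc` is `1111110100`, matching the strings quoted in the text.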


2.3 Definition of the fitness function


When GA mines classification rules from the knowledge database, the "good rules" act as the parent-generation rules for reproduction, crossover and mutation until the optimized rule group is discovered. A "good rule" is one that matches the instances within the test data set well (an instance is also called a record and includes featured and classified attributes), so the fitness function must reflect the degree of matching between a rule and the data set. Three important parameters of a rule are considered in the definition of the fitness function: accuracy, utility and coverage. They are described below, assuming that $U$ is a test data set and $e$ is an instance (element) within it.

Accuracy: The accuracy of a rule $r_i$ measures the degree of matching between the instances within the test data set $U$ and the rule, and is given by the following formula:

$$\mathrm{Accuracy}(r_i) = \frac{|U^{+}_{r_i}|}{|U_{r_i}|}$$

Here $U^{+}_{r_i}$ is the subset of the test data set $U$ in which each element (instance) matches the rule $r_i$ to be evolved, and $|U^{+}_{r_i}|$ is the cardinality of that subset. $U_{r_i}$ is also a subset of $U$, in which only the characteristic part (premise) of each element (instance) has to match the rule $r_i$ to be evolved, and $|U_{r_i}|$ is the cardinality of that subset. The higher the accuracy of a rule, the more the rule can be trusted. If the accuracy is 1, the rule always holds within the test data set $U$: whenever the condition of the rule holds, its conclusion holds as well.

Utility: Within the test data set $U$, an instance $e$ may match several of the rules to be evolved, and each rule has its own measurable utility. If an instance $e$ within $U$ matches only one rule to be evolved within the current population, then the utility of that rule is 1. If an instance $e$ matches m rules to be evolved within the current population, then each of these rules has a utility of 1/m, as indicated by the following formula:

$$\mathrm{Utility}(r) = \begin{cases} 1/m, & \text{if the instance } e \in U \text{ successfully matches the rule } r \\ 0, & \text{otherwise} \end{cases}$$

Here $U$ and $e$ are defined as above, and $r$ is a rule within the current population.

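Coverage: The coverage of a rule $r_i$ is the number of instances $e$ within the test data set $U$ that match the premise part (characteristic part) of the rule $r_i$ to be evolved, as indicated by the following formula:

$$\mathrm{Coverage}(r_i) = \left|\{\, e \in U : e \text{ matches the premise of } r_i \,\}\right|$$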


The higher the coverage of a rule, the more general the rule is. Since the values of all three parameters determine the fitness of a rule, higher values are desirable for each of them.
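
The three parameters can be sketched in Python on the 13 instances of the example table in Section 3. This is an illustration under stated assumptions rather than the authors' code: "matching a rule" is taken to mean that both premise and conclusion agree with the instance, the per-instance shares 1/m are averaged over the instances a rule matches to obtain a single utility value per rule, and all function names are hypothetical.

```python
# (price, quality, service, class) for the 13 instances of the example table.
DATA = [
    ("high", "poor", "bad", "uacc"), ("moderate", "normal", "bad", "uacc"),
    ("low", "high", "good", "vgood"), ("high", "high", "good", "good"),
    ("moderate", "high", "good", "good"), ("moderate", "normal", "good", "acc"),
    ("high", "normal", "good", "acc"), ("high", "poor", "good", "uacc"),
    ("moderate", "poor", "bad", "uacc"), ("low", "poor", "bad", "uacc"),
    ("high", "normal", "good", "acc"), ("moderate", "normal", "good", "acc"),
    ("low", "normal", "good", "good"),
]
ATTRS = ("PRICE", "QUALITY", "SERVICE")


def premise_ok(rule, e):
    """True if instance e satisfies the premise of rule = (premise, class)."""
    return all(e[ATTRS.index(a)] in vals for a, vals in rule[0].items())


def rule_ok(rule, e):
    """True if e satisfies the premise and carries the rule's class."""
    return premise_ok(rule, e) and e[3] == rule[1]


def coverage(rule, data):
    return sum(premise_ok(rule, e) for e in data)


def accuracy(rule, data):
    covered = coverage(rule, data)
    return sum(rule_ok(rule, e) for e in data) / covered if covered else 0.0


def utility(rule, population, data):
    """Mean of 1/m over matched instances (assumed aggregation).

    `rule` must be a member of `population`, so m is at least 1.
    """
    shares = [1 / sum(rule_ok(r, e) for r in population)
              for e in data if rule_ok(rule, e)]
    return sum(shares) / len(shares) if shares else 0.0


r1 = ({"QUALITY": {"poor"}}, "uacc")
r2 = ({"SERVICE": {"good"}}, "acc")
r3 = ({"SERVICE": {"bad"}}, "uacc")
population = [r1, r2, r3]
```

With this population, `r1` covers four instances with accuracy 1, while `r2` covers nine instances of which four are classified acc. Three of the four instances matched by `r1` are also matched by `r3`, so the utility of `r1` drops below 1.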

Before the fitness function of a rule is determined, the relationships between rules must be analyzed. There are three possible relationships between rules: contain, redundant and contradictory. They are illustrated below.

Contain: the rule

IF PRICE(moderate, low) ∧ SERVICE(good) THEN CLASS=acc

is contained in the more general rule

IF SERVICE(good) THEN CLASS=acc

Contradictory: the rules

IF PRICE(high) ∧ QUALITY(poor) ∧ SERVICE(bad) THEN CLASS=acc

and

IF PRICE(high) ∧ QUALITY(poor) ∧ SERVICE(bad) THEN CLASS=uacc

have the same premise but different conclusions.

Redundant: this is the simplest case, namely two identical rules.
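
On the binary genomes of Section 2.2, these three relationships reduce to simple string checks. The sketch below assumes the 10-bit layout used earlier (8 premise bits followed by 2 class bits) and treats containment as a bitwise-subset test on the premise; the function names are hypothetical. The containment example is encoded with the PRICE values moderate and low (011).

```python
PREMISE_BITS = 8  # 3 (PRICE) + 3 (QUALITY) + 2 (SERVICE); 2 class bits follow


def split(genome):
    """Split a genome into (premise bits, class bits)."""
    return genome[:PREMISE_BITS], genome[PREMISE_BITS:]


def is_redundant(a, b):
    """Two identical rules."""
    return a == b


def is_contradictory(a, b):
    """Same premise, different conclusion."""
    (pa, ca), (pb, cb) = split(a), split(b)
    return pa == pb and ca != cb


def is_contained_in(a, b):
    """True if a has the same class as b and a strictly narrower premise.

    Every value allowed by a (bit 1) must also be allowed by b.
    """
    (pa, ca), (pb, cb) = split(a), split(b)
    subset = all(x == "0" or y == "1" for x, y in zip(pa, pb))
    return ca == cb and pa != pb and subset


specific = "0111111001"  # IF PRICE(moderate, low) AND SERVICE(good) THEN acc
general = "1111111001"   # IF SERVICE(good) THEN acc
says_acc = "1000010101"  # IF PRICE(high) AND QUALITY(poor) AND SERVICE(bad) THEN acc
says_uacc = "1000010100"  # same premise, THEN uacc
```

A filter built from these checks could be applied to the population before fitness evaluation; how the paper removes such rules in practice is not specified.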

The relationships between rules must be considered in the design of the fitness function: contained, contradictory and redundant rules are deleted to ensure that every parent-generation rule selected is a "good rule". The algorithm for evaluating rule fitness is as follows:

Step 1: Let $U$ be a test data set and $P$ a population of rules (genomes) of size N;

Step 2: Calculate Accuracy($r_i$) and Utility($r_i$) for each rule $r_i$ (i = 1, 2, …, N) within the current population $P$;

Step 3: Arrange the rules $r_i$ in descending order of the product of Accuracy($r_i$) and Utility($r_i$), and write the result into the sort table;

Step 4: Calculate Coverage($r_j$) for the current rule $r_j$ in the sort table, starting with the first rule (initially $j$ = 0), and set fitness($r_j$) = Coverage($r_j$) * Accuracy($r_j$) * Utility($r_j$);

Step 5: Delete from $U$ the instances covered by the rule $r_j$, i.e. $U \leftarrow U \setminus \{\, e \in U : e \text{ is covered by } r_j \,\}$;

Step 6: If $r_j$ is the last rule in the sort table, the fitness of all rules has been calculated and the procedure exits; otherwise set $j$ = $j$ + 1 and go to Step 4.
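
The six steps can be sketched in Python as follows. The sketch makes several assumptions that the text does not fix: an instance "matches" a rule when premise and conclusion both agree, the utility of a rule is the mean of 1/m over the instances it matches, and an instance counts as "covered" when it satisfies the premise. The helper names are hypothetical, and the resulting numbers are not expected to reproduce the fitness values reported in Section 3.

```python
ATTRS = ("PRICE", "QUALITY", "SERVICE")
ROWS = """high poor bad uacc
moderate normal bad uacc
low high good vgood
high high good good
moderate high good good
moderate normal good acc
high normal good acc
high poor good uacc
moderate poor bad uacc
low poor bad uacc
high normal good acc
moderate normal good acc
low normal good good"""
DATA = [tuple(line.split()) for line in ROWS.splitlines()]


def premise_ok(rule, e):
    return all(e[ATTRS.index(a)] in vals for a, vals in rule[0].items())


def rule_ok(rule, e):
    return premise_ok(rule, e) and e[3] == rule[1]


def evaluate_fitness(population, data):
    """Return one fitness value per rule, in the order of `population`."""
    # Step 2: accuracy and utility of every rule on the full data set.
    scores = []
    for idx, rule in enumerate(population):
        covered = [e for e in data if premise_ok(rule, e)]
        matched = [e for e in covered if rule_ok(rule, e)]
        acc = len(matched) / len(covered) if covered else 0.0
        shares = [1 / sum(rule_ok(r, e) for r in population) for e in matched]
        util = sum(shares) / len(shares) if shares else 0.0
        scores.append((acc * util, idx, acc, util))
    # Step 3: sort table, descending by accuracy * utility (stable sort).
    scores.sort(key=lambda s: s[0], reverse=True)
    # Steps 4 to 6: coverage on the remaining instances, then remove them.
    remaining = list(data)
    fitness = [0.0] * len(population)
    for _, idx, acc, util in scores:
        rule = population[idx]
        cov = sum(premise_ok(rule, e) for e in remaining)
        fitness[idx] = cov * acc * util
        remaining = [e for e in remaining if not premise_ok(rule, e)]
    return fitness


rules = [
    ({"QUALITY": {"poor"}}, "uacc"),
    ({"SERVICE": {"good"}}, "acc"),
    ({"SERVICE": {"bad"}}, "uacc"),
]
fitness = evaluate_fitness(rules, DATA)
```

Because covered instances are removed after each rule, a rule that comes later in the sort table is rewarded only for the instances that earlier rules left uncovered.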


2.4 Genetic operators

2.4.1 Selection operator


We adopt the standard selection policy, that is, the fitness scaling method together with the roulette selection method.


2.4.2 Crossover operator


The crossover operator plays a very important role. It preserves, to a certain extent, the features of the best individuals within the original population, while allowing the algorithm to search new regions of the gene space, so that the new individuals within the population are diverse. As described in the literature, the crossover operator uses single-point crossover, that is, a crossover point is set randomly within the binary gene string of the individual, and the parents' genomes are crossed at that point.


2.4.3 Mutation operator


The mutation operator changes the gene value at some gene positions of the binary gene string of an individual in the population, such as 1 → 0 and 0 → 1.
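
The three operators can be sketched in Python on binary strings. This is a minimal illustration, not the authors' implementation: fitness scaling is omitted, the function names are hypothetical, and the random number generator is passed in explicitly so that runs are repeatable.

```python
import random


def roulette_select(population, fitness, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitness)
    if total <= 0:
        return rng.choice(population)  # fall back to uniform choice
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, f in zip(population, fitness):
        running += f
        if pick < running:
            return individual
    return population[-1]


def single_point_crossover(a, b, rng):
    """Swap the tails of two equal-length genomes at a random cut point."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(genome, pm, rng):
    """Flip each bit independently with probability pm."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < pm else bit
        for bit in genome
    )


rng = random.Random(0)
child1, child2 = single_point_crossover("0000000000", "1111111111", rng)
flipped = mutate("0101", 1.0, rng)
unchanged = mutate("0101", 0.0, rng)
```

Note that a mutated or crossed genome may no longer encode a meaningful rule (for example, an attribute coded entirely with 0s), so an implementation would need some validity handling; the paper does not describe how this is done.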


2.5 Algorithm description



The flow of the data mining algorithm based on the genetic algorithm is as follows.

Begin

Input: Test data set $U$ and all control parameters of the genetic algorithm

Output: The optimized set of evolved classification rules

Step 1: Initialize the population: N genomes (binary gene strings) are generated randomly and processed so that they represent valid rules.

Step 2: Calculate the fitness value of each individual (rule) within the current population.

Step 3: If the current generation number has reached the preset maximum number of generations, or certain control parameters meet their preset conditions, go to Step 7; otherwise continue.

Step 4: Apply selection, crossover and mutation to the current generation to produce the offspring population.

Step 5: Replace the low-fitness individuals of the parent population with the high-fitness individuals of the offspring population to form the new generation.

Step 6: Go to Step 2.

Step 7: Output the optimized rule set.

End
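
The overall flow can be sketched as a compact Python loop. It is a simplified illustration under several assumptions: the fitness function is supplied by the caller and must be non-negative, Step 5 is realized by keeping the fittest individuals among parents and offspring, only the generation limit is used as the stopping condition, and `count_ones` is a toy stand-in for the rule fitness of Section 2.3. The defaults mirror the control parameters quoted in Section 3.

```python
import random


def run_ga(fitness_fn, genome_len, pop_size=128, generations=500,
           pc=0.9, pm=0.03, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial population of binary gene strings.
    population = ["".join(rng.choice("01") for _ in range(genome_len))
                  for _ in range(pop_size)]
    for _ in range(generations):  # Step 3: stop at the generation limit
        # Step 2: fitness of every individual.
        scores = [fitness_fn(g) for g in population]
        # Small offset keeps the weights valid when all scores are 0.
        weights = [s + 1e-9 for s in scores]
        # Step 4: roulette selection, single-point crossover, mutation.
        children = []
        while len(children) < pop_size:
            a, b = rng.choices(population, weights=weights, k=2)
            if rng.random() < pc:
                point = rng.randint(1, genome_len - 1)
                a, b = a[:point] + b[point:], b[:point] + a[point:]
            for child in (a, b):
                children.append("".join(
                    ("1" if bit == "0" else "0") if rng.random() < pm else bit
                    for bit in child))
        # Step 5: keep the fittest among parents and offspring.
        population = sorted(population + children,
                            key=fitness_fn, reverse=True)[:pop_size]
    return population  # Step 7: final population


def count_ones(genome):
    """Toy fitness used only to exercise the loop."""
    return genome.count("1")


final_population = run_ga(count_ones, genome_len=10, pop_size=20, generations=30)
initial_population = run_ga(count_ones, genome_len=10, pop_size=20, generations=0)
```

Because the best individuals are always carried over in this variant, the best fitness in the population never decreases from one generation to the next.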




3. Simulation on example


As an example, the acceptance of a commodity by customers is uacc, acc, good or vgood, and is determined by price, quality and service, where the price is high, moderate or low, the quality is high, normal or poor, and the service is good or bad. The sample data are given in the following table:


              Featured attribute of commodity
Example no.   Price       Quality    Service    Acceptance (Classification)
1             High        Poor       Bad        Uacc
2             Moderate    Normal     Bad        Uacc
3             Low         High       Good       Vgood
4             High        High       Good       Good
5             Moderate    High       Good       Good
6             Moderate    Normal     Good       Acc
7             High        Normal     Good       Acc
8             High        Poor       Good       Uacc
9             Moderate    Poor       Bad        Uacc
10            Low         Poor       Bad        Uacc
11            High        Normal     Good       Acc
12            Moderate    Normal     Good       Acc
13            Low         Normal     Good       Good


We apply the genetic algorithm to the above example with crossover probability Pc=0.9, mutation probability Pm=0.03, population size P_SIZE=128 and 500 iterations, and mine the following two valid rules:

Rule 1: IF QUALITY(poor) THEN CLASS=uacc (fitness=36.00), meaning that a commodity of poor quality is not accepted by customers.

Rule 2: IF SERVICE(good) THEN CLASS=acc (fitness=7.88), meaning that a commodity with good service is accepted by customers.

The above example only shows that the genetic algorithm is valid for mining classification knowledge rules. In data mining over general and professional knowledge, the given data set or metadata (meta-knowledge) may contain millions of records.


4. Conclusion


This paper presents detailed methods for determining the parameters and operations of the genetic algorithm when it is applied to the mining of knowledge classification rules in data mining, including the genome coding method and the design of the genetic operators, and especially the definition and realization of the fitness function. Finally, the data mining algorithm based on the genetic algorithm is given. Because the genetic algorithm is highly robust, searches globally and is inherently parallel, in data mining it can evolve from meta-knowledge (metadata) toward more satisfactory knowledge and rules, while previously unknown knowledge may be generated by the mutation operator, which amounts to a search for hidden knowledge, as illustrated by the example described in this paper.


REFERENCES


1. Tubao Ho, Trongdung Nguyen, Ducdung Nguyen, Saori Kawasaki, Visualization Support for User Centered Model Selection in Knowledge Discovery and Data Mining, International Journal of Artificial Intelligence Tools, 2001, 10(4), 691-713.

2. Susan E. George, A Visualization and Design Tool (AVID) for Data Mining with the Self-Organizing Feature Map, International Journal of Artificial Intelligence Tools, 2000, 9(3), 369-375.

3. Tzung-Pei Hong, Chan-Sheng Kuo, Sheng-Chai Chi, Trade-off Between Computation Time and Number of Rules for Fuzzy Mining from Quantitative Data, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2001, 9(5), 587-604.

4. Vladimir Estivill-Castro, Jianhua Yang, Clustering Web Visitors by Fast, Robust and Convergent Algorithms, International Journal of Foundations of Computer Science, 2002, 13(4), 497-520.

5. R. Félix, T. Ushio, Binary Encoding of Discernibility Patterns to Find Minimal Coverings, International Journal of Software Engineering and Knowledge Engineering, 2002, 12(1), 1-18.

6. D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley Publishing Company, 1989.

7. Ai Lirong, He Huacan, Summarise on Genetic Algorithms, Journal of Computer Application and Research, 1997, 14(4), 3-6.

8. Xiao Yong, Chen Yiyun, Constructing Decision Trees by Using Genetic Algorithm, Journal of Computer Research and Development, 1998, 35(1), 49-52.

9. Xiaomin Zhong, Eugene Santos, Directing Genetic Algorithms for Probabilistic Reasoning Through Reinforcement Learning, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2000, 8(2), 167-185.

10. Gary William Grewal, Thomas Charles Wilson, An Enhanced Genetic Algorithm for Solving the High-Level Synthesis Problems of Scheduling, Allocation, and Binding, International Journal of Computational Intelligence and Applications, 2001, 10(1), 91-110.