# Data Mining with Artificial Evolution

Nov 20, 2013

Helen Johnson

Anna Kwiatkowska

David Sweeney

Panagiotis Tzionas

Data mining

A multi-objective optimisation problem

Aims to extract valid, novel and interesting rules (laws) from data.

Validity

Support

Confidence

Law generality

Law accuracy

Data provided by V. Athias and C. Jeandel

“The flows of particles of various sizes in the austral seas”

Details of the data set:

Particles in four size groups, measured at two depths: 2000 m and 3000 m

A total of 51 measurements over a period of a few hundred days

The ‘real’ data problem

[Figure: data plotted against Example number. Legend: sm 2000, sm 3000, med 2000, med 3000, lg 2000, last 2000, last 3000.]

Concentration

Dissolved phase

Suspended particles

Sinking particles

OBSERVATIONS

Agglomeration

Sinking

TRANSFORMATIONS

AIM

Model interactions

Interactions between particles

Parameters

Methodology

Target = LAW

Phenotype: a linear combination of terms

$1.2\,x_2 + x_3 \sin(x_1) + 3.6\,x_1 x_2$

Genotype: coding of the phenotype

(1.2, 0, 2, 3), (1, 2, 3, 1), (3.6, 1, 1, 2)

where 0 = $x_i$; 1 = $x_i \cdot x_j$; 2 = $x_i \sin x_j$
Mixed integer/real-valued representation; hybrid ES/GA

Selection: the problem is to find a set of laws (Michigan, Pittsburgh, Universal Suffrage)
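The genotype-to-phenotype mapping on this slide can be sketched in Python. The gene layout (coefficient, construct code, i, j) is inferred by matching the example genotype against the example phenotype; it is not stated in the slides. The names `Gene`, `term_value` and `evaluate_law` are hypothetical, variable indices are assumed to be 1-based, and j is assumed to be ignored for construct 0.

```python
import math
from typing import Sequence, Tuple

# Assumed gene layout: (coefficient, construct code, i, j)
Gene = Tuple[float, int, int, int]

# The example genotype from the slide.
EXAMPLE_GENOTYPE = [(1.2, 0, 2, 3), (1.0, 2, 3, 1), (3.6, 1, 1, 2)]


def term_value(gene: Gene, x: Sequence[float]) -> float:
    """Evaluate one term of the law for the input vector x (x_1 is x[0])."""
    coef, code, i, j = gene
    xi, xj = x[i - 1], x[j - 1]
    if code == 0:  # x_i (j is unused)
        return coef * xi
    if code == 1:  # x_i * x_j
        return coef * xi * xj
    if code == 2:  # x_i * sin(x_j)
        return coef * xi * math.sin(xj)
    raise ValueError(f"unknown construct code: {code}")


def evaluate_law(genotype: Sequence[Gene], x: Sequence[float]) -> float:
    """Phenotype value: the sum of all decoded terms."""
    return sum(term_value(gene, x) for gene in genotype)


if __name__ == "__main__":
    # Decodes to 1.2*x2 + x3*sin(x1) + 3.6*x1*x2 at an arbitrary point.
    print(evaluate_law(EXAMPLE_GENOTYPE, (0.5, 2.0, 3.0)))
```

Keeping the coefficient real and the other three fields integer is what makes the representation mixed: the two kinds of field can then be varied by different operators.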

[Figure: Result against Example, with plateaux P1 and P2 marked.]

Assessing the fitness of one law

The law is calculated for each example

The results are sorted

Plateaux are identified

Fitness function is calculated

Testing a simple fitness function

Fitness function = $\sum_i \text{length}(P_i)$
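The four assessment steps and the simple fitness function can be sketched as follows. The slides do not define a plateau precisely, so this sketch assumes a plateau is a run of sorted results whose spread stays within a threshold and that contains at least `min_length` results. The function names and the toy data are hypothetical and are not taken from the real data set.

```python
from typing import Callable, List, Sequence


def find_plateaux(sorted_values: Sequence[float], threshold: float,
                  min_length: int = 2) -> List[List[float]]:
    """Group sorted values into runs whose spread is at most `threshold`."""
    plateaux: List[List[float]] = []
    current: List[float] = []
    for value in sorted_values:
        if current and value - current[0] > threshold:
            # The run is broken: keep it only if it is long enough.
            if len(current) >= min_length:
                plateaux.append(current)
            current = []
        current.append(value)
    if len(current) >= min_length:
        plateaux.append(current)
    return plateaux


def simple_fitness(law: Callable[[Sequence[float]], float],
                   examples: Sequence[Sequence[float]],
                   threshold: float, min_length: int = 2) -> int:
    """Sum of plateau lengths for one law."""
    results = sorted(law(x) for x in examples)          # steps 1 and 2
    plateaux = find_plateaux(results, threshold, min_length)  # step 3
    return sum(len(p) for p in plateaux)                # step 4


def product_law(a: Sequence[float]) -> float:
    """Candidate law A0 * A1."""
    return a[0] * a[1]


# Toy examples (made up): four obey A0 * A1 = 0.156, two do not.
toy_examples = [(0.5, 0.312), (0.25, 0.624), (1.0, 0.156),
                (2.0, 0.078), (1.0, 1.0), (3.0, 2.0)]

if __name__ == "__main__":
    print(simple_fitness(product_law, toy_examples, threshold=0.01))
```

Measuring the spread from the first value of a run, rather than between neighbours, is a design choice in this sketch: it stops a slowly rising curve from being counted as one long plateau.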

The known law ($A_0 \cdot A_1 = \text{cst}$).

Found laws

1) $-0.37\,A_0 \cdot A_1 \quad 0.36\,A_2/A_2 + 0.07\,A_0/A_0$

2) $-0.04\,A_0 \cdot A_0 \quad 0.008\,A_1 \cdot A_2 \quad 0.77\,A_1 \cdot A_0$

[Figures: two plots of Result against Example, each annotated $v_j$ and each labelled Fitness=8.]

Problem with the fitness function:

The new fitness function

Identifying the maximum length plateau for each example.

[Figure: Result against Example, annotated $v_j$.]

Correct law: $A_0 \cdot A_1 = 0.156$

One of our best results: $A_0 \cdot A_1 = 0.12138$

Fitness=8

[Figure: Result against Example, annotated $v_j$.]

Fitness=64

$$F = \sum_{i=1}^{\text{no. of plateau examples}} \text{length}(P_i)$$

with $P_i$ the maximum-length plateau for example $i$.
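One reading of the new fitness function, consistent with the Fitness=8 and Fitness=64 annotations, is that every example lying in a plateau contributes the length of that plateau, so a plateau of length n contributes n squared. That reading is an assumption, as are the function names below. The sketch compares it with the simple sum of plateau lengths.

```python
from typing import Sequence


def simple_fitness_from_lengths(plateau_lengths: Sequence[int]) -> int:
    """Simple fitness: sum of the plateau lengths."""
    return sum(plateau_lengths)


def per_example_fitness(plateau_lengths: Sequence[int]) -> int:
    """Assumed new fitness: each of the n examples in a plateau adds n."""
    return sum(n * n for n in plateau_lengths)


if __name__ == "__main__":
    # Two plateaux of four examples versus one plateau of eight examples.
    for lengths in ([4, 4], [8]):
        print(lengths,
              simple_fitness_from_lengths(lengths),
              per_example_fitness(lengths))
```

Under this reading the simple function scores both cases 8, while the per-example sum scores them 32 and 64, so a law that places all examples on a single plateau is rewarded.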
The tautology problem

A tautology:

$A_0 - A_0 = 0$

$A_1 / A_1 = 1$

A tautology provides no knowledge.

The derived laws must be checked for tautologies.

Apply laws to a random data set.

If the law fits all the data then it is a tautology.
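The random-data check can be sketched as follows. The slides do not say how "fits all the data" is measured, so this sketch assumes a law fits when its value varies by no more than a tautology threshold across every random example. The sampling range, the sample count and the name `is_tautology` are also assumptions.

```python
import random
from typing import Callable, Sequence


def is_tautology(law: Callable[[Sequence[float]], float], n_vars: int,
                 threshold: float, n_samples: int = 200,
                 low: float = 0.1, high: float = 10.0,
                 seed: int = 0) -> bool:
    """Return True if the law stays constant on uniformly random inputs."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        # Random example; the range avoids zero so that ratios are defined.
        x = [rng.uniform(low, high) for _ in range(n_vars)]
        values.append(law(x))
    # A law that holds on arbitrary data carries no knowledge.
    return max(values) - min(values) <= threshold


if __name__ == "__main__":
    print(is_tautology(lambda a: a[0] - a[0], n_vars=2, threshold=1e-9))
    print(is_tautology(lambda a: a[0] * a[1], n_vars=2, threshold=1e-9))
```

A genuine law such as $A_0 \cdot A_1 = \text{cst}$ holds only on the measured data, so its value spreads widely on random inputs and it passes the check.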

Lessons from preliminary experiments

1. Population size: no influence on the laws

2. Probability of crossover:

Decrease from 0.6 to 0.4: many tautologies

So decrease “tautology threshold”: elimination of some tautologies

3. Probability of mutation:

Decrease from 0.1 to 0.05: improvement in laws

4. Plateau threshold:

Decreasing the threshold in steps: improved laws
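The crossover and mutation probabilities tuned above act on the mixed integer/real genotype. The slides do not describe the operators themselves, so the following is only a sketch under assumptions: Gaussian perturbation of the real coefficient (ES-style), uniform resampling of the integer fields (GA-style), one-point crossover between gene lists, and a gene layout of (coefficient, construct code, i, j). All names and the `sigma` default are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Gene = Tuple[float, int, int, int]  # assumed: (coefficient, construct code, i, j)


def mutate(genotype: Sequence[Gene], p_mutation: float, n_vars: int,
           n_constructs: int = 3, sigma: float = 0.1,
           rng: Optional[random.Random] = None) -> List[Gene]:
    """Mutate each field of each gene independently with p_mutation."""
    if rng is None:
        rng = random.Random()
    mutated = []
    for coef, code, i, j in genotype:
        if rng.random() < p_mutation:
            coef += rng.gauss(0.0, sigma)        # real part: Gaussian step
        if rng.random() < p_mutation:
            code = rng.randrange(n_constructs)   # integer part: resample
        if rng.random() < p_mutation:
            i = rng.randint(1, n_vars)
        if rng.random() < p_mutation:
            j = rng.randint(1, n_vars)
        mutated.append((coef, code, i, j))
    return mutated


def crossover(a: Sequence[Gene], b: Sequence[Gene], p_crossover: float,
              rng: Optional[random.Random] = None
              ) -> Tuple[List[Gene], List[Gene]]:
    """One-point crossover, applied with probability p_crossover."""
    if rng is None:
        rng = random.Random()
    shortest = min(len(a), len(b))
    if shortest < 2 or rng.random() >= p_crossover:
        return list(a), list(b)
    cut = rng.randint(1, shortest - 1)
    return (list(a[:cut]) + list(b[cut:]),
            list(b[:cut]) + list(a[cut:]))


if __name__ == "__main__":
    rng = random.Random(42)
    parent_a = [(1.2, 0, 2, 3), (1.0, 2, 3, 1), (3.6, 1, 1, 2)]
    parent_b = [(0.5, 1, 1, 1), (2.0, 0, 2, 2), (0.1, 2, 3, 3)]
    child_a, child_b = crossover(parent_a, parent_b, p_crossover=0.4, rng=rng)
    print(mutate(child_a, p_mutation=0.05, n_vars=3, rng=rng))
```

The usage lines plug in the reduced probabilities from the list (0.4 and 0.05); whether those values suit another data set would need to be re-tested.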

[Figure: plot generated after optimisation, Result against Example.]

Conclusions

Powerful technique for finding knowledge in data

The fitness function is crucial

Tuning of the algorithm is data dependent

No single optimum algorithm for a specific dataset

Questions arising

Pre-processing of data?

Criteria for defining a plateau?

Number of constructs and type of constructs?

How important is law interpretation?