20 Nov 2013

Data Mining with Artificial Evolution

Helen Johnson

Anna Kwiatkowska

David Sweeney

Panagiotis Tzionas


Problem leader: Michele Sebag

Team leader: Michael Herdy


Data mining


A multi-objective optimisation problem


Aims to extract valid, novel and interesting
rules (laws) from data.


Validity

Support

Confidence

Law generality

Law accuracy

Data provided by V. Athias and C. Jeandel

“The flows of particles of various sizes in the austral seas”

Details of the data set:

Particles in four size groups, measured at two depths: 2000 m and 3000 m

A total of 51 measurements over a period of a few hundred days

The ‘real’ data problem

[Figure: plot against Example; legend: sm 2000, sm 3000, med 2000, med 3000, lg 2000, last 2000, last 3000]

Concentration

Dissolved phase

Suspended particles

Sinking particles

OBSERVATIONS

Adsorption

Agglomeration

Sinking

TRANSFORMATIONS

AIM

Model interactions

Interactions between particles

Parameters

Methodology


Target = LAW

Phenotype: a linear combination of terms

$1.2\,x_2 + x_3 \sin(x_1) + 3.6\,x_1 x_2$





Genotype: coding of the phenotype

(1.2,0,2,3), (1,2,3,1), (3.6,1,1,2)

where 0 = $x_i$; 1 = $x_i \cdot x_j$; 2 = $x_i \sin x_j$
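The coding above can be illustrated with a minimal Python sketch. It assumes each gene is a tuple (coefficient, term type, i, j), a layout inferred from the slide's example rather than stated on it, and that the variable indices are 1-based. The names `eval_term` and `eval_law` are hypothetical.

```python
import math

def eval_term(gene, x):
    """Evaluate one gene (coef, kind, i, j) on the variable vector x.

    Term kinds as listed on the slide:
    0 = x_i, 1 = x_i * x_j, 2 = x_i * sin(x_j).
    The tuple layout is an assumption inferred from the slide's example.
    """
    coef, kind, i, j = gene
    xi, xj = x[i - 1], x[j - 1]  # assumed 1-based indices, as in x_1, x_2, x_3
    if kind == 0:
        return coef * xi          # j is unused for this term type
    if kind == 1:
        return coef * xi * xj
    if kind == 2:
        return coef * xi * math.sin(xj)
    raise ValueError(f"unknown term type: {kind}")

def eval_law(genotype, x):
    """Phenotype value: the sum of all decoded terms."""
    return sum(eval_term(gene, x) for gene in genotype)

# Genotype from the slide and an arbitrary test point (illustrative values).
genotype = [(1.2, 0, 2, 3), (1, 2, 3, 1), (3.6, 1, 1, 2)]
x = [0.5, 2.0, 3.0]

# The phenotype written out directly, for comparison.
direct = 1.2 * x[1] + x[2] * math.sin(x[0]) + 3.6 * x[0] * x[1]
print(eval_law(genotype, x), direct)
```

Both printed values should agree, which is a quick check that the decoding matches the phenotype written on the slide.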



Mixed integer-real valued representation: hybrid ES/GA
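One way a mixed representation could be mutated is sketched below. This is an assumption, not the operator used by the team: the real-valued coefficient gets a Gaussian perturbation (ES style), while the integer fields are resampled uniformly (GA style). The function name `mutate`, the parameter `sigma`, and the gene layout (coefficient, term type, i, j) are all hypothetical.

```python
import random

def mutate(genotype, n_vars, p_mut=0.05, sigma=0.1, rng=random):
    """Return a mutated copy of a genotype made of (coef, kind, i, j) genes.

    Assumed scheme: each field mutates independently with probability p_mut.
    """
    mutated = []
    for coef, kind, i, j in genotype:
        if rng.random() < p_mut:
            coef = coef + rng.gauss(0.0, sigma)   # ES-style real-valued step
        if rng.random() < p_mut:
            kind = rng.randrange(3)               # term types 0, 1, 2
        if rng.random() < p_mut:
            i = rng.randint(1, n_vars)            # 1-based variable index
        if rng.random() < p_mut:
            j = rng.randint(1, n_vars)
        mutated.append((coef, kind, i, j))
    return mutated

parent = [(1.2, 0, 2, 3), (1.0, 2, 3, 1), (3.6, 1, 1, 2)]
child = mutate(parent, n_vars=3, p_mut=0.05, rng=random.Random(0))
print(child)
```

Passing a seeded `random.Random` instance keeps runs reproducible, which helps when comparing parameter settings.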



Selection: the problem is to find a set of laws (Michigan, Pittsburgh, Universal Suffrage)

[Figure: Result against Example (1 to 51), with plateaux P1 and P2 marked]

Assessing the fitness of one law



The law is calculated for each example



The results are sorted



Plateaux are identified



Fitness function is calculated

Testing a simple fitness function


Fitness function = $\sum_i \operatorname{length}(P_i)$
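The four assessment steps and the simple fitness function can be sketched as follows. The slides do not define a plateau precisely, so this sketch assumes one: a run of sorted results in which consecutive values differ by at most `threshold`, containing at least `min_length` examples. Both parameters and the function names are hypothetical.

```python
def find_plateaux(results, threshold, min_length=2):
    """Sort the results and group them into plateaux.

    Assumed definition: consecutive sorted values no more than `threshold`
    apart belong to the same plateau; shorter runs are discarded.
    """
    values = sorted(results)
    if not values:
        return []
    plateaux, current = [], [values[0]]
    for v in values[1:]:
        if v - current[-1] <= threshold:
            current.append(v)
        else:
            if len(current) >= min_length:
                plateaux.append(current)
            current = [v]
    if len(current) >= min_length:
        plateaux.append(current)
    return plateaux

def simple_fitness(results, threshold):
    """Simple fitness: the sum of the lengths of all plateaux."""
    return sum(len(p) for p in find_plateaux(results, threshold))

# Synthetic examples (not the real data set): four satisfy A0 * A1 = 0.156.
data = [(0.2, 0.78), (0.4, 0.39), (0.5, 0.312), (0.8, 0.195), (1.0, 1.0)]
results = [a0 * a1 for a0, a1 in data]   # step 1: calculate the law per example
print(simple_fitness(results, threshold=1e-6))
```

Without a minimum plateau length every example would form its own plateau and the sum would always equal the number of examples, which is why the sketch discards runs of length one.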


The known law ($A_0 \cdot A_1 = \text{const}$).


Found laws

1) $-0.37\,A_0 \cdot A_1 - 0.36\,A_2/A_2 + 0.07\,A_0/A_0$

2) $-0.04\,A_0 \cdot A_0 - 0.008\,A_1 \cdot A_2 - 0.77\,A_1 \cdot A_0$

Problem with the fitness function:

[Figures: two plots of Result against Example with plateau level $v_j$, both labelled Fitness=8]

The new fitness function

Identifying the maximum length plateau for each example.

[Figures: two plots of Result against Example with plateau level $v_j$, labelled Fitness=8 and Fitness=64]

Correct law: $A_0 \cdot A_1 = 0.156$

One of our best results: $A_0 \cdot A_1 = 0.12138$

$$F = \sum_{i=1}^{\text{no. of examples}} \operatorname{length}(P_i)$$
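A per-example fitness of this kind might be computed as in the sketch below. It rests on one possible reading of the slide: for each example, take the longest run of sorted results that contains it and whose spread is at most `threshold`, then sum these lengths over all examples. The function names and the `threshold` parameter are hypothetical.

```python
def max_plateau_lengths(results, threshold):
    """For each sorted result, the length of the longest plateau containing it.

    Assumed definition: a plateau is a run of sorted values whose
    maximum and minimum differ by at most `threshold`.
    """
    values = sorted(results)
    n = len(values)
    best = [1] * n
    for a in range(n):
        # Extend the window starting at a as far as the spread allows.
        b = a
        while b + 1 < n and values[b + 1] - values[a] <= threshold:
            b += 1
        for i in range(a, b + 1):
            best[i] = max(best[i], b - a + 1)
    return best

def per_example_fitness(results, threshold):
    """Sum over examples of the length of each example's longest plateau."""
    return sum(max_plateau_lengths(results, threshold))

one_plateau = [1.0] * 8                                  # one plateau of eight
four_pairs = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0]    # four plateaux of two
print(per_example_fitness(one_plateau, 0.1))   # 64
print(per_example_fitness(four_pairs, 0.1))    # 16
```

Under this reading, a single long plateau scores far more than several short ones covering the same number of examples, whereas a plain sum of plateau lengths would rate both cases equally.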
The tautology problem

A tautology:

$A_0 - A_0 = 0$

$A_1 / A_1 = 1$




A tautology provides no knowledge.



The derived laws must be checked for tautologies.



Apply laws to a random data set.



If the law fits all the data then it is a tautology.
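The random-data check can be sketched as follows. This is a minimal illustration under stated assumptions: a law is passed as a Python function of the variable vector, random inputs are drawn uniformly from a positive range, and "fits all the data" is taken to mean that every result lies within a tolerance `tol` of the others. The function name, sampling range, and tolerance are hypothetical.

```python
import random

def is_tautology(law, n_vars, n_samples=200, tol=1e-9, seed=0):
    """Flag a law as a tautology if it is constant on random data.

    The inputs carry no real relationship between variables, so a law that
    still holds for every sample provides no knowledge about the data.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_samples):
        # Positive range chosen to avoid division by zero (an assumption).
        x = [rng.uniform(0.1, 10.0) for _ in range(n_vars)]
        results.append(law(x))
    return max(results) - min(results) <= tol

print(is_tautology(lambda a: a[0] - a[0], n_vars=3))   # True
print(is_tautology(lambda a: a[1] / a[1], n_vars=3))   # True
print(is_tautology(lambda a: a[0] * a[1], n_vars=3))   # False
```

A stricter or looser tolerance changes how many near-constant laws are rejected.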

Lessons from preliminary experiments

1. Population size: no influence on the laws

2. Probability of crossover:
   Decrease from 0.6 to 0.4: many tautologies
   So decrease “tautology threshold”: elimination of some tautologies.

3. Probability of mutation:
   Decrease from 0.1 to 0.05: improvement in laws

4. Plateau threshold:
   Decreasing the threshold in steps: improved laws


[Figure: Plot generated after optimisation (Result against Example)]

Conclusions



Powerful technique for finding knowledge in data



The fitness function is crucial



Tuning of the algorithm is data-dependent



No single optimum algorithm for a specific dataset



Questions arising

Pre-processing of data?

Criteria for defining a plateau?

Number of constructs and type of constructs?

How important is law interpretation?