An Interval-value Clustering Approach to Data Mining



Yunfei Yin
Department of Computer Science
Guangxi Normal University
Guilin 541004, P.R. China
yinyunfei@vip.sina.com

Faliang Huang
Institute of Computer Application
Guangxi Normal University
Guilin 541004, P.R. China
huangfliang@163.com







ABSTRACT

Interval-value clustering algorithms are a product of the deep development of computational mathematics, and they are widely used in engineering, commerce, aviation and so on. In order to apply interval methods and theory in practice, and to find more valuable knowledge when mining and analyzing enterprise data, a data mining method based on interval clustering is provided. After introducing three kinds of interval clustering methods, we offer a method for mining association rules in interval databases. Compared with the traditional data mining methods, this method is more accurate, more effective and more useful, so there is much room for its further development, and it promises real practical and social significance in commercial information mining and decision making.


KEYWORDS

Interval-value, clustering, interval database, interval distance, data mining


INTRODUCTION

Since Agrawal R. put forward the idea of mining Boolean association rules [1], data mining has been a fairly active branch. During the past ten years, Boolean association rule mining has received considerable attention from well-known authorities and scholars, who have published a great many papers about it. For example, bibliography [2][12] brought forward a fast algorithm for mining Boolean association rules, which can be used to solve commodity arrangement in supermarkets; bibliography [13] put forward the idea of causality association rule mining; bibliography [9] offered a useful algorithm for mining negative association rules. Boolean association rule mining tries to find the regular patterns of consumer behavior in retail databases, and the mined rules can explain such patterns as "if people buy creamery, they will also buy sugar". However, binary association rule mining restricts the application area to binary data.

In order to find more valuable association rules, bibliography [6] mined out clustering association rules by a clustering approach. However, because incomplete information and ambiguity between one thing and another always exist, using interval numbers to represent an object sometimes becomes the only feasible way. For example, "50 percent of male people aged 40 to 65 who earn 80,000 to 1,000,000 each year own at least two villas". In such a case, we can only use the interval clustering model to solve the problem. Based on this, this article introduces a model method that is fit for mining such database information. As proven by multiple real examples, this method can find more valuable association rules.

This paper is organized as follows. In the following section, we describe three interval-value clustering methods and give relevant examples. In section 3, we offer two interval-value mining methods, for the common database and the interval database respectively. In section 4, we present the results of three experiments. We give a brief conclusion about the research in section 5.


INTERVAL-VALUE CLUSTERING METHOD

Interval-value clustering algorithms [7] are a product of the deep development of computational mathematics, and they are widely used in many fields such as engineering and commerce [13]. Data mining based on interval-value clustering is one application of this model, and the model involves the following three interval-value clustering methods: Number-Interval clustering, Interval-Interval clustering and Matrix-Interval clustering.

[Figure 1 Netting: objects 1-5 linked by solid lines (nodes) and broken lines (similar nodes)]


Number-Interval Clustering Method

Suppose $x_1, x_2, \ldots, x_n$ are n objects whose actions are characterized by some interval values. According to the traditional clustering similarity formula [6], we can get the correlative similarity matrix:

$$R = (r_{i,j}) = \begin{pmatrix} 1 & & & \\ [t_{21}^-, t_{21}^+] & 1 & & \\ \vdots & \vdots & \ddots & \\ [t_{n1}^-, t_{n1}^+] & [t_{n2}^-, t_{n2}^+] & \cdots & 1 \end{pmatrix}$$

Matrix R is a symmetric matrix, where $r_{ij} = [t_{ij}^-, t_{ij}^+]$ is the similarity between $x_i$ and $x_j$, $i, j = 1, 2, \ldots, n$.

There are three steps in the Number-Interval clustering method [14]: (1) Netting: at the diagonal of R, a serial number is clearly labeled. If $t_{ij}^- > \lambda_0$ (the threshold $\lambda_0$ is given by the user), the element $[t_{ij}^-, t_{ij}^+]$ is replaced by "●"; if $t_{ij}^+ < \lambda_0$, the element $[t_{ij}^-, t_{ij}^+]$ is replaced by a space; if $\lambda_0 \in [t_{ij}^-, t_{ij}^+]$, the element is replaced by "#". Call "●" a node and "#" a similar node. We first draw a vertical and a horizontal line from each node to the diagonal, and use broken lines to draw the vertical and horizontal lines from each similar node to the diagonal, as described in Figure 1. (2) Relatively certain classification: for each node, bind the vertical and horizontal lines which go through it, and the elements at the ends of these lines are classified into the same set; finally, the rest are classified as the last set. (3) Similar fuzzy classification: for each similar node, bind it to the relatively certain class which is passed by the vertical or horizontal line starting from that similar node.
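The netting step can be sketched in code. The following Python fragment is a minimal illustration of our own (the function name `netting` and the "*" marker standing in for the node symbol are our choices, not the paper's), applied to the 5-object similarity matrix used later in this section:

```python
# Netting step of Number-Interval clustering (illustrative sketch).
# R is a lower-triangular matrix of intervals (lo, hi); lam0 is the user threshold.

def netting(R, lam0):
    """Mark each off-diagonal interval as a node '*' (certainly similar),
    a similar node '#' (the threshold falls inside the interval),
    or ' ' (certainly dissimilar at this threshold)."""
    n = len(R)
    marks = [["1" if i == j else " " for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i):
            lo, hi = R[i][j]
            if lo > lam0:
                marks[i][j] = "*"   # node: objects i and j certainly fall in one class
            elif lam0 <= hi:
                marks[i][j] = "#"   # similar node: membership is ambiguous
            # else: leave blank; the objects are not similar at this threshold
    return marks

# The 5-object example matrix from this section (lower triangle only):
R = [
    [(1, 1)],
    [(0.8, 0.9), (1, 1)],
    [(0.5, 0.7), (0.6, 0.72), (1, 1)],
    [(0.6, 0.61), (0.65, 0.70), (0.78, 0.81), (1, 1)],
    [(0.7, 0.81), (0.71, 0.83), (0.7, 0.85), (0.75, 0.92), (1, 1)],
]
for row in netting(R, 0.75):
    print(" ".join(row))
```

With $\lambda_0 = 0.75$ this marks objects 1-2 and 3-4 as nodes, while object 5 is linked to the others only through similar nodes, matching Figure 1.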







As can be seen from Figure 1, relatively certain classification can clearly classify the objects, while similar fuzzy classification cannot. For example, the object set U = {1,2,3,4,5} can be classified into two sets as A = {1,2,[5]} and B = {3,4,[5]}. However, which set does object 5 actually belong to? To answer this, we introduce the concept of similar confidence.

Definition 1. (similar confidence): Suppose $A = \{x_1, x_2, \ldots, x_n, [x]\}$, and $[t_i^-, t_i^+]$ is the similar coefficient of x and $x_i$. Then

$$\beta = \min\left\{ \frac{t_1^+ - \lambda_0}{t_1^+ - t_1^-},\ \frac{t_2^+ - \lambda_0}{t_2^+ - t_2^-},\ \ldots,\ \frac{t_n^+ - \lambda_0}{t_n^+ - t_n^-} \right\}$$

is called the similar confidence of x similarly attributing to set A. If x is similarly attributed to $A_1, A_2, \ldots, A_s$ at the same time, with confidences $\beta_1, \beta_2, \ldots, \beta_s$ respectively, take $\beta_j = \max\{\beta_1, \beta_2, \ldots, \beta_s\}$. When $\beta_j > 0.5$, we believe that x should be attributed to set $A_j$; when $\beta_j \le 0.5$, x should be classified into a set alone.

For example, in the above instance, the corresponding similar coefficients of the object set U are known as follows:

$$R = \begin{pmatrix} 1 & & & & \\ [0.8, 0.9] & 1 & & & \\ [0.5, 0.7] & [0.6, 0.72] & 1 & & \\ [0.6, 0.61] & [0.65, 0.70] & [0.78, 0.81] & 1 & \\ [0.7, 0.81] & [0.71, 0.83] & [0.7, 0.85] & [0.75, 0.92] & 1 \end{pmatrix}$$

and $\lambda_0 = 0.75$.

Then, the confidence of object 5 similarly attributing to A is:

$$\beta_A = \min\left\{ \frac{0.81 - 0.75}{0.81 - 0.7},\ \frac{0.83 - 0.75}{0.83 - 0.71} \right\} = \min\{0.55, 0.67\} = 0.55$$

The confidence of object 5 similarly attributing to B is:

$$\beta_B = \min\left\{ \frac{0.85 - 0.75}{0.85 - 0.7},\ \frac{0.92 - 0.75}{0.92 - 0.75} \right\} = \min\{0.67, 1\} = 0.67$$

Take $\beta = \max\{\beta_A, \beta_B\} = \beta_B = 0.67 > 0.5$, so object 5 should be attributed to class B.

That is to say, the final classification is: A = {1,2}, B = {3,4,5}.
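The similar-confidence computation of Definition 1 can be sketched in a few lines of Python. This is our own illustration (the function name `similar_confidence` is not from the paper), using object 5's coefficients from the example matrix:

```python
# Similar confidence of Definition 1 (illustrative sketch).
def similar_confidence(coeffs, lam0):
    """coeffs: interval similarity coefficients (lo, hi) between x and the
    members of a candidate set A; lam0: the clustering threshold.
    Returns min over i of (hi_i - lam0) / (hi_i - lo_i)."""
    return min((hi - lam0) / (hi - lo) for lo, hi in coeffs)

lam0 = 0.75
# Object 5's coefficients with A = {1, 2} and with B = {3, 4}:
beta_A = similar_confidence([(0.70, 0.81), (0.71, 0.83)], lam0)  # ~0.55
beta_B = similar_confidence([(0.70, 0.85), (0.75, 0.92)], lam0)  # ~0.67
best = max(beta_A, beta_B)   # reached by B
if best > 0.5:
    print("attach object 5 to the winning set")  # here: set B
```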


Interval-Interval Clustering Method

The Interval-Interval clustering method is the extension and generalization of the Number-Interval method; it expresses the threshold $\lambda$ of the Number-Interval method in the form of an interval.

The Interval-Interval clustering method also needs the three procedures of netting, relatively certain classification and similar fuzzy classification. In order to confirm which relatively certain set each similar node is finally attributed to, we extend the concept of similar confidence as follows:

Suppose $\lambda$ is the interval value $[\lambda_0^-, \lambda_0^+]$, $A = \{x_1, x_2, \ldots, x_n, [x]\}$, and $[t_i^-, t_i^+]$ is the similar coefficient of x and $x_i$. According to the following information formula:

$$\beta_i(\lambda) = \begin{cases} 1, & \text{if } t_i^- \ge \lambda \\ 1 + \dfrac{t_i^+ - \lambda}{t_i^+ - t_i^-}\log_2\dfrac{t_i^+ - \lambda}{t_i^+ - t_i^-}, & \text{if } t_i^- < \lambda \le t_i^+ \\ 0, & \text{if } t_i^+ < \lambda \end{cases}$$

work out $[\beta_i^-, \beta_i^+] = [\beta_i(\lambda_0^+), \beta_i(\lambda_0^-)]$.

Then, let $[\beta^-, \beta^+] = \min\{[\beta_1^-, \beta_1^+], [\beta_2^-, \beta_2^+], \ldots, [\beta_n^-, \beta_n^+]\}$, and call it the similar confidence of x similarly attributing to set A. If $\beta_1, \beta_2, \ldots, \beta_s$ are the confidences with which x is attributed to $A_1, A_2, \ldots, A_s$ respectively, take $\beta_j = \max\{\beta_1, \beta_2, \ldots, \beta_s\}$. If the center of $\beta_j$ is greater than 0.5, x should be attributed to $A_j$; if the center of $\beta_j$ is not greater than 0.5, x should be classified alone.

Simply speaking, the Interval-Interval clustering method works out the $\lambda$-level set of a similar matrix whose elements are all interval values, where $\lambda$ is also an interval value.

For example, in the above experiment, let $\lambda = [0.7, 0.8]$.

Then, the confidence of object 5 similarly attributing to A is:

$$\beta_A = \min\left\{ \left[ 1 + \tfrac{0.81 - 0.8}{0.81 - 0.7}\log_2\tfrac{0.81 - 0.8}{0.81 - 0.7},\ 1 + \tfrac{0.81 - 0.7}{0.81 - 0.7}\log_2\tfrac{0.81 - 0.7}{0.81 - 0.7} \right],\ \left[ 1 + \tfrac{0.83 - 0.8}{0.83 - 0.7}\log_2\tfrac{0.83 - 0.8}{0.83 - 0.7},\ 1 \right] \right\}$$
$$= \min\{[0.69, 1], [0.51, 1]\} = [0.69, 1]$$

The confidence of object 5 similarly attributing to B is:

$$\beta_B = \min\left\{ \left[ 1 + \tfrac{0.85 - 0.8}{0.85 - 0.7}\log_2\tfrac{0.85 - 0.8}{0.85 - 0.7},\ 1 + \tfrac{0.85 - 0.7}{0.85 - 0.7}\log_2\tfrac{0.85 - 0.7}{0.85 - 0.7} \right],\ \left[ 1 + \tfrac{0.92 - 0.8}{0.92 - 0.75}\log_2\tfrac{0.92 - 0.8}{0.92 - 0.75},\ 1 \right] \right\}$$
$$= \min\{[0.47, 1], [0.65, 1]\} = [0.65, 1]$$

Take $\beta = \max\{\beta_A, \beta_B\} = \beta_B = [0.65, 1]$; its center is greater than 0.5, so object 5 should be attributed to class B.

This result is the same as the previous result of Number-Interval clustering.
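The information formula used above can be sketched in Python. This is our own minimal illustration (the names `info` and `interval_confidence` are ours, not the paper's), reproducing two of the endpoints computed in the example with $\lambda = [0.7, 0.8]$:

```python
import math

# Interval similar confidence for an interval threshold (illustrative sketch).
def info(t_lo, t_hi, lam):
    """1 + x*log2(x) with x = (t_hi - lam) / (t_hi - t_lo), with the
    degenerate cases x <= 0 -> 0 and x >= 1 -> 1."""
    x = (t_hi - lam) / (t_hi - t_lo)
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    return 1 + x * math.log2(x)

def interval_confidence(t, lam):
    """Confidence interval [info(.., lam_hi), info(.., lam_lo)]
    for one similarity coefficient t = (t_lo, t_hi)."""
    (t_lo, t_hi), (lam_lo, lam_hi) = t, lam
    return (info(t_lo, t_hi, lam_hi), info(t_lo, t_hi, lam_lo))

lam = (0.7, 0.8)
# Object 5 vs object 1 ([0.70, 0.81]) and vs object 4 ([0.75, 0.92]):
print(interval_confidence((0.70, 0.81), lam))  # lower end ~0.69, upper end 1
print(interval_confidence((0.75, 0.92), lam))  # lower end ~0.65, upper end 1
```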


Interval-Matrix Clustering Method

Each interval $[t_{ij}^-, t_{ij}^+]$ can be written as $t_{ij}^- + (t_{ij}^+ - t_{ij}^-)u$, $0 \le u \le 1$. Given a $u_0$, the interval can be expressed by $r_{ij} = t_{ij}^- + (t_{ij}^+ - t_{ij}^-)u_0$. So we transform the similar interval matrix R into real matrices:

$$M = \begin{pmatrix} 1 & & & \\ r_{2,1} & 1 & & \\ \vdots & \vdots & \ddots & \\ r_{n,1} & r_{n,2} & \cdots & 1 \end{pmatrix} \quad \text{and} \quad U = \begin{pmatrix} 1 & & & \\ u_{2,1} & 1 & & \\ \vdots & \vdots & \ddots & \\ u_{n,1} & u_{n,2} & \cdots & 1 \end{pmatrix}$$

where the real matrix U is made up of the different $u_{ij}$ related to the different interval values.

Next, after applying a composition calculation to M and U respectively, we get their fuzzy equality relations; if we take different $\lambda$ values, we get different classification results, where each classification result is the intersection of the equality relations of M and U. Finally, choose the reasonable class according to the actual situation. Simply speaking, the Matrix-Interval clustering method transforms an interval-value matrix into two real matrices, and then clusters by the fuzzy equality relation clustering method. Because Matrix-Interval clustering varies with the value of $u_{ij}$ (all the values of $u_{ij}$ make up matrix U), and the value of $u_{ij}$ is strongly influenced by domain knowledge, it needs special directions from domain experts to gain satisfactory clustering results. However, as this method changes interval values to real ones, its efficiency increases remarkably. The corresponding example is omitted here for its complication.
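The interval-to-real transformation can be sketched as follows. This is our own illustration (the function name `split_interval_matrix` is ours; in practice the $u_{ij}$ values would come from domain experts, as noted above):

```python
# Transforming an interval similarity matrix into a real matrix M
# via r_ij = lo_ij + (hi_ij - lo_ij) * u_ij (illustrative sketch).
def split_interval_matrix(R, u):
    """R[i][j] = (lo, hi) interval; u[i][j] in [0, 1] is the expert-chosen
    position inside each interval. Returns the real matrix M."""
    n = len(R)
    return [[R[i][j][0] + (R[i][j][1] - R[i][j][0]) * u[i][j]
             for j in range(n)] for i in range(n)]

# A 2x2 toy example (hypothetical values):
R = [[(1.0, 1.0), (0.8, 0.9)],
     [(0.8, 0.9), (1.0, 1.0)]]
u = [[0.0, 0.5],
     [0.5, 0.0]]
M = split_interval_matrix(R, u)   # M[0][1] = 0.8 + 0.1 * 0.5 = 0.85
```

Varying u then yields the different real matrices whose fuzzy equality relations are intersected as described above.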

It is because so many interval values exist in reality, and these interval values cannot be correctly processed by the traditional data mining methods, that we discuss these three kinds of interval-value clustering methods. Next we discuss the data mining method for the interval-value database.


DATABASE INFORMATION MINING MODEL

There are two types of database information mining: one is data mining in a common database; the other is data mining in an interval-value database.

Data Mining in a Common Database

In the common database, the records are made up of a batch of figures, which range over certain fields. The values of each field vary in a certain area and are of the same type, and the popular processing method is to divide them into several intervals according to the actual needs. However, a hard dividing boundary will appear, so we bring forward the interval clustering method. By doing so, the classification becomes more reasonable and the boundary is softened, for the thresholds can be changed according to the actual conditions; more importantly, we can operate automatically by making all the thresholds take the same value. That is to say, we can break away from the real problem area and make all the data of each field cluster automatically (under the control of the same threshold). The algorithm is as follows:

Algorithm 1. (Automatic Data Clustering Algorithm)

Input: DB: database; Attr_List: attribute set; Thresh_Set: the thresholds used to cluster the attributes;
Output: the clustering results for all attributes;

Step 1: for each a_i in Attr_List: Uni_a_i = Transfer_ComparableType(a_i); // transform all the attributes into comparable types, and save them to Uni_a_i
Step 2: while (Thresh_Value(a_i) < Thresh_Set(a_i, Uni_a_i)) { // work out the similarities of all the values of each attribute
Step 3: for any k, j in DB and k, j in a_i {
Step 4: Compute_Similarity(k, j); } // calculate similarity
Step 5: Gen_SimilarMatrix(a_i, M_i); // generate the similarity matrix of the values of the attribute
Step 6: C <- M_i; } // C is the array of similarity matrices
Step 7: for each c_i in C: C = C + GetValue(c_i);
Step 8: Gen_IntervalCluster(Attr_List, C); // clustering
Step 9: S = statistic(C); // count the supports of the item sets
Step 10: Arrange_Matrix(DB, C); // merge and arrange the final mining results

In the above steps, after getting the final clustering results, we can carry out data mining, and the results of this data mining are quantitative association rules [11], which describe the quantitative relations among items.


Data Mining in an Interval-value Database

This is another kind of information mining; the difference from common information mining lies in that it introduces the concept of the interval-value database.

Definition 2. (interval-value relation database): Suppose $D_1, D_2, \ldots, D_n$ are n real fields, and $F(D_1), F(D_2), \ldots, F(D_n)$ are respectively sets constructed from interval values over $D_1, D_2, \ldots, D_n$ [3]. Regard them as the value domains of the attributes over which relations will be defined. Form the Cartesian product $F(D_1) \times F(D_2) \times \cdots \times F(D_n)$, and call any subset of this Cartesian product an interval relation over the record attributes; the database is then called an interval-value relation database. A record can be expressed as $t = (x_1, x_2, \ldots, x_n)$, where $x_i \in F(D_i)$ ($i = 1, 2, \ldots, n$) is an interval of $D_i$.

Definition 3. (closed interval distance): Suppose [a,b] and [c,d] are any two closed intervals; the distance between the two intervals is defined as $d([a,b],[c,d]) = ((a-c)^2 + (b-d)^2)^{1/2}$. It is easy to verify that this distance satisfies the three conditions of the definition of a distance.
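The closed interval distance of Definition 3 is a one-liner in code. The following sketch (function name ours, applied to hypothetical intervals) shows the computation:

```python
# Closed-interval distance of Definition 3 (illustrative sketch).
def interval_distance(a, b):
    """d([a1,a2],[b1,b2]) = sqrt((a1-b1)^2 + (a2-b2)^2)."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

print(interval_distance((1, 3), (4, 7)))  # sqrt(9 + 16) = 5.0
```

This is simply the Euclidean distance between the endpoint pairs, which is why the distance axioms (non-negativity, symmetry, triangle inequality) carry over directly.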

Interval-value data mining classifies $F(D_i)$ by the "interval-value clustering method", finally merges the database to reduce redundant attributes, and transforms it into a common quantitative database for mining. The algorithm is as follows:

Step 1: Transform the $F(D_i)$ related to attribute $D_i$ in the database into comparable types by generalizing and abstracting [11];

Step 2: In the processed database, work out the interval distance between every two figures for each $F(D_i)$, and regard the distance as their similarity measurement, so that a similar matrix is generated;

Step 3: Cluster according to one of the three interval-value methods;

Step 4: Decide whether the value fits the threshold; after labeling the attribute again, get the quantitative attribute;

Step 5: Carry out data mining on the quantitative attributes;

Step 6: Repeat step 3 and step 4;

Step 7: Arrange and merge the results of data mining;

Step 8: Get the quantitative association rules.


Example Research

There is the information as follows:

Table 1 Career, income questionnaire

Age | Income | Career                  | Number of villas owned
22  | 2000   | Salesman for books      | 0
39  | 10000  | Salesman for IT         | 2
50  | 3000   | College teacher         | 1
28  | 5000   | Career training teacher | 0
46  | 49000  | CEO for IT              | 5
36  | 15000  | CEO for manufacturing   | 1

Firstly, make a clustering of the interval values:

The clustering result for ages: I_11 = {22, 28}, I_12 = {36, 39}, I_13 = {46, 50};

The clustering result for incomes: I_21 = {2000, 3000, 5000}, I_22 = {10000, 15000}, I_23 = {49000};

The clustering result for careers: I_31 = {Salesman for books, Salesman for IT}, I_32 = {College teacher, Career training teacher}, I_33 = {CEO for IT, CEO for manufacturing};

The clustering result for the number of villas owned: I_41 = {0, 1}, I_42 = {2, 5}.

Then, calculate the supports:

Table 2 Statistical table for item sets

Item | Support
I_11 | 2
I_12 | 2
I_13 | 2
I_21 | 3
I_22 | 2
I_23 | 1
I_31 | 2
I_32 | 2
I_33 | 2
I_41 | 4
I_42 | 2
Thirdly, normalize the database.

Table 3 Transformed transaction database

TID | Items
1   | I_11, I_21, I_31, I_41
2   | I_12, I_22, I_31, I_42
3   | I_13, I_21, I_32, I_41
4   | I_11, I_21, I_32, I_41
5   | I_13, I_23, I_33, I_42
6   | I_12, I_22, I_33, I_41
Finally, if we mine the transformed database, we can find such association rules as "42 percent of male people who earn 2,000 to 5,000 each month own at most one villa". That is to say, I_21 → I_41 (Support = 0.42, Confidence = 1).
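Support and confidence for this rule can be checked with a short script over the transactions of Table 3. This is our own sketch; note that on this six-record illustration the support works out to 3/6 = 0.5, while the 42 percent figure presumably comes from the underlying survey data rather than this toy table:

```python
# Support and confidence of I21 -> I41 over the transactions of Table 3
# (illustrative sketch).
transactions = [
    {"I11", "I21", "I31", "I41"},
    {"I12", "I22", "I31", "I42"},
    {"I13", "I21", "I32", "I41"},
    {"I11", "I21", "I32", "I41"},
    {"I13", "I23", "I33", "I42"},
    {"I12", "I22", "I33", "I41"},
]

def support(itemset):
    """Fraction of transactions containing every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

sup = support({"I21", "I41"})   # 3 of 6 transactions -> 0.5
conf = sup / support({"I21"})   # every I21 transaction contains I41 -> 1.0
```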


ALGORITHM EVALUATION

In order to verify the effect of the above algorithm, we have done a large quantity of testing work. The simulated and real data tests showed that the aforementioned algorithm could improve the effect of data mining dramatically, and found many valuable association rules.

The following are the results of processing the real databases.


Experiment One

This is a result of mining a data set named Flags, which describes the characteristics of the national flags of all the countries in the world, including the area of the country, population, national flag, color, shape, size, special patterns, layout and so on. The mined results are described in Table 4:

Table 4 The results of mining for the database named Flags

Scale of data set | Effective attributes | Average values per attribute | Patterns | Support threshold | Confidence threshold
194 | 17 | 2 | 10  | 0.85 | 0.8
194 | 14 | 6 | 138 | 0.7  | 0.8
194 | 13 | 8 | 72  | 0.77 | 0.8

Experiment Two

This is a result of mining a data set named Zoo, which describes the different characteristics of 101 kinds of animals, including: hair, eggs, milk, tail, legs, toothed and so on. The mined results are described in Table 5:

Table 5 The results of mining for the database named Zoo

Scale of data set | Effective attributes | Average values per attribute | Patterns | Support threshold | Confidence threshold
101 | 18 | 3 | 2 | 0.8  | 0.7
101 | 14 | 2 | 4 | 0.72 | 0.8
101 | 10 | 2 | 4 | 0.7  | 0.74

Remarks: In the above experiments, the data sets come from ftp://www.pami.sjtu.edu.cn, which are real databases. The data scale refers to the number of records in the database; the scale for attributes refers to the number of attributes in the attribute set; the number of effective attributes refers to the number of attributes remaining after reduction; the average values per attribute refers to the number of different intervals after division; the number of patterns refers to the number of patterns found by mining.

Additionally, it needs to be explained that, in order to enhance the effect, we labeled all the attributes during the experiments. For example, the rule "1,11 → 40" represents "if the country is in America and the population is within 10,000,000, there is no vertical stripe in its flag".

The detailed explanations of the experiments and other experimental results can be obtained by visiting ftp://ftp.glnet.edu.cn.

CONCLUSIONS

The application of interval-value clustering in data mining has been discussed in this article. We first put forward three kinds of interval-value clustering methods and explained them by examples; then we offered interval-value clustering mining methods for the common database and the interval database respectively; finally there was an example study. The results of the three experiments in section 4 sufficiently proved that the application of interval-value clustering in data mining has a promising development prospect.

This kind of data mining method based on interval values is mainly fit for mining numeric data and comparable data; it is especially significant for the interval-value database. For other types of data, it has certain reference value. Worldwide, the research on the relation between interval-value clustering and data mining is just at the primary stage, but its development prospect is quite good. A series of new technologies and software will be produced. Now, most big commercial companies are competing in the supermarket business, and the 21st century will be a new era in which the interval-value clustering method is used in data mining.



REFERENCES

1. Agrawal, R., Imielinski, T. and Swami, A. Mining Association Rules Between Sets of Items in Large Databases. ACM SIGMOD Int. Conf. on Management of Data, 1993: 207-216.

2. Agrawal, R. and Srikant, R. Fast Algorithms for Mining Association Rules in Large Databases. Research Report RJ9839, IBM Almaden Research Center, San Jose, CA, June 1994.

3. He, X.G. Fuzzy Database System. Tsinghua University Press, Beijing, 1994.

4. Hu, C.Y., Xu, S.Y., and Yang, X.G. An Introduction to Interval Value Algorithms. Systems Engineering -- Theory & Practice, 2003(4): 59-62.

5. Lent, B., Swami, A., Widom, J. Clustering Association Rules. In Proc. 1997 Int. Conf. Data Engineering (ICDE'97), Birmingham, England, April 1997.

6. Luo, C.Z. A Guide to Fuzzy Sets. Beijing Normal University Press, Beijing, 1989.

7. Moore, R., Yang, C. Interval Analysis I. Technical Document LMSD-285875, Lockheed Missiles and Space Division, 1959.

8. Srikant, R. and Agrawal, R. Mining Quantitative Association Rules in Large Relational Tables. In Proceedings of ACM SIGMOD, 1996: 1-12.

9. Wu, X.D., Zhang, C.Q., and Zhang, S.C. Mining Both Positive and Negative Association Rules. In Proceedings of the 19th International Conference on Machine Learning, Sydney, Australia, July 2002: 658-665.

10. Yin, Y.F., Zhang, S.C., Xu, Z.Y. A Useful Model for Software Data Mining. Computer Application 3, 2004: 10-13.

11. Yin, Y.F., Zhong, Z., and Liu, X.S. Data Mining Based Stability Theory. Changjiang River Science Research Institute Journal 2, 2004: 22-24.

12. Zhang, C.Q. and Zhang, S.C. Association Rule Mining: Models and Algorithms. Springer-Verlag, Berlin Heidelberg, 2002.

13. Zhang, S.C. and Zhang, C.Q. Discovering Causality in Large Databases. Applied Artificial Intelligence, 2002.

14. Zhang, X.F. The Cluster Analysis for Interval-valued Fuzzy Sets. Journal of Liaocheng University, 2001, 14(1): 5-7.