An Interval-value Clustering Approach to Data Mining
Yunfei Yin
Department of Computer Science
Guangxi Normal University
Guilin 541004, P.R. China
yinyunfei@vip.sina.com

Faliang Huang
Institute of Computer Application
Guangxi Normal University
Guilin 541004, P.R. China
huangfliang@163.com
ABSTRACT
The interval-value clustering algorithm grew out of advances in computational mathematics and is widely used in engineering, commerce, aviation, and other fields. To put the research on interval methods and theory into practice, and to find more valuable knowledge when mining and analyzing enterprise data, we present a data mining method based on interval clustering. After introducing three kinds of interval clustering methods, we offer a method for mining association rules in interval databases. Compared with traditional data mining methods, this method is more accurate, more effective, more novel, and more useful. There is therefore considerable room for developing the method further, and it should prove of practical and social significance in commercial information mining and decision making.
KEYWORDS
Interval-value, clustering, interval database, interval distance, data mining
INTRODUCTION
Since Agrawal R. put forward the idea of mining Boolean association rules [1], data mining has been a fairly active research branch. Over the past ten years, Boolean association rule mining has received increasing attention from leading scholars, who have published a great many papers on it. For example, [2][12] brought forward a fast algorithm for mining Boolean association rules that can be used to solve the arrangement of commodities in supermarkets; [13] put forward the idea of causality association rule mining; and [9] offered a useful algorithm for mining negative association rules. Boolean association rule mining tries to find regular patterns of consumer behavior in retail databases, and the mined rules can explain such patterns as “if people buy creamery, they will also buy sugar”. However, Boolean association rule mining restricts the application area to binary data.
To find more valuable association rules, [6] mined clustering association rules by means of clustering. However, because information is often incomplete and the boundary between one thing and another is always ambiguous, representing an object by interval numbers becomes the only viable choice. For example, “50 percent of males aged 40 to 65 who earn 80,000 to 1,000,000 each year own at least two villas”. In this case, we can only use the interval clustering model to solve the problem. On this basis, this article introduces a modeling method suited to mining such database information. As multiple real examples show, this method can find more valuable association rules.
This paper is organized as follows. Section 2 describes three interval-value clustering methods with relevant examples. Section 3 offers two interval-value mining approaches, for common databases and for interval databases respectively. Section 4 reports the experimental results. Section 5 gives a brief conclusion about the research.
INTERVAL-VALUE CLUSTERING METHOD
The interval-value clustering algorithm [7] grew out of advances in computational mathematics and is widely used in many fields such as engineering and commerce [13]. Data mining based on interval-value clustering is one application of this model, which involves three interval-value clustering methods: Number-Interval clustering, Interval-Interval clustering, and Matrix-Interval clustering.

Figure 1 Netting (diagonal labeled with objects 1 to 5)
Number-interval Clustering Method
Suppose $x_1, x_2, \ldots, x_n$ are $n$ objects whose behavior is characterized by interval values. According to the traditional clustering similarity formula [6], we can get the corresponding similarity matrix (only the lower triangle is written out):

$$R = (r_{ij})_{n \times n} = \begin{pmatrix} 1 & & & \\ [t_{21}^{-}, t_{21}^{+}] & 1 & & \\ \vdots & \vdots & \ddots & \\ [t_{n1}^{-}, t_{n1}^{+}] & [t_{n2}^{-}, t_{n2}^{+}] & \cdots & 1 \end{pmatrix}$$

The matrix $R$ is symmetric, where $r_{ij} = [t_{ij}^{-}, t_{ij}^{+}]$ is the similarity between $x_i$ and $x_j$, $i, j = 1, 2, \ldots, n$.
The Number-Interval clustering method has three steps [14]:

(1) Netting: a serial number is labeled at each position on the diagonal of $R$. Given a threshold $\lambda_0$ offered by the user: if $t_{ij}^{-} > \lambda_0$, the element $[t_{ij}^{-}, t_{ij}^{+}]$ is replaced by a node mark; if $t_{ij}^{+} < \lambda_0$, the element is replaced by a blank; if $\lambda_0 \in [t_{ij}^{-}, t_{ij}^{+}]$, the element is replaced by “#”. An element carrying the node mark is called a node, while “#” is called a similar node. We first draw a vertical line (longitude) and a horizontal line (woof) from each node to the diagonal, and use broken lines to draw the longitude and woof from each similar node to the diagonal, as shown in figure 1.

(2) Relatively certain classification: for each node, bind together the longitude and woof that pass through it; the diagonal elements at the ends of these lines are classified into the same set. Finally, the remaining elements are classified into the last set.

(3) Similar fuzzy classification: bind each similar node to the relatively certain classes reached by the longitude or woof starting from that similar node.
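The three steps can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the names `net`, `certain_classes` and `similar_links` are hypothetical, the strict comparison for a node is an assumption, and merging node-linked objects with a union-find structure is one reading of step (2). The interval values are those of the worked example given later in this section.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]

# Lower-triangle similarity intervals r_ij (i > j) for objects 1..5,
# copied from the worked example in this section.
R: Dict[Tuple[int, int], Interval] = {
    (2, 1): (0.8, 0.9),
    (3, 1): (0.5, 0.7), (3, 2): (0.6, 0.72),
    (4, 1): (0.6, 0.61), (4, 2): (0.65, 0.70), (4, 3): (0.78, 0.81),
    (5, 1): (0.7, 0.81), (5, 2): (0.71, 0.83),
    (5, 3): (0.7, 0.85), (5, 4): (0.75, 0.92),
}


def net(interval: Interval, lam0: float) -> str:
    """Step (1): label one element as 'node', 'similar' (the '#' mark) or 'blank'."""
    lo, hi = interval
    if lo > lam0:        # whole interval above the threshold (strictness is assumed)
        return "node"
    if hi < lam0:        # whole interval below the threshold
        return "blank"
    return "similar"     # the threshold falls inside [lo, hi]


def certain_classes(n: int, r: Dict[Tuple[int, int], Interval], lam0: float) -> List[List[int]]:
    """Step (2): merge objects linked by nodes; unlinked objects stay alone."""
    parent = list(range(n + 1))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for (i, j), interval in r.items():
        if net(interval, lam0) == "node":
            parent[find(i)] = find(j)
    groups: Dict[int, List[int]] = {}
    for x in range(1, n + 1):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())


def similar_links(r: Dict[Tuple[int, int], Interval], lam0: float) -> List[Tuple[int, int]]:
    """Step (3) input: the pairs marked '#', to be attached to the certain classes."""
    return sorted(pair for pair, iv in r.items() if net(iv, lam0) == "similar")


print(certain_classes(5, R, 0.75))
print(similar_links(R, 0.75))
```

With the threshold 0.75, objects 1 and 2 are joined by a node, as are objects 3 and 4, while object 5 is linked to all four only through similar nodes.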
As figure 1 shows, the relatively certain classification assigns objects unambiguously, while the similar fuzzy classification does not. For example, the object set U={1,2,3,4,5} can be classified into two sets, A={1,2,[5]} and B={3,4,[5]}. But to which set does object 5 actually belong? To answer this, we introduce the concept of similar confidence.
Definition 1. (similar confidence): Suppose $A = \{x_1, x_2, \ldots, x_n, [x]\}$, $[t_i^{-}, t_i^{+}]$ is the similar coefficient of $x$ and $x_i$, and $\lambda_0$ is the threshold. Then

$$\alpha = \min\left\{ \frac{t_1^{+} - \lambda_0}{t_1^{+} - t_1^{-}}, \frac{t_2^{+} - \lambda_0}{t_2^{+} - t_2^{-}}, \ldots, \frac{t_n^{+} - \lambda_0}{t_n^{+} - t_n^{-}} \right\}$$

is called the similar confidence of $x$ similarly belonging to set $A$. If $x$ similarly belongs to $A_1, A_2, \ldots, A_s$ at the same time, with confidences $\alpha_1, \alpha_2, \ldots, \alpha_s$ respectively, take $\alpha_j = \max\{\alpha_1, \alpha_2, \ldots, \alpha_s\}$. When $\alpha_j > 0.5$, we consider that $x$ should be assigned to set $A_j$; when $\alpha_j \le 0.5$, $x$ should be classified into a set of its own.
For example, in the above instance, the similar coefficients of the object set U are known to be:

$$R = \begin{pmatrix} 1 & & & & \\ [0.8, 0.9] & 1 & & & \\ [0.5, 0.7] & [0.6, 0.72] & 1 & & \\ [0.6, 0.61] & [0.65, 0.70] & [0.78, 0.81] & 1 & \\ [0.7, 0.81] & [0.71, 0.83] & [0.7, 0.85] & [0.75, 0.92] & 1 \end{pmatrix}$$

and $\lambda_0 = 0.75$.
Then, the confidence of object 5 similarly belonging to A is:

$$\alpha_A = \min\left\{ \frac{0.81 - 0.75}{0.81 - 0.7}, \frac{0.83 - 0.75}{0.83 - 0.71} \right\} = \min\{0.55, 0.67\} = 0.55$$

The confidence of object 5 similarly belonging to B is:

$$\alpha_B = \min\left\{ \frac{0.85 - 0.75}{0.85 - 0.7}, \frac{0.92 - 0.75}{0.92 - 0.75} \right\} = \min\{0.67, 1\} = 0.67$$

Take $\alpha = \max\{\alpha_A, \alpha_B\} = \alpha_B = 0.67$.
∵ $\alpha > 0.5$
∴ object 5 should be assigned to class B.
That is to say, the final classification is: A={1,2}, B={3,4,5}.
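The computation above can be checked with a short Python sketch. The function names `similar_confidence` and `assign` are hypothetical, and the sketch assumes every similar coefficient is a proper interval (upper end strictly greater than lower end), so the denominator is never zero.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[float, float]


def similar_confidence(coeffs: List[Interval], lam0: float) -> float:
    """Minimum over the class members of (t_plus - lam0) / (t_plus - t_minus)."""
    return min((hi - lam0) / (hi - lo) for lo, hi in coeffs)


def assign(candidates: Dict[str, List[Interval]], lam0: float) -> Optional[str]:
    """Return the class with the largest confidence if it exceeds 0.5, else None.

    None means the object should form a class of its own.
    """
    scores = {name: similar_confidence(c, lam0) for name, c in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0.5 else None


# Similar coefficients of object 5 with the members of A = {1, 2} and B = {3, 4}.
candidates = {
    "A": [(0.7, 0.81), (0.71, 0.83)],
    "B": [(0.7, 0.85), (0.75, 0.92)],
}
print(assign(candidates, lam0=0.75))
```

Run on the example data, the sketch selects class B, in agreement with the classification A={1,2}, B={3,4,5}.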
Interval-interval Clustering Method
The Interval-Interval clustering method is an extension and generalization of the Number-Interval method: it expresses the threshold of the Number-Interval method as an interval.
The Interval-Interval clustering method also requires three procedures: netting, relatively certain classification, and similar fuzzy classification. To determine which relatively certain set a similar node finally belongs to, we extend the concept of similar confidence as follows:
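Suppose $\lambda$ is the interval value $[\lambda_0^{-}, \lambda_0^{+}]$, $A = \{x_1, x_2, \ldots, x_n, [x]\}$, and $[t_i^{-}, t_i^{+}]$ is the similar coefficient of $x$ and $x_i$. According to the following information formula:

$$\alpha_i^{-} = 1 + \frac{t_i^{+} - \lambda_0^{+}}{t_i^{+} - t_i^{-}} \log_2 \frac{t_i^{+} - \lambda_0^{+}}{t_i^{+} - t_i^{-}}, \quad \text{if } t_i^{-} \le \lambda_0^{+} < t_i^{+}$$

$$\alpha_i^{+} = \begin{cases} 1 + \dfrac{t_i^{+} - \lambda_0^{-}}{t_i^{+} - t_i^{-}} \log_2 \dfrac{t_i^{+} - \lambda_0^{-}}{t_i^{+} - t_i^{-}}, & \text{if } t_i^{-} \le \lambda_0^{-} < t_i^{+} \\ 1, & \text{if } \lambda_0^{-} < t_i^{-} \end{cases}$$

work out $[\alpha_i^{-}, \alpha_i^{+}]$. Then let

$$[\alpha^{-}, \alpha^{+}] = \min\{[\alpha_1^{-}, \alpha_1^{+}], [\alpha_2^{-}, \alpha_2^{+}], \ldots, [\alpha_n^{-}, \alpha_n^{+}]\},$$

and call it the similar confidence of $x$ similarly belonging to set $A$. If $\alpha_1, \alpha_2, \ldots, \alpha_s$ are the confidences of the sets to which $x$ similarly belongs, take $\alpha_j = \max\{\alpha_1, \alpha_2, \ldots, \alpha_s\}$. If the center of $\alpha_j$ is greater than 0.5, $x$ should be assigned to $A_j$; if the center of $\alpha_j$ is at most 0.5, $x$ should be classified alone.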
Simply put, the Interval-Interval clustering method works out the $\lambda$-level set of a similarity matrix whose elements are all interval values, where $\lambda$ is also an interval value.
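The information formula can be sketched for a single similar coefficient as follows. The names `info` and `confidence_interval` are hypothetical. The sketch handles only the cases that arise in the example of this section (the upper threshold end inside the coefficient interval, and the upper confidence end equal to 1 when the lower threshold end lies below the coefficient interval); any other case raises an error rather than guessing a value.

```python
import math
from typing import Tuple

Interval = Tuple[float, float]


def info(p: float) -> float:
    """Information term 1 + p * log2(p), for 0 < p <= 1."""
    return 1.0 + p * math.log2(p)


def confidence_interval(t: Interval, lam: Interval) -> Interval:
    """Confidence interval of one similar coefficient t under the interval threshold lam."""
    t_lo, t_hi = t
    lam_lo, lam_hi = lam
    # Assumption: only the configuration used in the worked example is supported.
    if not (t_lo <= lam_hi < t_hi and lam_lo < t_hi):
        raise ValueError("case not covered by this sketch")
    width = t_hi - t_lo
    lower = info((t_hi - lam_hi) / width)
    # The upper end is 1 when the lower threshold end is below the coefficient interval.
    upper = 1.0 if lam_lo < t_lo else info((t_hi - lam_lo) / width)
    return lower, upper


lam = (0.7, 0.8)
for coeff in [(0.7, 0.81), (0.7, 0.85), (0.75, 0.92)]:
    lower, upper = confidence_interval(coeff, lam)
    print(coeff, round(lower, 2), round(upper, 2))
```

Each call returns one interval $[\alpha_i^{-}, \alpha_i^{+}]$; combining the intervals of a class and comparing classes then follows the rule stated in the text.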
For example, in the above experiment, let $\lambda = [0.7, 0.8]$.
Then, the confidence of object 5 similarly belonging to A is:

$$\alpha_A = \min\left\{ \left[ 1 + \tfrac{0.81 - 0.8}{0.81 - 0.7} \log_2 \tfrac{0.81 - 0.8}{0.81 - 0.7},\; 1 + \tfrac{0.81 - 0.7}{0.81 - 0.7} \log_2 \tfrac{0.81 - 0.7}{0.81 - 0.7} \right], \left[ 1 + \tfrac{0.83 - 0.8}{0.83 - 0.7} \log_2 \tfrac{0.83 - 0.8}{0.83 - 0.7},\; 1 \right] \right\}$$

$$= \min\{[0.69, 1], [0.51, 1]\} = [0.69, 1]$$

The confidence of object 5 similarly belonging to B is:

$$\alpha_B = \min\left\{ \left[ 1 + \tfrac{0.85 - 0.8}{0.85 - 0.7} \log_2 \tfrac{0.85 - 0.8}{0.85 - 0.7},\; 1 + \tfrac{0.85 - 0.7}{0.85 - 0.7} \log_2 \tfrac{0.85 - 0.7}{0.85 - 0.7} \right], \left[ 1 + \tfrac{0.92 - 0.8}{0.92 - 0.75} \log_2 \tfrac{0.92 - 0.8}{0.92 - 0.75},\; 1 \right] \right\}$$

$$= \min\{[0.47, 1], [0.65, 1]\} = [0.65, 1]$$

Take $\alpha = \max\{\alpha_A, \alpha_B\} = \alpha_B = [0.65, 1]$.
∵ $\alpha > 0.5$
∴ object 5 should be assigned to class B.
The result is the same as the previous result of Number-Interval clustering.
Interval-matrix Clustering Method
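Each interval $[t_{ij}^{-}, t_{ij}^{+}]$ can be written as $t_{ij}^{-} + u\,(t_{ij}^{+} - t_{ij}^{-})$, $0 \le u \le 1$. Given a $u_0$, the interval can be expressed by the real number $r_{ij} = t_{ij}^{-} + u_0\,(t_{ij}^{+} - t_{ij}^{-})$. So, we transform the similar interval matrix $R$ into two real matrices:

$$M = \begin{pmatrix} 1 & & & \\ r_{2,1} & 1 & & \\ \vdots & \vdots & \ddots & \\ r_{n,1} & r_{n,2} & \cdots & 1 \end{pmatrix} \quad \text{and} \quad U = \begin{pmatrix} 1 & & & \\ u_{2,1} & 1 & & \\ \vdots & \vdots & \ddots & \\ u_{n,1} & u_{n,2} & \cdots & 1 \end{pmatrix},$$

where the real matrix $U$ is made up of the different $u_{ij}$, each related to a different interval value.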
Next, after applying the composition operation to $M$ and $U$ respectively, we obtain their fuzzy equivalence relations. Taking different values of $\lambda$ gives different classification results, where a classification result is the intersection of the equivalence relations of $M$ and $U$. Finally, the reasonable classes are chosen according to the actual situation. Simply put, the Matrix-Interval clustering method transforms the interval-value matrix into two real matrices and then clusters them with the fuzzy equivalence relation clustering method. The result of Matrix-Interval clustering changes with the values of $u_{ij}$ (all the values of $u_{ij}$ make up matrix $U$), and because the value of $u_{ij}$ depends heavily on domain knowledge, domain experts must give specific guidance for the clustering results to be satisfactory. However, because this approach converts interval values to real ones, efficiency increases remarkably. The corresponding example is omitted here because of its complexity.
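For illustration only, the transformation and the fuzzy equivalence clustering can be sketched in Python. The sketch makes simplifying assumptions that are not in the paper: a single value `u` is used for every element (so the matrix $U$ is constant and its relation is not intersected separately), the composition is taken to be the usual max-min composition, and the function names are hypothetical. The interval data are those of the Number-Interval example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]

# Interval similarity matrix of the earlier example (lower triangle, 1-based).
R: Dict[Tuple[int, int], Interval] = {
    (2, 1): (0.8, 0.9),
    (3, 1): (0.5, 0.7), (3, 2): (0.6, 0.72),
    (4, 1): (0.6, 0.61), (4, 2): (0.65, 0.70), (4, 3): (0.78, 0.81),
    (5, 1): (0.7, 0.81), (5, 2): (0.71, 0.83),
    (5, 3): (0.7, 0.85), (5, 4): (0.75, 0.92),
}


def to_real_matrix(n: int, r: Dict[Tuple[int, int], Interval], u: float) -> List[List[float]]:
    """Build the symmetric matrix M with m_ij = t_minus + u * (t_plus - t_minus)."""
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (i, j), (lo, hi) in r.items():
        m[i - 1][j - 1] = m[j - 1][i - 1] = lo + u * (hi - lo)
    return m


def max_min_closure(m: List[List[float]]) -> List[List[float]]:
    """Square M under max-min composition until it no longer changes."""
    n = len(m)
    while True:
        sq = [[max(min(m[i][k], m[k][j]) for k in range(n)) for j in range(n)]
              for i in range(n)]
        if sq == m:
            return m
        m = sq


def cut_classes(closure: List[List[float]], lam: float) -> List[List[int]]:
    """Group the objects (1-based) whose closure entry reaches the level lam."""
    n = len(closure)
    seen: set = set()
    classes: List[List[int]] = []
    for i in range(n):
        if i not in seen:
            members = [j for j in range(n) if closure[i][j] >= lam]
            seen.update(members)
            classes.append([j + 1 for j in members])
    return classes


closure = max_min_closure(to_real_matrix(5, R, u=0.5))
print(cut_classes(closure, lam=0.78))
print(cut_classes(closure, lam=0.80))
```

Different levels `lam` give coarser or finer partitions, which is why the text leaves the final choice of classes to the actual situation.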
We discuss three kinds of interval-value methods because many interval values exist in reality, and these interval values cannot be processed correctly by traditional data mining methods. Next we discuss data mining in an interval-value database.
DATABASE INFORMATION MINING MODEL
There are two types of database information mining: one is data mining in a common database; the other is data mining in an interval-value database.
Data Mining In a Common Database
In a common database, the records are made up of groups of values, each ranging over a certain field. The values of each field vary within a certain range and share the same type, and the popular processing method is to divide them into several intervals according to actual needs. However, this produces hard dividing boundaries, so we bring forward the interval clustering method. With it, the classification is more reasonable and the boundaries are softened, because the thresholds can be changed according to actual conditions. More importantly, the operation can be automated by making all the thresholds take the same value. That is to say, we can step away from the specific problem domain and cluster all the data of each field automatically (under the control of the same threshold). The algorithm is as follows:
Algorithm 1. (Automatic Data Clustering Algorithm)
Input: DB: database; Attr_List: attribute set; Thresh_Set: the thresholds used to cluster all the attributes;
Output: the clustering results for all attributes;
Step 1: for each a_i ∈ Attr_List
          Uni_a_i = Transfer_ComparableType(a_i); // Transform all the attributes into comparable types, and save to Uni_a_i.
Step 2: while (Thresh_Value(a_i) < Thresh_Set(a_i, Uni_a_i)) { // work out the similarities of all the values of each attribute
Step 3:   for any k, j ∈ DB and k, j ∈ a_i {
Step 4:     Compute_Similiarity(k, j); } // Calculate similarity
Step 5:   Gen_SimilarMatrix(a_i, M_i); // Generate the similarity matrix of the values of a certain attribute
Step 6:   C ← M_i; } // C is the array of similarity matrices
Step 7: for each c_i ∈ C
          C = C + GetValue(c_i);
Step 8: Gen_IntervalCluster(Attr_List, C); // Clustering
Step 9: S = statistic(C); // count the support of item sets
Step 10: Arrange_Matrix(DB, C); // Merge and arrange the last mining results
After the final clustering results are obtained in the steps above, data mining can be carried out. The mined results are quantitative association rules [11], which describe the quantitative relations among items.
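As a rough illustration of clustering one numeric attribute automatically under a single threshold, consider the following Python sketch. It is not Algorithm 1 itself: the similarity measure (one minus the absolute difference divided by the value range), the function name `cluster_attribute`, and the threshold 0.78 are all assumptions chosen for the illustration.

```python
from typing import List


def cluster_attribute(values: List[float], threshold: float) -> List[List[float]]:
    """Cluster the values of one numeric attribute under a single threshold.

    Assumed similarity: sim(a, b) = 1 - |a - b| / (max - min). On a line, two
    values end up in the same cluster exactly when every gap between sorted
    neighbours on the way from one to the other reaches the threshold.
    """
    ordered = sorted(set(values))
    span = ordered[-1] - ordered[0]
    if span == 0:
        return [ordered]
    clusters = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if 1 - (cur - prev) / span >= threshold:
            clusters[-1].append(cur)    # similar enough: extend the current cluster
        else:
            clusters.append([cur])      # gap too wide: start a new cluster
    return clusters


ages = [22, 39, 50, 28, 46, 36]   # sample values of a numeric attribute
print(cluster_attribute(ages, threshold=0.78))
```

Because the same threshold is applied to every attribute, no attribute-specific interval boundaries have to be supplied by hand, which is the point of the automatic operation described above.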
Data Mining In an Interval-value Database
This is another kind of information mining. It differs from common information mining in that it introduces the concept of an interval-value database.
Definition 2. (interval-value relation database): Suppose $D_1, D_2, \ldots, D_n$ are $n$ real fields, and $F(D_1), F(D_2), \ldots, F(D_n)$ are sets constructed from interval values in $D_1, D_2, \ldots, D_n$ respectively [3]. Regard them as the value fields of attributes on which relations will be defined. Form the Cartesian product $F(D_1) \times F(D_2) \times \cdots \times F(D_n)$, and call a subset of this Cartesian product an interval relation on the record attributes; the database is then called an interval-value relation database. A record can be expressed as $t = (x_1, x_2, \ldots, x_n)$, where $x_i \in F(D_i)$ ($i = 1, 2, \ldots, n$) is an interval of $D_i$.
Definition 3. (closed interval distance): Suppose $[a,b]$ and $[c,d]$ are any two closed intervals. The distance between the two intervals is defined as

$$d([a,b],[c,d]) = \left( (a-c)^2 + (b-d)^2 \right)^{1/2}.$$

It is easy to verify that this distance satisfies the three conditions in the definition of a distance.
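The closed interval distance is simply the Euclidean distance between the endpoint pairs, which a few lines of Python make concrete. The function name `interval_distance` is hypothetical.

```python
import math
from typing import Tuple

Interval = Tuple[float, float]


def interval_distance(p: Interval, q: Interval) -> float:
    """d([a, b], [c, d]) = sqrt((a - c)^2 + (b - d)^2)."""
    (a, b), (c, d) = p, q
    return math.hypot(a - c, b - d)


print(interval_distance((0, 1), (3, 5)))
```

The distance is zero only for identical intervals and is symmetric in its two arguments, which covers two of the three conditions mentioned above; the triangle inequality is inherited from the Euclidean norm.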
Interval-value data mining classifies $F(D_i)$ by the “interval-value clustering method”, then merges the database to reduce redundant attributes and transforms it into a common quantitative database for mining. The algorithm is as follows:
Step 1: Transform $F(D_i)$, related to attribute $D_i$ in the database, into a comparable type by generalization and abstraction [11];
Step 2: In the processed database, work out the interval distance between every two values of each $F(D_i)$, and take this distance as their similarity measure. A similarity matrix is generated in this way;
Step 3: Cluster according to one of the three interval-value methods;
Step 4: Decide whether the value fits the threshold; after relabeling the attribute, obtain a quantitative attribute;
Step 5: Mine the quantitative attributes;
Step 6: Repeat step 3 and step 4;
Step 7: Arrange and merge the results of data mining;
Step 8: Obtain the quantitative association rules.
Example Research
The following information is given:

Table 1 Career, income questionnaire

| Age | Income | Career | Number of villas owned |
|---|---|---|---|
| 22 | 2000 | Salesman for books | 0 |
| 39 | 10000 | Salesman for IT | 2 |
| 50 | 3000 | College teacher | 1 |
| 28 | 5000 | Career training teacher | 0 |
| 46 | 49000 | CEO for IT | 5 |
| 36 | 15000 | CEO for manufacturing | 1 |
First, cluster the interval values:
The clustering result for ages: $I_{11}$={22, 28}, $I_{12}$={36, 39}, $I_{13}$={46, 50};
The clustering result for incomes: $I_{21}$={2000, 3000, 5000}, $I_{22}$={10000, 15000}, $I_{23}$={49000};
The clustering result for careers: $I_{31}$={Salesman for books, Salesman for IT}, $I_{32}$={College teacher, Career training teacher}, $I_{33}$={CEO for IT, CEO for manufacturing};
The clustering result for Number of villas owned: $I_{41}$={0, 1}, $I_{42}$={2, 5}.
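Then, calculate the supports:

Table 2 Statistical table for item sets

| Item | Support |
|---|---|
| $I_{11}$ | 2 |
| $I_{12}$ | 2 |
| $I_{13}$ | 2 |
| $I_{21}$ | 3 |
| $I_{22}$ | 2 |
| $I_{23}$ | 1 |
| $I_{31}$ | 2 |
| $I_{32}$ | 2 |
| $I_{33}$ | 2 |
| $I_{41}$ | 4 |
| $I_{42}$ | 2 |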
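Thirdly, normalize the database.

Table 3 Transformed transaction database

| TID | Items |
|---|---|
| 1 | $I_{11}$, $I_{21}$, $I_{31}$, $I_{41}$ |
| 2 | $I_{12}$, $I_{22}$, $I_{31}$, $I_{42}$ |
| 3 | $I_{13}$, $I_{21}$, $I_{32}$, $I_{41}$ |
| 4 | $I_{11}$, $I_{21}$, $I_{32}$, $I_{41}$ |
| 5 | $I_{13}$, $I_{23}$, $I_{33}$, $I_{42}$ |
| 6 | $I_{12}$, $I_{22}$, $I_{33}$, $I_{41}$ |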
Finally, mining the transformed database yields association rules such as “42 percent of male people who earn 2,000 to 5,000 each month own at most one villa”, that is, $I_{21} \rightarrow I_{41}$ (Support=0.42, Confidence=1).
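The item counts of Table 2 and the confidence of a rule can be recomputed from the transformed transactions of Table 3 with a short Python sketch. The helper names `support_count` and `confidence` are hypothetical, and confidence is taken in the usual sense: the number of transactions containing both sides of the rule divided by the number containing the antecedent.

```python
from typing import FrozenSet, Iterable, List

# Transformed transaction database, one item set per TID (Table 3).
transactions: List[FrozenSet[str]] = [
    frozenset({"I11", "I21", "I31", "I41"}),
    frozenset({"I12", "I22", "I31", "I42"}),
    frozenset({"I13", "I21", "I32", "I41"}),
    frozenset({"I11", "I21", "I32", "I41"}),
    frozenset({"I13", "I23", "I33", "I42"}),
    frozenset({"I12", "I22", "I33", "I41"}),
]


def support_count(items: Iterable[str]) -> int:
    """Number of transactions that contain every item in `items`."""
    target = frozenset(items)
    return sum(1 for t in transactions if target <= t)


def confidence(antecedent: Iterable[str], consequent: Iterable[str]) -> float:
    """Share of the transactions containing the antecedent that also contain the consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    return support_count(antecedent | consequent) / support_count(antecedent)


print(support_count({"I21"}), support_count({"I41"}))
print(confidence({"I21"}, {"I41"}))
```

Every transaction that contains $I_{21}$ also contains $I_{41}$, which is why the rule has confidence 1.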
ALGORITHM EVALUATION
To verify the effect of the above algorithm, we carried out a large amount of testing. Tests on simulated and real data showed that the algorithm could improve the effect of data mining dramatically and find many valuable association rules.
The following are the results of processing the real databases.
Experiment One
This is the result of mining a data set named Flags, which describes the characteristics of the national flags of all the countries in the world; the characteristics include area of country, population, national flag, color, shape, size, special patterns, layout and so on. The mined results are shown in table 4:

Table 4 The results of mining the database named Flags

| Scale of data set | Number of effective attributes | Average number of values per attribute | Number of patterns | Threshold: support | Threshold: confidence |
|---|---|---|---|---|---|
| 194 | 17 | 2 | 10 | 0.85 | 0.8 |
| 194 | 14 | 6 | 138 | 0.7 | 0.8 |
| 194 | 13 | 8 | 72 | 0.77 | 0.8 |

Experiment Two
This is the result of mining a data set named Zoo, which describes the different characteristics of 101 kinds of animals; the characteristics include hair, eggs, milk, tail, legs, toothed and so on. The mined results are shown in table 5:

Table 5 The results of mining the database named Zoo

| Scale of data set | Number of effective attributes | Average number of values per attribute | Number of patterns | Threshold: support | Threshold: confidence |
|---|---|---|---|---|---|
| 101 | 18 | 3 | 2 | 0.8 | 0.7 |
| 101 | 14 | 2 | 4 | 0.72 | 0.8 |
| 101 | 10 | 2 | 4 | 0.7 | 0.74 |

Remarks: In the above experiments, the data sets come from ftp://www.pami.sjtu.edu.cn and are real databases. The scale of the data set refers to the number of records in the database; the scale of attributes refers to the number of attributes in the attribute set; the number of effective attributes refers to the number of attributes remaining after reduction; the average number of values per attribute refers to the number of different intervals after division; the number of patterns refers to the number of patterns found by mining.
Additionally, to enhance the effect, we labeled all the attributes when running the experiments. For example, a rule “1,11→40” represents “if the country is in America and the population is within 10,000,000, there is not any vertical stripe in its flag”.
Detailed explanations of the experiments and other experimental results can be obtained by visiting ftp://ftp.glnet.edu.cn.
CONCLUSIONS
The application of interval-value clustering in data mining has been discussed in this article. First, three kinds of interval-value clustering methods were put forward and explained by examples; then interval-value clustering mining methods for common databases and interval databases were offered respectively; finally, an example study was presented. The experimental results in section 4 showed that the application of interval-value clustering in data mining has broad prospects for development.
This kind of data mining method based on interval values is mainly suited to mining numeric and comparable data, and it is especially significant for interval-value databases. For other types of data, it has some reference value. Research on the relation between interval-value clustering and data mining is still at an early stage worldwide, but its development prospects are good. A series of new technologies and software will be produced. Most big commercial companies are now competing in the supermarket sector, and the 21st century will be a new era in which the interval-value clustering method is used in data mining.
REFERENCES
1. Agrawal, R., Imielinski, T., and Swami, A. Mining Association Rules Between Sets of Items in Large Databases. ACM SIGMOD Int. Conf. on Management of Data, 1993: 207-216.
2. Agrawal, R. and Srikant, R. Fast algorithm for mining association rules in large database. Research Report RJ9839, IBM Almaden Research Center, San Jose, CA, June 1994.
3. He, X.G. Fuzzy Database System, Tsinghua University Press, Beijing, 1994.
4. Hu, C.Y., Xu, S.Y., and Yang, X.G. An Introduction to Interval Value Algorithm, Systems Engineering-Theory & Practice, 2003 (4): 59-62.
5. Lent, B., Swami, A., and Widom, J. Clustering association rules. In Proc. 1997 Int. Conf. Data Engineering (ICDE'97), Birmingham, England, April 1997.
6. Luo, C.Z. A Guide to Fuzzy Set, Beijing Normal University Press, Beijing, 1989.
7. Moore, R. and Yang, C. Interval Analysis I. Technical Document, Lockheed Missiles and Space Division, Number LMSD-285875, 1959.
8. Srikant, R. and Agrawal, R. Mining Quantitative Association Rules in Large Tables. In: Proceedings of ACM SIGMOD, 1996: 1-12.
9. Wu, X.D., Zhang, C.Q., and Zhang, S.C. Mining Both Positive and Negative Association Rules. In Proceedings of 19th International Conference on Machine Learning, Sydney, Australia, July 2002: 658-665.
10. Yin, Y.F., Zhang, S.C., and Xu, Z.Y. A Useful Model for Software Data Mining, Computer Application 3, 2004: 10-13.
11. Yin, Y.F., Zhong, Z., and Liu, X.S. Data Mining Based Stability Theory. Changjiang River Science Research Institute Journal 2, 2004: 22-24.
12. Zhang, C.Q. and Zhang, S.C. Association Rule Mining Models and Algorithms. Springer-Verlag, Berlin Heidelberg, 2002.
13. Zhang, S.C. and Zhang, C.Q. Discovering Causality in Large Databases, Applied Artificial Intelligence, 2002.
14. Zhang, X.F. The Cluster Analysis for Interval-valued Fuzzy Sets, Journal of Liaocheng University, 2001, 14(1): 5-7.