# Clustering Algorithms for Categorical Data Sets

8 Nov 2013

As mentioned earlier, one essential issue in clustering a categorical data set is defining a similarity (or dissimilarity) function between two objects.

One of the most fundamental and important data models for categorical data sets is the market-basket data model.

The Market-Basket Data Model
In this data model, there is a set of objects $\{O_1, O_2, \ldots, O_n\}$ and a set of transactions $\{T_1, T_2, \ldots, T_m\}$. Each transaction is a subset of the object set.

A market-basket data set can be represented by a 2-dimensional table in which each entry is either 0 or 1.

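The Tabular Representation of the Market-Basket Data Model

[Table: a 0/1 matrix indexed by the transactions $T_1, T_2, \ldots, T_m$ and the objects $O_1, O_2, \ldots, O_n$; an entry is 1 when the transaction contains the object and 0 otherwise.]

Data Sets with the Market-Basket Data Model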

A record of web site accesses.

A record of course enrollment.

Clustering Objects in a Market-Basket Data Set
In this problem, it is assumed that each
transaction is an independent event.

The commonly used measures of similarity
include:

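Jaccard coefficient: $\dfrac{|A \cap B|}{|A \cup B|}$

Mutual information: $\dfrac{P(A \cap B)}{P(A)\,P(B)}$

Here $A$ and $B$ denote the sets of transactions that contain the two objects being compared.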
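The two measures can be sketched in a few lines of Python. This is only an illustration under assumptions: each object is represented by the set of transaction ids in which it occurs, the function names and the toy sets `A` and `B` are hypothetical, and the second measure is taken as the plain ratio $P(A \cap B) / (P(A) P(B))$ without a logarithm.

```python
from fractions import Fraction


def jaccard(a: set, b: set) -> Fraction:
    """|A ∩ B| / |A ∪ B|; taken to be 0 when both sets are empty."""
    union = a | b
    if not union:
        return Fraction(0)
    return Fraction(len(a & b), len(union))


def mutual_information_ratio(a: set, b: set, n: int) -> Fraction:
    """P(A ∩ B) / (P(A) P(B)), with probabilities estimated over n transactions."""
    if not a or not b:
        raise ValueError("each object must occur in at least one transaction")
    p_ab = Fraction(len(a & b), n)
    return p_ab / (Fraction(len(a), n) * Fraction(len(b), n))


# Hypothetical toy data: ids of the transactions that contain each object.
A = {1, 2, 4}
B = {2, 4, 5}
print(jaccard(A, B))                      # 1/2
print(mutual_information_ratio(A, B, 5))  # 10/9
```

A ratio above 1 indicates that the two objects co-occur more often than independence would predict, and a ratio below 1 indicates the opposite.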

Once the similarity between each pair of objects has been determined, we may apply algorithms such as the single-link and complete-link algorithms.

Experimental results show that the complete-link algorithm generally yields better clustering quality than the single-link algorithm.

An Example

Given the following web access record, we may
cluster the web sites accordingly.

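| | Site 1 | Site 2 | Site 3 | Site 4 | Site 5 |
|---|---|---|---|---|---|
| User1 | 1 | 1 | 0 | 1 | 1 |
| User2 | 1 | 0 | 1 | 0 | 0 |
| User3 | 0 | 1 | 0 | 1 | 1 |
| User4 | 1 | 0 | 1 | 0 | 1 |
| User5 | 1 | 0 | 1 | 1 | 1 |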

Based on the Jaccard coefficient, we have the following similarity measurements:

sim(s1, s2) = 1/5

sim(s1, s3) = 3/4

sim(s1, s4) = 2/5

sim(s1, s5) = 3/5

sim(s2, s3) = 0

sim(s2, s4) = 2/3

sim(s2, s5) = 1/2

sim(s3, s4) = 1/5

sim(s3, s5) = 2/5

sim(s4, s5) = 3/4

If we employ the complete-link algorithm, then we have the following clustering result:

[Dendrogram: s1 and s3 merge at similarity 3/4; s4 and s5 merge at 3/4; s2 then joins {s4, s5} at 1/2. The two resulting clusters are {s1, s3} and {s2, s4, s5}.]
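The example can be reproduced with a short Python sketch. It rebuilds the Jaccard similarities from the web access record and then runs a naive complete-link agglomeration, in which the similarity of two clusters is the minimum pairwise similarity. The function and variable names are illustrative assumptions, and ties are broken by iteration order.

```python
from fractions import Fraction
from itertools import combinations

# Web access record from the example: the sites each user visited.
access = {
    "User1": {"s1", "s2", "s4", "s5"},
    "User2": {"s1", "s3"},
    "User3": {"s2", "s4", "s5"},
    "User4": {"s1", "s3", "s5"},
    "User5": {"s1", "s3", "s4", "s5"},
}
sites = ["s1", "s2", "s3", "s4", "s5"]
# visitors[s] = set of users who accessed site s
visitors = {s: {u for u, seen in access.items() if s in seen} for s in sites}


def jaccard(a, b):
    return Fraction(len(a & b), len(a | b))


sim = {frozenset(pair): jaccard(visitors[pair[0]], visitors[pair[1]])
       for pair in combinations(sites, 2)}


def complete_link(items, sim, k):
    """Merge bottom-up until k clusters remain (naive, not optimized)."""
    clusters = [frozenset([x]) for x in items]
    merges = []

    def cluster_sim(c1, c2):
        # Complete link: the least similar pair decides.
        return min(sim[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > k:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
        merges.append((sorted(c1), sorted(c2), cluster_sim(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters, merges


clusters, merges = complete_link(sites, sim, k=2)
for left, right, height in merges:
    print(left, "+", right, "at similarity", height)
print(sorted(sorted(c) for c in clusters))
```

Recomputing the best pair on every pass keeps the sketch short; a practical implementation would maintain the pairwise cluster similarities incrementally.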

We may also use the chi-square statistic as the similarity measure. However, we need to consider whether the accesses to two web sites are positively correlated or negatively correlated.

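For example:

| | s1 | ~s1 | Fraction of users |
|---|---|---|---|
| s3 | 3 | 0 | 3/5 |
| ~s3 | 1 | 1 | 2/5 |
| Fraction of users | 4/5 | 1/5 | |

$$\frac{P(s_1 \,\&\, s_3)}{P(s_1)\,P(s_3)} = \frac{3/5}{(4/5)(3/5)} = \frac{5}{4} > 1 \quad \text{(positively correlated).}$$

$$\chi^2 = \frac{(3 - 12/5)^2}{12/5} + \frac{(0 - 3/5)^2}{3/5} + \frac{(1 - 8/5)^2}{8/5} + \frac{(1 - 2/5)^2}{2/5} = \frac{15}{8}.$$

Therefore, we set sim(s1, s3) = 15/8.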

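On the other hand:

| | s2 | ~s2 | Fraction of users |
|---|---|---|---|
| s3 | 0 | 3 | 3/5 |
| ~s3 | 2 | 0 | 2/5 |
| Fraction of users | 2/5 | 3/5 | |

$$\frac{P(s_2 \,\&\, s_3)}{P(s_2)\,P(s_3)} = \frac{0}{(2/5)(3/5)} = 0 < 1 \quad \text{(negatively correlated).}$$

$$\chi^2 = \frac{(0 - 6/5)^2}{6/5} + \frac{(3 - 9/5)^2}{9/5} + \frac{(2 - 4/5)^2}{4/5} + \frac{(0 - 6/5)^2}{6/5} = 5.$$

Therefore, we set sim(s2, s3) = -5.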
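A signed chi-square similarity of this kind can be sketched as follows. The function name and the sign convention (negate the statistic when the pair co-occurs less often than independence predicts) are assumptions made for illustration; the two calls use the visitor sets of s1, s2 and s3 from the web access record.

```python
from fractions import Fraction


def signed_chi_square(a: set, b: set, n: int) -> Fraction:
    """Chi-square statistic of the 2x2 contingency table of a and b,
    negated when P(a & b) < P(a) P(b)."""
    n11 = len(a & b)               # transactions containing both
    n10 = len(a - b)               # a only
    n01 = len(b - a)               # b only
    n00 = n - n11 - n10 - n01      # neither
    observed = [[n11, n10], [n01, n00]]
    row = [n11 + n10, n01 + n00]   # totals for a, not a
    col = [n11 + n01, n10 + n00]   # totals for b, not b
    chi2 = Fraction(0)
    for r in range(2):
        for c in range(2):
            expected = Fraction(row[r] * col[c], n)
            if expected == 0:
                raise ValueError("an object occurs in all or in no transactions")
            chi2 += (observed[r][c] - expected) ** 2 / expected
    positively_correlated = n11 * n >= row[0] * col[0]
    return chi2 if positively_correlated else -chi2


# Users (numbered 1..5) who accessed each site.
s1 = {1, 2, 4, 5}
s2 = {1, 3}
s3 = {2, 4, 5}
print(signed_chi_square(s1, s3, 5))  # 15/8
print(signed_chi_square(s2, s3, 5))  # -5
```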

The Object-Attribute Data Model

In this data model, there is a set of objects $\{O_1, O_2, \ldots, O_n\}$ and a set of attributes $\{A_1, A_2, \ldots, A_m\}$. Each attribute has a number of possible values.

For example, we may characterize a person by educational background, profession, marital status, etc.

If each attribute has exactly two possible values, then the object-attribute data model degenerates to the market-basket data model.

An object-attribute data set can be transformed to a market-basket data set, as the following example shows.

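[Table: in the object-attribute table, attribute $A_i$ is a single column whose entries take one of the values $v_1, v_2, \ldots, v_k$. In the transformed market-basket table, this column is replaced by $k$ binary columns labeled $A_i = v_1$, $A_i = v_2$, ..., $A_i = v_k$.]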
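The transformation amounts to creating one binary column per (attribute, value) pair. A minimal sketch, assuming each object is given as a Python dict from attribute name to value; the function name and the sample records are hypothetical.

```python
def to_market_basket(records):
    """Return (columns, rows): one 0/1 column per (attribute, value) pair."""
    columns = sorted({(attr, value) for rec in records for attr, value in rec.items()})
    rows = [[1 if rec.get(attr) == value else 0 for attr, value in columns]
            for rec in records]
    return columns, rows


# Hypothetical records in the spirit of the "person" example.
people = [
    {"education": "college", "profession": "engineer"},
    {"education": "high school", "profession": "engineer"},
    {"education": "college", "profession": "teacher"},
]
columns, rows = to_market_basket(people)
for (attr, value), column in zip(columns, zip(*rows)):
    print(f"{attr}={value}: {list(column)}")
```

Each object has exactly one 1 among the columns that belong to the same attribute, so similarity measures for market-basket data can be applied to the transformed table.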

The ROCK algorithm

A categorical data clustering algorithm that
takes into account node connectivity.

In ROCK, each object is represented by a
node.

Two nodes are connected by an edge if the
similarity between the corresponding
objects exceeds a threshold.

Let $link(n_i, n_j)$ of two nodes $n_i$ and $n_j$ denote the number of common neighbors of these two nodes.

Given a data set and an integer $k$, the ROCK algorithm partitions the objects into $k$ clusters so that the following function is maximized:

$$\sum_{i=1}^{k} |C_i| \cdot \sum_{n_r, n_s \in C_i} \frac{link(n_r, n_s)}{|C_i|^{1 + 2f(\theta)}}$$

The ROCK algorithm works bottom-up by merging the pair of clusters that has the maximum goodness measure:

$$g(C_i, C_j) = \frac{link[C_i, C_j]}{(|C_i| + |C_j|)^{1 + 2f(\theta)} - |C_i|^{1 + 2f(\theta)} - |C_j|^{1 + 2f(\theta)}}$$

Fundamentals of the Criterion Functions

Assume that the expected number of edges at a node in cluster $C_i$ is $|C_i|^{f(\theta)}$.

Then, the expected number of links contributed by a node in $C_i$ is

$$|C_i|^{f(\theta)} \cdot |C_i|^{f(\theta)} = |C_i|^{2f(\theta)}.$$

Therefore, the expected number of links in $C_i$ is

$$|C_i| \cdot |C_i|^{2f(\theta)} = |C_i|^{1 + 2f(\theta)}.$$

The Pseudocode of the ROCK Algorithm

    procedure cluster(S, k)
    begin
        link := compute_links(S)
        for each s ∈ S do
            q[s] := build_local_heap(link, s)
        Q := build_global_heap(S, q)
        while size(Q) > k do {
            u := extract_max(Q)
            v := max(q[u])
            delete(Q, v)
            w := merge(u, v)
            for each x ∈ q[u] ∪ q[v] do {
                link[x, w] := link[x, u] + link[x, v]
                delete(q[x], u); delete(q[x], v)
                insert(q[x], w, g(w, x)); insert(q[w], x, g(w, x))
                update(Q, x, q[x])
            }
            insert(Q, w, q[w])
            deallocate(q[u]); deallocate(q[v])
        }
    end
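The merge loop can be illustrated with a deliberately naive Python sketch that drops the local and global heaps and simply rescans all cluster pairs on every pass. Several things here are assumptions rather than part of the slides: the choice $f(\theta) = (1 - \theta)/(1 + \theta)$, the use of the Jaccard coefficient on set-valued points, the early stop when no pair of clusters shares a link, and the toy data.

```python
from itertools import combinations


def jaccard(a, b):
    return len(a & b) / len(a | b)


def rock(points, sim, theta, k, f=lambda t: (1 - t) / (1 + t)):
    """Naive ROCK sketch: returns clusters as sorted lists of point indices."""
    n = len(points)
    # Two points are neighbors when their similarity exceeds theta.
    nbrs = [{j for j in range(n) if j != i and sim(points[i], points[j]) > theta}
            for i in range(n)]
    # link[(i, j)] = number of common neighbors of points i and j (i < j).
    link = {(i, j): len(nbrs[i] & nbrs[j]) for i, j in combinations(range(n), 2)}
    expo = 1 + 2 * f(theta)  # must exceed 1, otherwise the denominator below is 0

    def cross_links(c1, c2):
        return sum(link[(min(a, b), max(a, b))] for a in c1 for b in c2)

    def goodness(c1, c2):
        n1, n2 = len(c1), len(c2)
        return cross_links(c1, c2) / ((n1 + n2) ** expo - n1 ** expo - n2 ** expo)

    clusters = [frozenset([i]) for i in range(n)]
    while len(clusters) > k:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: goodness(*pair))
        if cross_links(c1, c2) == 0:
            break  # no remaining pair of clusters shares any link
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return [sorted(c) for c in clusters]


# Hypothetical transactions forming two groups with disjoint items.
points = [
    {"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"b", "c", "d"},
    {"x", "y", "z"}, {"x", "y", "w"}, {"x", "z", "w"},
]
print(sorted(rock(points, jaccard, theta=0.4, k=2)))  # [[0, 1, 2, 3], [4, 5, 6]]
```

The heaps in the pseudocode exist to avoid this repeated rescan; the sketch trades that efficiency for brevity.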

The COBWEB Conceptual
Clustering Algorithm

The COBWEB algorithm was developed by machine learning researchers in the 1980s for clustering objects in an object-attribute data set.

The COBWEB algorithm yields a clustering dendrogram, called a classification tree, that characterizes each cluster with a probabilistic description.

The Classification Tree Generated
by the COBWEB Algorithm

The Category Utility Function

The COBWEB algorithm operates based on the so-called category utility (CU) function, which measures clustering quality.

If we partition a set of objects into $m$ clusters, then the CU of this particular partition is

$$CU = \frac{\sum_{k=1}^{m} P(C_k) \left[ \sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2 - \sum_i \sum_j P(A_i = V_{ij})^2 \right]}{m}$$

Insights of the CU Function

For a given object in cluster $C_k$, if we guess its attribute values according to their probabilities of occurring, then the expected number of attribute values that we can correctly guess is

$$\sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2$$

Given an object without knowing the cluster that the object is in, if we guess its attribute values according to their probabilities of occurring, then the expected number of attribute values that we can correctly guess is

$$\sum_i \sum_j P(A_i = V_{ij})^2$$

$P(C_k)$ is incorporated in the CU function to give proper weighting to each cluster.

Finally, $m$ is placed in the denominator to prevent over-fitting.
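The CU function can be computed directly from counts. The following is a sketch under assumptions: each cluster is a non-empty list of objects, each object is a dict from attribute name to value, and all probabilities are estimated as relative frequencies. The function name and the toy partitions are hypothetical.

```python
from collections import Counter


def category_utility(clusters):
    """CU of a partition given as a list of non-empty clusters of dict objects."""
    objects = [obj for cluster in clusters for obj in cluster]
    n, m = len(objects), len(clusters)

    def expected_correct(objs):
        # sum_i sum_j P(A_i = V_ij)^2, estimated from the given objects
        counts = Counter((attr, value) for obj in objs for attr, value in obj.items())
        return sum((count / len(objs)) ** 2 for count in counts.values())

    baseline = expected_correct(objects)
    gain = sum(len(c) / n * (expected_correct(c) - baseline) for c in clusters)
    return gain / m


dark = {"color": "dark", "size": "long"}
light = {"color": "light", "size": "short"}
print(category_utility([[dark, dark], [light, light]]))  # 0.5
print(category_utility([[dark, light], [dark, light]]))  # 0.0
```

The first partition separates the two kinds of objects, so knowing the cluster makes every attribute value predictable; the second partition carries no information, and its CU is zero.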

Operation of the COBWEB
algorithm

The COBWEB algorithm constructs a
classification tree incrementally by inserting
the objects into the classification tree one by
one.

When inserting an object into the classification tree, the COBWEB algorithm traverses the tree top-down starting from the root node.

At each node, the COBWEB algorithm considers 4 possible operations and selects the one that yields the highest CU function value:

insert.

create.

merge.

split.

Insertion means that the new object is
inserted into one of the existing child nodes.
The COBWEB algorithm evaluates the
respective CU function value of inserting
the new object into each of the existing
child nodes and selects the one with the
highest score.

The COBWEB algorithm also considers
creating a new child node specifically for
the new object.

The COBWEB algorithm considers
merging the two existing child nodes with
the highest and second highest scores.

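[Figure: Merge. Before the merge, node P has children A and B. After the merge, P has a new child N, and A and B are children of N.]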

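The COBWEB algorithm considers splitting the existing child node with the highest score.

[Figure: Split. Before the split, node P has a child N whose children are A and B. After the split, N is removed and A and B are children of P.]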

The COBWEB Algorithm

    Cobweb(N, I)
        If N is a terminal node,
        Then Create-new-terminals(N, I)
             Incorporate(N, I).
        Else Incorporate(N, I).
             For each child C of node N,
                 Compute the score for placing I in C.
             Let P be the node with the highest score W.
             Let Q be the node with the second highest score.
             Let X be the score for placing I in a new node R.
             Let Y be the score for merging P and Q into one node.
             Let Z be the score for splitting P into its children.
             If W is the best score,
             Then Cobweb(P, I) (place I in category P).
             Else if X is the best score,
             Then initialize R's probabilities using I's values
                  (place I by itself in the new category R).
             Else if Y is the best score,
             Then let O be Merge(P, Q, N).
                  Cobweb(O, I).
             Else if Z is the best score,
             Then Split(P, N).
                  Cobweb(N, I).

Input:

The current node N in the concept hierarchy.

An unclassified (attribute-value) instance I.

Results:

A concept hierarchy that classifies the instance.

Top-level call:

Cobweb(Top-node, I).

Variables:

C, P, Q, and R are nodes in the hierarchy.

W, X, Y, and Z are clustering (partition) scores.

Auxiliary COBWEB Operations

Variables:

N, O, P, and R are nodes in the hierarchy.

I is an unclassified instance.

A is a nominal attribute.

V is a value of an attribute.

    Incorporate(N, I)
        Update the probability of category N.
        For each attribute A in instance I,
            For each value V of A,
                Update the probability of V given category N.

    Create-new-terminals(N, I)
        Create a new child M of node N.
        Initialize M's probabilities to those for N.
        Create a new child O of node N.
        Initialize O's probabilities using I's values.

    Merge(P, R, N)
        Make O a new child of N.
        Set O's probabilities to be P and R's average.
        Remove P and R as children of node N.
        Add P and R as children of node O.
        Return O.

    Split(P, N)
        Remove the child P of node N.
        Promote the children of P to be children of N.

Probability-Based Clustering

The foundation of the probability-based clustering approach is the so-called finite mixture model.

A mixture is a set of $k$ probability distributions, each of which governs the distribution of attribute values in one cluster.

A 2-Cluster Example of the Finite Mixture Model

In this example, it is assumed that there are
two clusters and the attribute value
distributions in both clusters are normal
distributions.

[Figure: the density curves of the two clusters, labeled $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$.]

The Data Set

A 51 B 62 B 64 A 48 A 39 A 51

A 43 A 47 A 51 B 64 B 62 A 48

B 62 A 52 A 52 A 51 B 64 B 64

B 64 B 64 B 62 B 63 A 52 A 42

A 45 A 51 A 49 A 43 B 63 A 48

A 42 B 65 A 48 B 65 B 64 A 41

A 46 A 48 B 62 B 66 A 48

A 45 A 49 A 43 B 65 B 64

A 45 A 46 A 40 A 46 A 48

Operation of the EM Algorithm

The EM algorithm is used to figure out the parameters of the finite mixture model.

Let $\{s_1, s_2, \ldots, s_n\}$ denote the set of samples.

In this example, we need to figure out the following 5 parameters: $\mu_1$, $\sigma_1$, $\mu_2$, $\sigma_2$, and $P(C_1)$.

For a general 1-dimensional case that has $k$ clusters, we need to figure out a total of $2k + (k - 1)$ parameters.

The EM algorithm begins with an initial
guess of the parameter values.

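Then, the probabilities that sample $s_i$ belongs to these two clusters are computed as follows:

$$\text{Prob}[C_1 \mid x_i] = \frac{\text{Prob}[x_i \mid C_1]\,\text{Prob}[C_1]}{\text{Prob}[x_i]} = \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(x_i - \mu_1)^2}{2\sigma_1^2}} \cdot \frac{\text{Prob}[C_1]}{\text{Prob}[x_i]}$$

$$\text{Prob}[C_2 \mid x_i] = \frac{\text{Prob}[x_i \mid C_2]\,\text{Prob}[C_2]}{\text{Prob}[x_i]} = \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{(x_i - \mu_2)^2}{2\sigma_2^2}} \cdot \frac{\text{Prob}[C_2]}{\text{Prob}[x_i]}$$

$$p_i = \frac{\text{Prob}[C_1 \mid x_i]}{\text{Prob}[C_1 \mid x_i] + \text{Prob}[C_2 \mid x_i]},$$

where $x_i$ is the attribute value of sample $s_i$.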

The new estimated values of parameters
are computed as follows.

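$$\mu_1 = \frac{\sum_{i=1}^{n} p_i x_i}{\sum_{i=1}^{n} p_i}; \qquad \sigma_1^2 = \frac{\sum_{i=1}^{n} p_i (x_i - \mu_1)^2}{\sum_{i=1}^{n} p_i};$$

$$\mu_2 = \frac{\sum_{i=1}^{n} (1 - p_i) x_i}{\sum_{i=1}^{n} (1 - p_i)}; \qquad \sigma_2^2 = \frac{\sum_{i=1}^{n} (1 - p_i)(x_i - \mu_2)^2}{\sum_{i=1}^{n} (1 - p_i)};$$

$$P(C_1) = \frac{\sum_{i=1}^{n} p_i}{n}.$$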

The process is repeated until the clustering
results converge.
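The iteration can be sketched in plain Python on the 51 values listed under "The Data Set". The initial guesses, the fixed number of iterations (used here instead of a convergence test), and the function names are assumptions made for illustration; the A/B labels of the data set are not used.

```python
import math

# The 51 attribute values from "The Data Set", without their A/B labels.
data = [51, 62, 64, 48, 39, 51, 43, 47, 51, 64, 62, 48,
        62, 52, 52, 51, 64, 64, 64, 64, 62, 63, 52, 42,
        45, 51, 49, 43, 63, 48, 42, 65, 48, 65, 64, 41,
        46, 48, 62, 66, 48, 45, 49, 43, 65, 64,
        45, 46, 40, 46, 48]


def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def em_two_normals(xs, mu1, sigma1, mu2, sigma2, p_c1, iterations=50):
    n = len(xs)
    for _ in range(iterations):
        # E-step: p[i] is the probability that sample i belongs to C1.
        p = []
        for x in xs:
            w1 = p_c1 * normal_pdf(x, mu1, sigma1)
            w2 = (1 - p_c1) * normal_pdf(x, mu2, sigma2)
            p.append(w1 / (w1 + w2))
        # M-step: re-estimate the five parameters from the soft memberships.
        total1 = sum(p)
        total2 = n - total1
        mu1 = sum(pi * x for pi, x in zip(p, xs)) / total1
        mu2 = sum((1 - pi) * x for pi, x in zip(p, xs)) / total2
        sigma1 = math.sqrt(sum(pi * (x - mu1) ** 2 for pi, x in zip(p, xs)) / total1)
        sigma2 = math.sqrt(sum((1 - pi) * (x - mu2) ** 2 for pi, x in zip(p, xs)) / total2)
        p_c1 = total1 / n
    return mu1, sigma1, mu2, sigma2, p_c1


# Assumed initial guess: one cluster centered low, the other high.
mu1, sigma1, mu2, sigma2, p_c1 = em_two_normals(data, 45.0, 5.0, 60.0, 5.0, 0.5)
print(f"mu1={mu1:.2f} sigma1={sigma1:.2f} mu2={mu2:.2f} sigma2={sigma2:.2f} P(C1)={p_c1:.3f}")
```

Because the common factor $\text{Prob}[x_i]$ cancels in $p_i$, the sketch never needs to compute it.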

Generally, we attempt to maximize the
following likelihood function:

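$$\prod_i \big[ P(C_1)\,P(x_i \mid C_1) + P(C_2)\,P(x_i \mid C_2) \big] = \prod_i \big[ P(C_1 \mid x_i)\,P(x_i) + P(C_2 \mid x_i)\,P(x_i) \big] = \text{coff} \cdot \prod_i \big[ P(C_1 \mid x_i) + P(C_2 \mid x_i) \big].$$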

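Once we have figured out the approximate parameter values, we assign sample $s_i$ to $C_1$ if

$$p_i = \frac{\text{Prob}[C_1 \mid x_i]}{\text{Prob}[C_1 \mid x_i] + \text{Prob}[C_2 \mid x_i]} > 0.5,$$

where

$$\text{Prob}[C_1 \mid x_i] = \frac{\text{Prob}[x_i \mid C_1]\,\text{Prob}[C_1]}{\text{Prob}[x_i]} = \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(x_i - \mu_1)^2}{2\sigma_1^2}} \cdot \frac{\text{Prob}[C_1]}{\text{Prob}[x_i]}$$

$$\text{Prob}[C_2 \mid x_i] = \frac{\text{Prob}[x_i \mid C_2]\,\text{Prob}[C_2]}{\text{Prob}[x_i]} = \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{(x_i - \mu_2)^2}{2\sigma_2^2}} \cdot \frac{\text{Prob}[C_2]}{\text{Prob}[x_i]}$$

Otherwise, $s_i$ is assigned to $C_2$.

The Finite Mixture Model for Multiple Attributes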

The finite mixture model described above
can be easily generalized to handle multiple
independent attributes.

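For example, in a case that has two independent attributes, the distribution function of cluster $j$ is of the form:

$$f_j(x, y) = \frac{1}{2\pi\,\sigma_{xj}\,\sigma_{yj}}\, e^{-\frac{(x - \mu_{xj})^2}{2\sigma_{xj}^2}}\, e^{-\frac{(y - \mu_{yj})^2}{2\sigma_{yj}^2}}$$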

Assume that there are 3 clusters in a 2-dimensional data set. Then, we have 14 parameters to be determined: $\mu_{x1}$, $\mu_{y1}$, $\sigma_{x1}$, $\sigma_{y1}$, $\mu_{x2}$, $\mu_{y2}$, $\sigma_{x2}$, $\sigma_{y2}$, $\mu_{x3}$, $\mu_{y3}$, $\sigma_{x3}$, $\sigma_{y3}$, $P(C_1)$, and $P(C_2)$.

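The probability that sample $s_i$ belongs to $C_j$ is:

$$\text{Prob}[C_j \mid (x_i, y_i)] = \frac{\text{Prob}[(x_i, y_i) \mid C_j]\,\text{Prob}[C_j]}{\text{Prob}[(x_i, y_i)]} = \frac{1}{2\pi\,\sigma_{xj}\,\sigma_{yj}}\, e^{-\frac{(x_i - \mu_{xj})^2}{2\sigma_{xj}^2}}\, e^{-\frac{(y_i - \mu_{yj})^2}{2\sigma_{yj}^2}} \cdot \frac{\text{Prob}[C_j]}{\text{Prob}[(x_i, y_i)]}$$

$$p_{ji} = \frac{\text{Prob}[C_j \mid (x_i, y_i)]}{\text{Prob}[C_1 \mid (x_i, y_i)] + \text{Prob}[C_2 \mid (x_i, y_i)] + \text{Prob}[C_3 \mid (x_i, y_i)]}$$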


The new estimated values of the parameters
are computed as follows:

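$$\mu_{xj} = \frac{\sum_{i=1}^{n} p_{ji}\, x_i}{\sum_{i=1}^{n} p_{ji}}; \qquad \mu_{yj} = \frac{\sum_{i=1}^{n} p_{ji}\, y_i}{\sum_{i=1}^{n} p_{ji}};$$

$$\sigma_{xj}^2 = \frac{\sum_{i=1}^{n} p_{ji}\,(x_i - \mu_{xj})^2}{\sum_{i=1}^{n} p_{ji}}; \qquad \sigma_{yj}^2 = \frac{\sum_{i=1}^{n} p_{ji}\,(y_i - \mu_{yj})^2}{\sum_{i=1}^{n} p_{ji}};$$

$$P(C_j) = \frac{\sum_{i=1}^{n} p_{ji}}{n}.$$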

Limitation of the Finite Mixture
Model and the EM Algorithm

The finite mixture model and the EM
algorithm generally assume that the
attributes are independent.

Approaches have been proposed for
handling correlated attributes. However,
these approaches are subject to further
limitations.

Generalization of the Finite
Mixture Model and the EM
Algorithm

The finite mixture model and the EM algorithm
can be generalized to handle other types of
probability distributions.

For example, if we want to partition the objects into $k$ clusters based on $m$ independent nominal attributes, then we can apply the EM algorithm to figure out the parameters required to describe the distribution.

In this case, the total number of parameters is equal to

$$k \sum_{i=1}^{m} |A_i| + (k - 1),$$

where $|A_i|$ is the number of possible values of attribute $A_i$.

If two attributes are correlated, then we can merge these two attributes to form an attribute with $|A_i| \cdot |A_j|$ possible values.

An Example

Assume that we want to partition 100
samples of a particular species of insects
into 3 clusters according to 4 attributes:

Color ($A_c$): milk, light brown, or dark brown;

Shape ($A_h$): spherical or triangular;

Body length ($A_l$): long or short;

Weight ($A_w$): heavy or light.

If we determine that body length and weight are correlated, then we create a composite attribute $A_s$ = (length, weight) with 4 possible values: (L, H), (L, L), (S, H), and (S, L).

We can figure out the values of the parameters in the following table with the EM algorithm, in addition to $P(C_1)$, $P(C_2)$, and $P(C_3)$:

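| Cluster | Color | Shape | (Body length, Weight) |
|---|---|---|---|
| $C_1$ | $P(M \mid C_1)$, $P(L \mid C_1)$, $P(D \mid C_1)$ | $P(S \mid C_1)$, $P(T \mid C_1)$ | $P((L,H) \mid C_1)$, $P((S,H) \mid C_1)$, $P((L,L) \mid C_1)$, $P((S,L) \mid C_1)$ |
| $C_2$ | $P(M \mid C_2)$, $P(L \mid C_2)$, $P(D \mid C_2)$ | $P(S \mid C_2)$, $P(T \mid C_2)$ | $P((L,H) \mid C_2)$, $P((S,H) \mid C_2)$, $P((L,L) \mid C_2)$, $P((S,L) \mid C_2)$ |
| $C_3$ | $P(M \mid C_3)$, $P(L \mid C_3)$, $P(D \mid C_3)$ | $P(S \mid C_3)$, $P(T \mid C_3)$ | $P((L,H) \mid C_3)$, $P((S,H) \mid C_3)$, $P((L,L) \mid C_3)$, $P((S,L) \mid C_3)$ |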

We invoke the EM algorithm with an initial
guess of these parameter values.

For each sample $s_i = (v_1, v_2, v_3)$, we compute the following probabilities:

$$\text{Prob}[C_j \mid (v_1, v_2, v_3)] = \frac{\text{Prob}[(v_1, v_2, v_3) \mid C_j]\,\text{Prob}[C_j]}{\text{Prob}[(v_1, v_2, v_3)]} = \frac{\text{Prob}[v_1 \mid C_j]\,\text{Prob}[v_2 \mid C_j]\,\text{Prob}[v_3 \mid C_j]\,\text{Prob}[C_j]}{\text{Prob}[(v_1, v_2, v_3)]}$$

$$p_{ji} = \frac{\text{Prob}[C_j \mid (v_1, v_2, v_3)]}{\text{Prob}[C_1 \mid (v_1, v_2, v_3)] + \text{Prob}[C_2 \mid (v_1, v_2, v_3)] + \text{Prob}[C_3 \mid (v_1, v_2, v_3)]}$$

The new estimated values of the parameters
are computed as follows:

$$P(M \mid C_j) = \frac{\sum_{i:\ \text{color}(s_i) = M} p_{ji}}{\sum_{i=1}^{n} p_{ji}}; \qquad P(C_j) = \frac{\sum_{i=1}^{n} p_{ji}}{n}.$$
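The nominal case can be sketched in the same way as the normal case. The code below is an illustration under several assumptions: it starts from random soft memberships rather than from guessed parameter values, it adds a small smoothing constant `eps` so that no estimated probability is exactly zero, it runs a fixed number of iterations, and the insect samples are invented.

```python
import random


def em_nominal(samples, k, iterations=30, seed=0, eps=1e-6):
    """EM for a mixture over independent nominal attributes.
    samples: list of equal-length tuples. Returns (priors, cond, p), where
    cond[j][a][v] estimates P(v | C_j) for attribute a and p[j][i] is p_ji."""
    rng = random.Random(seed)
    n, m = len(samples), len(samples[0])
    values = [sorted({s[a] for s in samples}) for a in range(m)]
    # Assumed initialization: random soft memberships, normalized per sample.
    p = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    for i in range(n):
        total = sum(p[j][i] for j in range(k))
        for j in range(k):
            p[j][i] /= total
    priors, cond = [], []
    for _ in range(iterations):
        # M-step: P(C_j) and P(v | C_j) from the memberships (with smoothing).
        mass = [sum(p[j]) for j in range(k)]
        priors = [(mass[j] + eps) / (n + k * eps) for j in range(k)]
        cond = [[{v: (sum(p[j][i] for i in range(n) if samples[i][a] == v) + eps)
                     / (mass[j] + eps * len(values[a]))
                  for v in values[a]}
                 for a in range(m)]
                for j in range(k)]
        # E-step: p_ji is proportional to P(C_j) * prod_a P(v_a | C_j).
        for i, s in enumerate(samples):
            w = [priors[j] for j in range(k)]
            for j in range(k):
                for a in range(m):
                    w[j] *= cond[j][a][s[a]]
            total = sum(w)
            for j in range(k):
                p[j][i] = w[j] / total
    return priors, cond, p


# Hypothetical insects: (color, shape, (body length, weight)).
samples = [
    ("M", "S", ("L", "H")), ("M", "S", ("L", "H")), ("M", "S", ("L", "L")),
    ("D", "T", ("S", "L")), ("D", "T", ("S", "L")), ("D", "T", ("S", "H")),
    ("L", "S", ("L", "L")), ("L", "T", ("S", "H")),
]
priors, cond, p = em_nominal(samples, k=3)
print([round(x, 3) for x in priors])
```

The composite attribute is handled like any other nominal attribute; its values simply happen to be pairs.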