
On statistical models of cluster stability

Z. Volkovich (a, b), Z. Barzily (a), L. Morozensky (a)

a. Software Engineering Department, ORT Braude College of Engineering, Karmiel, Israel
b. Department of Mathematics and Statistics, The University of Maryland, Baltimore County, UMBC, Baltimore, USA


What is Clustering?

Clustering deals with partitioning a data set into groups of elements that are similar to each other.

Group membership is determined by means of a distance-like function that measures the resemblance between two data points.


Goal of the paper



In the current paper we present a method for
assessing cluster stability.




This method, combined with a clustering
algorithm, yields an estimate of a data
partition, namely, the number of clusters and
the attributes of each cluster.


Concept of the paper

The basic idea of our method is that if one "properly" clusters two independent samples, then, under the assumption of a consistent clustering algorithm, the clustered samples can be classified as two samples drawn from the same population.


The Model

Conclusion: the problem we are dealing with belongs to the domain of hypothesis testing. Since no prior knowledge of the population distribution is available, a distribution-free two-sample test can be applied.



Two-sample test

Which two-sample tests can be used for our purpose? There are several possibilities. We consider the two-sample test built on the negative definite kernels approach proposed by A. A. Zinger, A. V. Kakosyan and L. B. Klebanov (1989) and L. Klebanov (2003).

This approach is very similar to the one proposed by G. Zech and B. Aslan (2005).

Applications of these distances to the characterization of distributions were also discussed by L. Klebanov, T. Kozubowskii, S. Rachev and V. Volkovich (2001).






Negative Definite Kernels

A real symmetric function N is negative definite if for any n ≥ 1, any x_1, …, x_n ∈ X, and any real numbers c_1, …, c_n such that

$$\sum_{i=1}^{n} c_i = 0,$$

the following holds:

$$\sum_{i,j=1}^{n} N(x_i, x_j)\, c_i c_j \le 0.$$

The kernel is called strongly negative definite if the equality in this relationship is reached only if c_i = 0, i = 1, …, n.

Example

Functions of the type φ(x) = ||x||^r, 0 < r ≤ 2, produce negative definite kernels, which are strongly negative definite if 0 < r < 2.

It is important to note that a negative definite kernel N_2 can be obtained from a negative definite kernel N_1 by the transformations N_2 = N_1^α, 0 < α < 1, and N_2 = ln(1 + N_1).
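As a sanity check on the definition, the short script below verifies numerically (it is not a proof) that N(x, y) = ||x − y||^r from the example behaves as negative definite: for any weights c summing to zero, the quadratic form stays non-positive. The dimension, sample size and the choice r = 1.5 are arbitrary illustration parameters.

```python
# Numerical sanity check that N(x, y) = ||x - y||^r (0 < r <= 2) behaves as
# a negative definite kernel: with sum(c) = 0, the quadratic form is <= 0.
import numpy as np

def quadratic_form(points, c, r):
    """Compute sum_{i,j} ||x_i - x_j||^r * c_i * c_j."""
    diffs = points[:, None, :] - points[None, :, :]
    kernel = np.linalg.norm(diffs, axis=2) ** r
    return c @ kernel @ c

rng = np.random.default_rng(0)
for trial in range(1000):
    x = rng.normal(size=(10, 3))   # 10 random points in R^3
    c = rng.normal(size=10)
    c -= c.mean()                  # enforce sum(c) = 0
    q = quadratic_form(x, c, r=1.5)
    assert q <= 1e-9, f"positive quadratic form {q} found"
print("quadratic form was non-positive in all 1000 trials")
```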



Negative Definite Kernel test

We restrict ourselves to the hard clustering situation, where the partition is defined by a set of associations

$$v(x, C_j) = \begin{cases} 1, & x \in C_j, \\ 0, & x \notin C_j. \end{cases}$$

In this case, the underlying distribution of X is

$$F_X = \sum_{j=1}^{k} p_{c_j}\, \nu_{c_j}, \qquad (*)$$

where p_{c_j} are the cluster probabilities and ν_{c_j} are the inner-cluster distributions.
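For intuition, here is a minimal sampler for a mixture of the form (*): first draw a cluster label with the probabilities p_{c_j}, then draw the point from that cluster's inner distribution. The Gaussian inner distributions and the specific weights are illustrative assumptions, not part of the model.

```python
# Toy sampler for the mixture model (*): hard cluster membership plus
# inner-cluster distributions (Gaussians are an illustrative assumption).
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])                           # cluster probabilities p_{c_j}
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # one center per inner distribution

labels = rng.choice(len(p), size=500, p=p)     # draw the association: v(x, C_j) = 1 for one j
X = means[labels] + rng.normal(size=(500, 2))  # x ~ nu_{c_j}, a unit-variance Gaussian here
```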


Negative Definite Kernel test (2)

We consider kernels

$$N(x_1, x_2, c_1, c_2) = N_x(x_1, x_2)\, \chi(c_1 = c_2),$$

where N_x(x_1, x_2) is a negative definite kernel and χ(c_1 = c_2) is the indicator function of the event {c_1 = c_2}. Formally speaking, this kernel is not a negative definite kernel. However, a distance can be constructed as

$$L(\mu_1, \mu_2) = \sum_{c_1=1}^{k} \sum_{c_2=1}^{k} p^{(1)}_{c_1}\, p^{(2)}_{c_2} \iint N(x_1, x_2, c_1, c_2)\, \mu_1(dx_1)\, \mu_2(dx_2)$$

and

Dis(μ, ν) = L(μ, μ) + L(ν, ν) − 2L(μ, ν).


Negative Definite Kernel test (3)

Theorem. Let N(x_1, x_2, c_1, c_2) be the negative definite kernel described above, and let μ and ν be two measures satisfying (*) such that P_μ(c|x) = P_ν(c|x). Then:

• Dis(μ, ν) ≥ 0;

• if N_x is a strongly negative definite function, then Dis(μ, ν) = 0 if and only if μ = ν.





Negative Definite Kernel test (4)

Let S_1: x_1, x_2, …, x_n and S_2: y_1, y_2, …, y_n be two samples of independent random vectors having probability laws F and G, respectively. We wish to test the hypothesis

H_0: F = G

against the alternative

H_1: F ≠ G

when the distributions F and G are unknown.
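One standard distribution-free way to calibrate such a test, sketched here only as an illustration (the slides instead develop a concentration-based procedure), is a permutation test: under H_0 the pooled sample is exchangeable, so the observed distance is compared to distances between random re-splits. The argument dist_stat stands for any two-sample distance, e.g. the Dis statistic above.

```python
# Generic permutation two-sample test: estimate the null distribution of a
# distance statistic by re-splitting the pooled sample at random.
import numpy as np

def permutation_pvalue(dist_stat, S1, S2, num_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.vstack([S1, S2])
    n = len(S1)
    observed = dist_stat(S1, S2)
    hits = 0
    for _ in range(num_perm):
        idx = rng.permutation(len(pooled))
        # Under H0: F = G, any re-split of the pooled data is equally likely.
        if dist_stat(pooled[idx[:n]], pooled[idx[n:]]) >= observed:
            hits += 1
    return (hits + 1) / (num_perm + 1)   # small p-value: reject H0
```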


Algorithm description

Let us suppose that a hard clustering algorithm Cl, based on the probability model, is available.

Input parameters: a sample S and a predefined number of clusters k.

Output parameters: a clustered sample S(k) = (S, C_k), consisting of a vector C_k of the cluster labels of S.

For two given disjoint samples S_1 and S_2 we consider a clustered sample (S_1 ∪ S_2, C_k) and denote by c the mapping from this clustered sample to C_k.

Algorithm description (2)

Let us introduce

$$L(S_1, S_2) = \sum_{j=1}^{k} \frac{1}{|C_j|} \sum_{x_1 \in S_1} \sum_{x_2 \in S_2} N^{*}(x_1 - x_2)\, \chi\big(c(x_1) = j\big)\, \chi\big(c(x_2) = j\big),$$

where |C_j| is the size of the cluster number j.
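A direct transcription of this statistic into code, assuming N*(d) = ||d||^r with r = 1.5 (one kernel from the example slide) and counting |C_j| over the pooled clustered sample; both choices are illustrative rather than prescribed by the slides.

```python
# Sketch of L(S1, S2): pairs (x1, x2) assigned to the same cluster j
# contribute N*(x1 - x2), and each cluster's total is divided by |C_j|.
import numpy as np

def nd_kernel(diff, r=1.5):
    """N*(d) = ||d||^r, a strongly negative definite choice for 0 < r < 2."""
    return np.linalg.norm(diff) ** r

def L_statistic(S1, S2, labels1, labels2, k):
    all_labels = np.concatenate([labels1, labels2])
    total = 0.0
    for j in range(k):
        size_j = np.sum(all_labels == j)   # |C_j| over the pooled sample
        if size_j == 0:
            continue
        acc = 0.0
        for x1, c1 in zip(S1, labels1):
            if c1 != j:
                continue
            for x2, c2 in zip(S2, labels2):
                if c2 == j:
                    acc += nd_kernel(x1 - x2)
        total += acc / size_j
    return total
```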




Algorithm description (3)

The algorithm consists of the following steps:

1. for k = 1 to k*:
2. for n = 1 to N:
3. draw a sample S_n^(1) of size M from the data set X;
4. draw a sample S_n^(2) of size M from X \ S_n^(1), and cluster the united sample S_n^(1) ∪ S_n^(2) by Cl;
5. calculate Q_n = Q(S_n^(1), S_n^(2)) = L(S_n^(1), S_n^(1)) + L(S_n^(2), S_n^(2)) − 2 L(S_n^(1), S_n^(2));
6. calculate the standardized values of Q_n, n = 1, …, N: Q~_n = (Q_n − mean(Q)) / std(Q);
7. calculate a concentration index I_k of the sample {Q~_n}, n = 1, …, N;
8. the optimal value of k is chosen as the one yielding the extremum of I_k.
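Assembling steps 1 through 8, here is a minimal sketch of the resampling loop, assuming scikit-learn's KMeans as the clustering algorithm Cl (as in the experiments below) and the L_statistic sketch from the previous slide; num_resamples and sample_size are illustrative stand-ins for N and M, and the Q_n formula follows the Dis construction from the test slides.

```python
# Resampling loop for cluster stability: for each candidate k, repeatedly
# draw two disjoint samples, cluster them jointly, and record the
# standardized two-sample statistic Q~_n.
import numpy as np
from sklearn.cluster import KMeans

def stability_scores(X, k_max, num_resamples=100, sample_size=200, seed=0):
    rng = np.random.default_rng(seed)      # assumes len(X) >= 2 * sample_size
    scores = {}
    for k in range(2, k_max + 1):                         # step 1 (from k = 2)
        Q = np.empty(num_resamples)
        for n in range(num_resamples):                    # step 2
            idx = rng.permutation(len(X))
            S1 = X[idx[:sample_size]]                     # step 3
            S2 = X[idx[sample_size:2 * sample_size]]      # step 4 (disjoint)
            km = KMeans(n_clusters=k, n_init=10).fit(np.vstack([S1, S2]))
            lab1, lab2 = km.labels_[:sample_size], km.labels_[sample_size:]
            Q[n] = (L_statistic(S1, S1, lab1, lab1, k)    # step 5
                    + L_statistic(S2, S2, lab2, lab2, k)
                    - 2 * L_statistic(S1, S2, lab1, lab2, k))
        scores[k] = (Q - Q.mean()) / Q.std()              # step 6: {Q~_n}
    return scores   # steps 7-8 (concentration index, choice of k) follow below
```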












Algorithm description (4)

Remarks about the algorithm:

1. Need for standardization (Step 6):
i. The clustering algorithm may not determine the correct cluster for an outlier. This adds noise to the result.
ii. The noise level decreases in k, since fewer data elements are assigned to distant centroids.
iii. Standardization decreases the noise level.

2. Choice of the optimal k as the most concentrated (Step 8):
i. If k is less than the "true" number of clusters, then at least one cluster is formed by uniting two separate clusters and is thus less concentrated.
ii. If k is larger than the "true" number of clusters, then at least one cluster is formed in a location where there is a random concentration of data elements in the sample. This, again, decreases the concentration of {Q~_n}, because two clusters are not likely to have the same random concentration.

Numerical experiments

In order to evaluate the performance of the described methodology, we provide several numerical experiments on synthetic and real datasets. The selected samples (steps 3 and 4 of the algorithm) are clustered by applying the K-Means algorithm. The results obtained are used as inputs for steps 4 and 5 of the algorithm. The quality of the k* partitions is evaluated (step 7 of the algorithm) by three concentration statistics: Friedman's index, the KL-distance and the kurtosis.
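To show how one of these statistics plugs into steps 7 and 8, here is a sketch that consumes the scores dictionary from the loop above and scores each k by the sample kurtosis (scipy.stats.kurtosis is an assumed estimator choice); Friedman's index or the KL-distance would slot into the same place. Following the table on the next slide, where the indicated k attains the minimal index values, the extremum is taken as a minimum.

```python
# Steps 7-8 sketch: score each candidate k by the kurtosis of {Q~_n} and
# pick the extremum. Swap in Friedman's index or the KL-distance as desired.
from scipy.stats import kurtosis

def choose_k(scores):
    """scores: dict mapping k -> array of standardized values Q~_n."""
    index = {k: kurtosis(q) for k, q in scores.items()}
    # In the reported experiments the indicated k attains the minimal index,
    # so the extremum is taken as a minimum here.
    return min(index, key=index.get), index
```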

Numerical experiments (2)

We demonstrate the performance of our algorithm by comparing our clustering results to the "true" structure of real datasets. The dataset is chosen from the text collections available at http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/.

The set consists of the following three collections:

DC0: Medlars Collection (1033 medical abstracts).
DC1: CISI Collection (1460 information science abstracts).
DC2: Cranfield Collection (1400 aerodynamics abstracts).


Numerical experiments (3)

Following the well-known "bag of words" approach, 300 and 600 "best" terms were selected, and the thirty leading principal components were found. In the case when the number of samples and the size of the samples both equal 1000, for K(x, y) = ||x − y||^2 we obtained:

k                7        6        5        4        3        2
Friedman index   0.0647   0.0867   0.0919   0.0299   0.0150   0.0193
KL-distance      0.1711   0.1633   0.1541   0.0767   0.0558   0.0522
Kurtosis         7.2073   5.6070   4.6352   3.0289   2.4233   3.3275


Numerical experiments (4)

We can see that two of the indices, Friedman's index and the kurtosis, attain their extremum at k = 3, indicating three clusters in the data, in agreement with the three underlying document collections.

Thank you