Testing of Clustering


25 Nov 2013


Noga Alon, Seannie Dar

Michal Parnas, Dana Ron

Property Testing (Informal Definition)

For a fixed property P and any object O, determine whether O has property P, or whether O is far from having property P (i.e., far from any other object having P).

Task should be performed by querying the object (in as few places as possible).



Examples


- Object can be a graph (represented by its adjacency matrix), and property can be 3-colorability.

- Object can be a function and property can be linearity.




Context


Property testing can be viewed as:

- A relaxation of exactly deciding whether the object has the property.

- A relaxation of learning the object.

In either case want testing algorithm to be significantly more efficient than decision/learning algorithm.

When can Property Testing be Useful?


- Object is too large to even fully scan, so must make approximate decision.

- Object is not too large but
  (1) Exact decision is NP-hard (e.g. coloring)
  (2) Prefer sub-linear approximate algorithm to polynomial exact algorithm.

- Use Testing as preliminary step to exact decision or learning. In first case can quickly rule out object far from property. In second case can aid in efficiently selecting good hypothesis class.


Previous Work


- Testing algebraic properties - linearity, low-degree polynomials...

- Testing graph properties - bipartiteness, k-colorability, connectivity, acyclicity, first-order graph properties...

- Testing monotonicity of functions.

- Testing (properties defined by) regular languages, branching programs.

Testing of Clustering

Notation:

X - set of points in $\mathbb{R}^d$, |X| = n

dist(x,y) - distance between points x and y (e.g. Euclidean)

For any subset S of X:

$d(S) = \max_{x,y \in S} \mathrm{dist}(x,y)$ - Diameter of S

$r(S) = \min_{y} \max_{x \in S} \mathrm{dist}(x,y)$ - Radius of S

Definition Continued (for diameter cost)


X is (k,b)-clusterable if there exists a k-way partition (clustering) of X s.t. each cluster has diameter at most b.

X is ε-far from being (k,b')-clusterable (b' ≥ b) if there is no k-way partition of any Y ⊆ X, |Y| ≥ (1-ε)n, s.t. each cluster has diameter at most b'.

In first case algorithm should accept and in second reject with probability ≥ 2/3.
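The two definitions above can be made concrete with a brute-force check. This is only a sketch for tiny inputs, assuming Euclidean distance and reading "k-way partition" as "at most k non-empty clusters"; the names `diameter` and `is_clusterable` are illustrative and not from the talk.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)


def diameter(points):
    """d(S): largest pairwise distance; 0 for fewer than two points."""
    return max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)


def is_clusterable(points, k, b):
    """Exhaustive search: can points be split into at most k clusters of diameter <= b?"""
    clusters = [[] for _ in range(k)]

    def place(i):
        if i == len(points):
            return True
        for c in clusters:
            # A point may join a cluster only if it is within b of every member.
            if all(dist(points[i], q) <= b for q in c):
                c.append(points[i])
                if place(i + 1):
                    return True
                c.pop()
            if not c:
                # Remaining clusters are empty as well; trying them is symmetric.
                break
        return False

    return place(0)
```

The search is exponential in the number of points, so it only illustrates the definition; it is not an efficient procedure.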

Our Results


- For general metrics (obeying the triangle inequality) and both costs: b' = 2b, |S| = $\tilde{O}(k/\varepsilon)$. For b' < 2b, |S| = $\Omega(\sqrt{n/\varepsilon})$.

- For $L_2$ metric and radius cost: b' = b, |S| = $\tilde{O}(d \cdot k/\varepsilon)$.

- For $L_2$ metric and diameter cost: b' = (1+β)b, |S| = $\tilde{O}\big((k^2/\varepsilon) \cdot (2/\beta)^{2d}\big)$. Dependence on 1/β and exponential dependence on dimension d are unavoidable.

All algorithms select uniform sample S ⊆ X where |S| = poly(k, 1/ε).
Our Results cont.

Our algorithms can be used to obtain approximately good clusterings. That is, k-clusterings with cost at most b' of all but at most an ε-fraction of the points.

Independently, Mishra, Oblinger and Pitt give algorithms with similar complexities for other costs (e.g., sum of distances to center).

Related Work on Clustering


- Hard to approximate cost of optimal clustering to within constant factor (e.g., < 2) even for $L_2$ metric [HS,FG].

- An approximation factor of 2 can be achieved efficiently under both costs [FG].


Testing Diameter Clustering under the $L_2$ metric

Apply “natural” algorithm:

- Uniformly and independently select sample from X, having size $m = \tilde{O}\big((k^2/\varepsilon) \cdot (2/\beta)^{2d}\big)$.

- If sample is (k,b)-clusterable then ACCEPT, otherwise REJECT.

Verifying whether a sample of size m is (k,b)-clusterable (according to diam-cost) can be done in time [bound in m, d and k; expression not recoverable from the source].
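The "natural" algorithm is short enough to sketch directly. The version below is a minimal illustration, not the authors' code: the sample size `m` is passed in by the caller (the constants hidden in the $\tilde{O}$ bound are not given on the slide), the clusterability check is supplied as a callable, and `diameter_at_most` is a hypothetical helper covering only the k = 1 case.

```python
import random
from itertools import combinations
from math import dist


def diameter_at_most(points, b):
    """k = 1 check: every pair of sampled points is within distance b."""
    return all(dist(p, q) <= b for p, q in combinations(points, 2))


def natural_tester(X, m, sample_is_clusterable, rng=random):
    """Draw m points uniformly and independently (with replacement), then test the sample."""
    sample = [rng.choice(X) for _ in range(m)]
    return "ACCEPT" if sample_is_clusterable(sample) else "REJECT"
```

Because any subset of a (k,b)-clusterable set is itself (k,b)-clusterable, a tester of this shape never rejects a clusterable X; all of the analysis goes into the far case.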

Analysis of k=1 d=2 case

If X is (k,b)-clusterable then always accept.

Assume from now on that X is ε-far from (k,(1+β)b)-clusterable.

Will show that w.p. ≥ 2/3, sample contains at least 2 points at distance > b (causing rejection).

Analysis of k=1 d=2 case cont.

View sample as being selected in $p = 2\pi/\beta^2$ phases. In each phase make certain progress.

For $R \subseteq \mathbb{R}^2$, A(R) - area of R.

For x ∈ X, $C_x$ - circle of radius b centered at x.

$I_j$ - intersection of all circles centered at sample points at end of phase j.

For every x in sample, y in $I_j$: dist(x,y) ≤ b.

Analysis of k=1 d=2 case cont.

Say that point y ∈ X is influential w.r.t. $I_j$ if either $y \notin I_j$ or $A(I_j \cap C_y) < A(I_j) - (\beta b)^2/2$.

If in phase j+1 select point $y \notin I_j$: REJECT.

Otherwise, consider $I_j \cap C_y$. If $A(I_j \cap C_y) < A(I_j) - (\beta b)^2/2$, make progress.

Suppose influential point selected in each phase. After at most $2\pi/\beta^2$ phases get $y \notin I_j$.
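The phase count can be checked with a one-line area argument. The sketch below assumes that the region after the first sample point is a single circle of radius b (area $\pi b^2$) and that each phase with an influential point inside $I_j$ removes more than $(\beta b)^2/2$ of area; $t$ is a counter introduced here for the number of such phases.

```latex
\[
0 \;\le\; A(I_t) \;<\; \pi b^{2} - t \cdot \frac{(\beta b)^{2}}{2}
\quad\Longrightarrow\quad
t \;<\; \frac{\pi b^{2}}{(\beta b)^{2}/2} \;=\; \frac{2\pi}{\beta^{2}} .
\]
```

So the area can shrink at most $2\pi/\beta^2$ times before some selected influential point must fall outside $I_j$.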

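Analysis of k=1 d=2 case cont.

Claim: In each phase there are > εn influential points.

Follows from geometric lemma:

Lemma: For every non-influential $y \in I_j$, i.e. with $A(I_j \cap C_y) \ge A(I_j) - (\beta b)^2/2$, and every $z \in I_j$: dist(y,z) ≤ (1+β)b.

(If at most εn points were influential, the remaining points would all lie in $I_j$ at pairwise distance ≤ (1+β)b, so X would not be ε-far from (1,(1+β)b)-clusterable.)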

Generalizing Argument

d > 2 (k=1): Instead of circles $C_x$, consider balls $B_x$, and let $I_j$ be intersection of balls. Modify def. of influential, and prove analogous geometric lemma for balls.

k > 1: Also view sample as selected in phases. In each phase consider all k-way partitions of sample. Show that w.h.p. new sample contains influential point w.r.t. every partition-subset of every partition. After sufficient number of phases, every partition has diameter > b.

Finding Approximately Good Partitions

Assume X is (k,b)-clusterable. Then with prob. ≥ 2/3, partition of sample can be used to implicitly define a k-clustering having diameter at most (1+β)b of all but an ε-fraction of the points.

Idea: assign each non-influential point to appropriate cluster.

Lower Bound for Diameter

Suppose X consists of pairs of antipodal points on ball. Can position $(1/\beta)^{(d-1)/2}$ pairs at distance > (1+β)b, while any two non-antipodal points are at distance ≤ b. To get pair, need $\Omega\big((1/\beta)^{(d-1)/4}\big)$ examples.
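The sample-size bound is a birthday-paradox count. A sketch, assuming X consists of exactly $N = (1/\beta)^{(d-1)/2}$ antipodal pairs ($2N$ points) and that $s$ points are drawn uniformly and independently ($N$ and $s$ are names introduced here):

```latex
\[
\Pr[\text{sample contains an antipodal pair}]
\;\le\; \binom{s}{2} \cdot \frac{1}{2N}
\;<\; \frac{s^{2}}{4N},
\]
\[
\text{so constant detection probability requires }
s = \Omega\big(\sqrt{N}\big) = \Omega\Big((1/\beta)^{(d-1)/4}\Big).
\]
```

Any two fixed draws form an antipodal pair with probability $1/(2N)$, and the first inequality is a union bound over the $\binom{s}{2}$ pairs of draws.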

Conclusions and Further Research



- Described sub-linear algorithms for testing of clustering under various cost measures, which can be used for finding approximately good clusterings.

- Other natural cost measures (not covered by [MOP])?

- Practical applications?

Testing Radius Clustering under the $L_2$ metric

Here too apply “natural” algorithm:

- Uniformly and independently select sample from X, having size $m = \tilde{O}(d \cdot k/\varepsilon)$.

- If sample is (k,b)-clusterable then ACCEPT, otherwise REJECT.

Verifying whether a sample of size m is (k,b)-clusterable (according to radius-cost) can be done in time [bound in m, d and k; expression not recoverable from the source].
Analysis of Radius Clustering

Definitions:

$\mathcal{S}$ - family of subsets in $\mathbb{R}^d$, R - subset of $\mathbb{R}^d$, 0 < ε < 1. Say that subset N of R is an ε-net of R w.r.t. $\mathcal{S}$ if for every S in $\mathcal{S}$ s.t. $|S \cap R| > \varepsilon |R|$, there exists a point x in $N \cap S$. (N “hits” every S that has non-negligible intersection with R.)

Subset A of $\mathbb{R}^d$ is shattered by $\mathcal{S}$ if for every $A' \subseteq A$ there exists S in $\mathcal{S}$ s.t. $S \cap A = A'$.

The VC-dim of $\mathcal{S}$, VCD($\mathcal{S}$), is the maximum size of a subset A that is shattered by $\mathcal{S}$.

Radius Clustering Cont.

Theorem: For any family of subsets $\mathcal{S}$ and subset R, with probability at least 2/3, a sample of size $m \ge \frac{8\,\mathrm{VCD}(\mathcal{S})}{\varepsilon} \log\frac{\mathrm{VCD}(\mathcal{S})}{\varepsilon}$ is an ε-net for R w.r.t. $\mathcal{S}$.

Claim: If X is ε-far from (k,b)-clusterable by radius cost, then alg rejects w.p. ≥ 2/3.

Proof: Let $B_{k,b}$ be the family of subsets of $\mathbb{R}^d$ defined by unions of k balls of radius b. Let $C_{k,b}$ be the family of complements of subsets in $B_{k,b}$. By assumption on X, for every S in $C_{k,b}$, $|S \cap X| > \varepsilon |X|$.


Proof Cont.

Thus, a subset of X is an ε-net of X w.r.t. $C_{k,b}$ iff it contains at least one point from every S in $C_{k,b}$.

It follows that if the sample selected is an ε-net of X, then it is not (k,b)-clusterable.

Since VCD($C_{k,b}$) = O(d k log k), by Theorem it suffices that sample be of size $m = \tilde{O}(d \cdot k/\varepsilon)$ so that get an ε-net and REJECT.
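The sample size follows by plugging the VC-dimension bound into the ε-net theorem. A short worked step, writing $V$ for VCD($C_{k,b}$) and assuming $\tilde{O}$ hides factors polylogarithmic in d, k and 1/ε:

```latex
\[
V = O(d\,k \log k)
\quad\Longrightarrow\quad
\frac{8V}{\varepsilon}\,\log\frac{V}{\varepsilon}
= O\!\left(\frac{d\,k \log k}{\varepsilon}\cdot \log\frac{d\,k \log k}{\varepsilon}\right)
= \tilde{O}\!\left(\frac{d \cdot k}{\varepsilon}\right).
\]
```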


Testing of Diameter Clustering Under General Metrics

Basic Idea: Try and find points in X that are representatives of different clusters.

Show:

- If X is (k,b)-clusterable, will find at most k representatives;

- If X is ε-far from (k,2b)-clusterable, will find k+1 representatives w.h.p.


General Metrics Algorithm

1. let rep_1 be arbitrary point in X;

2. i ← 1; find-new-rep ← TRUE;

3. while i < k+1 and find-new-rep = TRUE do
   (a) uniformly and independently select sample of size ln(3k)/ε;
   (b) if exists x in sample s.t. dist(x, rep_j) > b for every j ≤ i, then i ← i+1, rep_i ← x;
       else find-new-rep ← FALSE;

4. If i ≤ k then ACCEPT, otherwise REJECT
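Steps 1-4 translate almost line for line into code. The sketch below is illustrative only: the function name, the `dist` callable, and rounding the sample size ln(3k)/ε up to an integer are assumptions, and the first element of X stands in for the "arbitrary point".

```python
import math
import random


def general_metric_tester(X, k, b, eps, dist, rng=random):
    """Return (verdict, representatives) for the representative-finding tester."""
    reps = [X[0]]                                     # step 1: an arbitrary point of X
    sample_size = math.ceil(math.log(3 * k) / eps)    # assumption: round ln(3k)/eps up
    while len(reps) < k + 1:                          # step 3: loop while i < k+1
        sample = [rng.choice(X) for _ in range(sample_size)]   # step 3(a)
        # Step 3(b): look for a sampled point farther than b from every current rep.
        new_rep = next(
            (x for x in sample if all(dist(x, r) > b for r in reps)), None
        )
        if new_rep is None:
            break                                     # find-new-rep <- FALSE
        reps.append(new_rep)
    verdict = "ACCEPT" if len(reps) <= k else "REJECT"  # step 4
    return verdict, reps
```

The returned representatives are pairwise more than b apart by construction, which is what the analysis of the algorithm relies on.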

Analysis of Gen. Met. Alg.

Claim 1: If X is (k,b)-clusterable then alg always accepts.

Proof of Claim 1: Alg rejects only if finds k+1 points at pairwise distances all > b. If X is (k,b)-clusterable no such set exists.

Claim 2: If X is ε-far from (k,2b)-clusterable then alg rejects w.p. ≥ 2/3.


Analysis of Gen. Met. Alg. Cont.

Proof of Claim 2: Show that w.h.p. in each iteration, sample contains new representative, resulting in k+1 rep's ⇒ REJECT.

Consider i-th iteration, i ≤ k. Claim that must be at least εn points in X at distance > b from each rep_j, j ≤ i. Claim 2 follows since prob. of not selecting such point in some iteration is ≤ $k(1-\varepsilon)^{\ln(3k)/\varepsilon} < 1/3$.

To verify sub-claim, suppose that < εn such points. Let us remove these points. Then by tri-ineq., if assign each other point x to cluster j (≤ i ≤ k) s.t. rep_j is at distance at most b from x, then obtain (k,2b)-clustering, contradicting assumption on X.
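The failure-probability bound in the proof of Claim 2 uses only the inequality $1-\varepsilon < e^{-\varepsilon}$ (valid for ε > 0) and a union bound over at most k iterations:

```latex
\[
k\,(1-\varepsilon)^{\ln(3k)/\varepsilon}
\;<\; k\, e^{-\varepsilon \cdot \ln(3k)/\varepsilon}
\;=\; k \cdot \frac{1}{3k}
\;=\; \frac{1}{3}.
\]
```

This is what fixes the per-iteration sample size at ln(3k)/ε.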

Finding an Approx. Good Part.

Assume X is (k,b)-clusterable. Then analysis implies that with prob. ≥ 2/3, final rep's rep_1,…,rep_i, i ≤ k, can be used to (implicitly) define ε-good (k,2b)-clustering:

For each x in X s.t. exists rep_j with dist(x, rep_j) ≤ b, assign x to cluster j. Thus obtain (k,2b)-clustering of all but at most εn points in X.
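The implicit clustering is just a nearest-enough-representative rule. A minimal sketch, assuming ties are broken by taking the first representative within distance b (the slide does not say how ties are resolved); the function names are illustrative:

```python
def assign_cluster(x, reps, b, dist):
    """Index j of the first representative within distance b of x, or None if x is left out."""
    for j, rep in enumerate(reps):
        if dist(x, rep) <= b:
            return j
    return None


def implicit_clustering(X, reps, b, dist):
    """Group X by assigned representative; also return the points left unassigned."""
    clusters = [[] for _ in reps]
    unassigned = []
    for x in X:
        j = assign_cluster(x, reps, b, dist)
        (unassigned if j is None else clusters[j]).append(x)
    return clusters, unassigned
```

By the triangle inequality, any two points assigned to the same representative are at distance at most 2b, which is where the factor 2 in the (k,2b)-clustering comes from.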

Lower Bound for Gen. Metrics

If all that is known about distance func. btwn. points in X is that tri-ineq. holds, then cannot go below b' = 2b unless use sample of size $\Omega(\sqrt{n/\varepsilon})$.

(Matching edges have distance 2, all others have distance 1)

- If X does not contain matched pairs, then it is (1,1)-clusterable.

- If contains > εn matched pairs, then ε-far from (1,2−δ)-clusterable for every δ > 0.
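The matching construction is easy to check by hand or by machine. The sketch below encodes each point as a hypothetical `(pair_id, side)` tuple, so that two points are matched exactly when they share a `pair_id`; this encoding and the function names are illustrative choices, not part of the talk.

```python
from itertools import combinations, permutations


def matching_dist(u, v):
    """Matched points (same pair_id, different side) are at distance 2, all others at 1."""
    if u == v:
        return 0
    return 2 if u[0] == v[0] else 1


def satisfies_triangle_inequality(points, dist):
    """Check dist(x, z) <= dist(x, y) + dist(y, z) over all ordered triples."""
    return all(
        dist(x, z) <= dist(x, y) + dist(y, z) for x, y, z in permutations(points, 3)
    )


def diameter(points, dist):
    """Largest pairwise distance; 0 for fewer than two points."""
    return max((dist(p, q) for p, q in combinations(points, 2)), default=0)
```

A set with one point from each pair has diameter 1, while any set containing a matched pair has diameter 2, and a tester can only tell the two apart by sampling both endpoints of some matched pair.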


Property Testing - Background

- Initially defined by Rubinfeld and Sudan in the context of Program Testing (of algebraic functions).

- Goldreich, Goldwasser and Ron initiated study of testing properties of (undirected) graphs.

- Growing body of work deals with properties of functions, graphs, strings, sets of points ...

- Many algorithms with complexity that is sub-linear in (or even independent of) size of object.

Related Work on Clustering



- Can achieve approximation factor of (1+β) for radius cost [AP] in time $O(n \log k) + (k/\beta)^{O(d^2 k^{1-1/d})}$.