Incremental Clustering


Incremental Clustering


Previous clustering algorithms worked in
“batch” mode: they processed all points at
essentially the same time.


Some IR applications cluster an incoming
document stream (e.g., topic tracking).


For these applications, we need incremental
clustering algorithms.

Incremental Clustering Issues


How to be efficient? Should all
documents be cached?


How to handle or support concept drift?


How to reduce sensitivity to ordering?


Goals:


minimize the maximum cluster diameter


minimize the number of clusters given a
fixed diameter

Incremental Clustering Model
[Charikar et al. 1997]


Extension to HAC as follows:


Incremental Clustering: “for an update sequence of n points in M, maintain a collection of k clusters such that as each one is presented, either it is assigned to one of the current k clusters or it starts off a new cluster while two existing clusters are merged into one.”


Maintains a hierarchical agglomerative clustering (HAC) of the points added up to the current time.

M. Charikar, C. Chekuri, T. Feder, R. Motwani, “Incremental Clustering and Dynamic Information Retrieval”, Proc. 29th Annual ACM Symposium on Theory of Computing, 1997.

Doubling Algorithm (a = b = 2)

1. Assign the first k+1 points to k+1 clusters, with each point as centroid; d1 = distance between the closest two points.

2. Do while more points:

   1. d_{t+1} = b·d_t

   2. Merge clusters until every old cluster is contained in some new cluster:

      1. Pick an arbitrary cluster; merge with it all clusters within d_{t+1} of its center

      2. Remove the selected clusters from the old clusters

      3. Calculate the centroid for the new cluster

   3. Update clusters while the number of clusters <= k:

      1. Assign each new point to the closest cluster if it is within a·d_{t+1} of that cluster's center; otherwise create a new cluster.
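A minimal Python sketch of these steps, assuming 2-D points and Euclidean distance (the model itself works in an arbitrary metric space M); the function and field names are illustrative, not from the paper:

    import math

    def doubling_clusters(points, k, a=2.0, b=2.0):
        """Sketch of the doubling algorithm with a = b = 2.

        points: list of 2-D tuples, presented in arrival order.
        Returns a list of clusters, each {'center': (x, y), 'points': [...]}.
        """
        points = list(points)
        if len(points) <= k:                       # fewer points than clusters: trivial
            return [{'center': p, 'points': [p]} for p in points]

        # Step 1: first k+1 points become k+1 singleton clusters;
        # d1 is the distance between the closest pair among them.
        first = points[:k + 1]
        clusters = [{'center': p, 'points': [p]} for p in first]
        d = min(math.dist(p, q) for i, p in enumerate(first) for q in first[i + 1:])

        remaining = points[k + 1:]
        while True:
            d = b * d                              # step 2.1: d_{t+1} = b * d_t

            # Step 2.2: merge until every old cluster lies inside some new cluster.
            old, merged = clusters, []
            while old:
                seed = old[0]                      # arbitrary cluster
                group = [c for c in old if math.dist(c['center'], seed['center']) <= d]
                old = [c for c in old if all(c is not g for g in group)]
                pts = [p for c in group for p in c['points']]
                centroid = tuple(sum(coord) / len(pts) for coord in zip(*pts))
                merged.append({'center': centroid, 'points': pts})
            clusters = merged

            # Step 2.3: add arriving points while we hold at most k clusters.
            while remaining and len(clusters) <= k:
                p = remaining.pop(0)
                best = min(clusters, key=lambda c: math.dist(c['center'], p))
                if math.dist(best['center'], p) <= a * d:
                    best['points'].append(p)       # assign to the closest cluster
                else:
                    clusters.append({'center': p, 'points': [p]})
            if not remaining and len(clusters) <= k:
                return clusters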

[Figure: Example plot of incremental data: points 1-16 on a 0-50 x 0-50 grid]

[Figure: Doubling merge example, d2 = 24.08]

[Figure: Doubling update example, d2 = 24.08 (shown over three slides)]

[Figure: Doubling solution example]

Clique Partition Background


A clique in G = (V, E) is a subset V’ of V s.t. every two vertices in V’ are joined by an edge in E.

A clique partition for G is a partition of V into disjoint subsets V1…Vk s.t. for 1 <= i <= k, the subgraph induced by Vi is a complete graph.
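As a quick illustration (a hypothetical toy graph, not taken from the slides), this sketch checks whether a given partition of V is a clique partition:

    def is_clique_partition(vertices, edges, parts):
        """True if `parts` partitions `vertices` and every part induces a complete subgraph."""
        edge_set = {frozenset(e) for e in edges}
        covered = [v for part in parts for v in part]
        if sorted(covered) != sorted(vertices):            # disjoint and covering V
            return False
        return all(frozenset((u, v)) in edge_set
                   for part in parts
                   for i, u in enumerate(part)
                   for v in part[i + 1:])

    # Toy graph: {1,2,3} and {4,5} each induce a complete subgraph.
    V = [1, 2, 3, 4, 5]
    E = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
    print(is_clique_partition(V, E, [[1, 2, 3], [4, 5]]))  # True
    print(is_clique_partition(V, E, [[1, 2, 4], [3, 5]]))  # False: 1-4 is not an edge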

Clique Partition Algorithm

1. Assign the first k+1 points to k+1 clusters, with each point as centroid; d1 = distance between the closest two points.

2. Do while more points:

   1. d_{t+1} = 2·d_t

   2. Merge clusters (see the sketch after this list):

      1. Compute a minimum clique partition of the d_{t+1} threshold graph

      2. Merge the clusters in each clique

      3. In each new cluster, arbitrarily assign one of the existing centers as the center for the new cluster

   3. Update clusters while the number of clusters <= k:

      1. Assign each new point to a cluster if it is within d_{t+1} of that cluster's center or of one of its sub-clusters' centers; otherwise create a new cluster.
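A sketch of the merge step under a few assumptions: cluster centers are 2-D points with Euclidean distance, and a greedy heuristic stands in for an exact minimum clique partition (which is NP-hard in general); the names and sample centers are illustrative:

    import math

    def threshold_graph(centers, d):
        """Edge between every pair of centers at distance <= d (the d_{t+1} threshold graph)."""
        n = len(centers)
        return {(i, j) for i in range(n) for j in range(i + 1, n)
                if math.dist(centers[i], centers[j]) <= d}

    def greedy_clique_partition(n, edges):
        """Greedily grow cliques over vertices 0..n-1; a heuristic, not the exact minimum."""
        adj = {i: set() for i in range(n)}
        for i, j in edges:
            adj[i].add(j)
            adj[j].add(i)
        unused, cliques = set(range(n)), []
        while unused:
            v = min(unused)                           # start a new clique
            clique = [v]
            for u in sorted(unused - {v}):
                if all(u in adj[w] for w in clique):  # u adjacent to every clique member
                    clique.append(u)
            unused -= set(clique)
            cliques.append(clique)
        return cliques

    # Merge step: clusters in the same clique are merged; one existing center is kept.
    centers = [(5, 5), (10, 8), (40, 42), (44, 45), (25, 20)]
    cliques = greedy_clique_partition(len(centers), threshold_graph(centers, 12.04))
    merged_centers = [centers[c[0]] for c in cliques]  # arbitrarily keep the first member's center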

[Figure: CP merge example, d1 = 12.04 (points 1-16 on a 0-50 x 0-50 grid)]

[Figure: CP update example, d2 = 24.08]

Web Document Clustering Applications

Organizing search engine retrieval results

Meta-search engine that hierarchically clusters results: Vivisimo

Meta-search engine that graphically displays clusters of results: Kartoo

Detecting redundancy (e.g., mirror sites or moved or re-formatted documents)

User interest profiles (aka filtering)

Vivisimo: Result Organization

Kartoo: Visual Clustering

Detecting Mirrors/Subsumed Web Documents

Resemblance assesses the similarity between two documents.

Containment assesses the degree to which document A is contained in document B.

A.Z. Broder, S.C. Glassman, M.S. Manasse, G. Zweig, “Syntactic Clustering of the Web”, Proceedings of WWW6, 1997.

r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|

c(A, B) = |S(A) ∩ S(B)| / |S(A)|


Computing R and C


S(D, w) is the set of all unique contiguous subsequences (shingles) of length w in document D.

S(D) is S(D, w) for a fixed size w.

To reduce the storage and computation, we can sample the shingles for each doc:

First s: MIN_s(W)

Every mth: MOD_m(W)
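A small Python sketch of S(D, w), r, and c, assuming word-level shingles; the function names and the example shingle size are illustrative:

    def shingle_set(text, w=4):
        """S(D, w): all unique contiguous subsequences of w words in the document."""
        words = text.split()
        return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

    def resemblance(a, b, w=4):
        """r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|"""
        sa, sb = shingle_set(a, w), shingle_set(b, w)
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    def containment(a, b, w=4):
        """c(A, B) = |S(A) ∩ S(B)| / |S(A)|"""
        sa, sb = shingle_set(a, w), shingle_set(b, w)
        return len(sa & sb) / len(sa) if sa else 0.0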

Estimating R & C from a Portion of a Document

Keep a sketch of each document D, which consists of F(D) and/or V(D).

F(A) = MIN_s(π(S(A)))

V(A) = MOD_m(π(S(A)))

r(A, B) ≈ |MIN_s(F(A) ∪ F(B)) ∩ F(A) ∩ F(B)| / |MIN_s(F(A) ∪ F(B))|

r(A, B) ≈ |V(A) ∩ V(B)| / |V(A) ∪ V(B)|

c(A, B) ≈ |V(A) ∩ V(B)| / |V(A)|

where π is a random permutation.
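A sketch of these sketch-based estimators, assuming hashed word shingles with a hash function standing in for the random permutation π; the parameter defaults echo the values on the next slide (w = 10, s = 50, m = 25), and all names are illustrative:

    import hashlib

    def hashed_shingles(text, w=10):
        """pi(S(D)): map each w-word shingle to a 32-bit integer (hash ~ random permutation)."""
        words = text.split()
        return {int(hashlib.sha1(" ".join(words[i:i + w]).encode()).hexdigest(), 16) % (1 << 32)
                for i in range(len(words) - w + 1)}

    def min_s(values, s=50):
        """MIN_s(W): the s numerically smallest elements of W."""
        return set(sorted(values)[:s])

    def mod_m(values, m=25):
        """MOD_m(W): the elements of W equal to 0 mod m."""
        return {v for v in values if v % m == 0}

    def sketch(text, w=10, s=50, m=25):
        """The stored sketch of a document: F(D) and V(D)."""
        h = hashed_shingles(text, w)
        return min_s(h, s), mod_m(h, m)

    def est_resemblance(f_a, f_b, s=50):
        """r(A, B) ~= |MIN_s(F(A) U F(B)) ∩ F(A) ∩ F(B)| / |MIN_s(F(A) U F(B))|"""
        m = min_s(f_a | f_b, s)
        return len(m & f_a & f_b) / len(m) if m else 0.0

    def est_containment(v_a, v_b):
        """c(A, B) ~= |V(A) ∩ V(B)| / |V(A)|"""
        return len(v_a & v_b) / len(v_a) if v_a else 0.0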
















Web Clustering with R & C


w=10, m=25, s=50?, threshold=.5


Pre-process documents:

1. For each doc, calculate a sketch

2. Sort pairs of <shingle, docid>, removing lexically-equivalent and shingle-equivalent docs

3. Compute the list of doc pairs with their number of shared shingles, ignoring very common shingles

4. Generate clusters:

   1. if r(A,B) > threshold, then add link A<->B

   2. Produce connected components using union-find (see the sketch below)
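A minimal union-find sketch for step 4.2, assuming integer doc IDs and a precomputed list of document pairs whose resemblance exceeds the threshold; the names are illustrative:

    def resemblance_clusters(doc_ids, similar_pairs):
        """Connected components over the 'r(A, B) > threshold' links via union-find."""
        parent = {d: d for d in doc_ids}

        def find(x):                         # find with path halving
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        def union(x, y):
            parent[find(x)] = find(y)

        for a, b in similar_pairs:           # each above-threshold pair links two docs
            union(a, b)

        clusters = {}
        for d in doc_ids:
            clusters.setdefault(find(d), []).append(d)
        return list(clusters.values())

    # e.g. resemblance_clusters([1, 2, 3, 4], [(1, 2), (2, 3)]) -> [[1, 2, 3], [4]]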




Web Clustering Results 1997


30M web pages, 150 GBytes


600M shingles


3.6M clusters of 12.3M docs


2.1M clusters of 5.3M identical docs


Took 10.5 CPU days to compute

Web Applications of Resemblance Clusters

Find URL similar to …

relies on a fixed threshold and requires the URLs to have been processed


WWW Lost and Found


requires keeping some historical sketch info


Remove similar docs from search results