Very Large-Scale Incremental Clustering


Berk Berker

Mumin Cebe

Ismet Zeki Yalniz


27 March 2007



Table of Contents


Why Clustering?


Why Incremental Clustering?


Related Work


Incremental C3M (C2ICM)


A Former Implementation of C2ICM for Very Large Datasets


Conclusion

Why Clustering?


It is an effective tool to manage information overload

To browse large document collections quickly

To easily grasp the distinct topics and subtopics (concept hierarchies)

To allow search engines to efficiently query large document collections





Types of Clustering



Hierarchical vs. Non-hierarchical


Partitional vs. Agglomerative


Deterministic vs. Probabilistic algorithms


Incremental vs. Batch algorithms





Why Incremental Clustering?



The current information explosion



Popular sources of informational text documents, such as newswire and blogs, change continuously

Delays in clustering new documents would be unacceptable in several important areas




Related Work



The cluster-splitting approach

Adaptive clustering based on user queries

The Cobweb algorithm

Hierarchical clustering in an incremental manner


C2ICM Algorithm

C3M (the cover coefficient-based clustering methodology) is known as an efficient, effective and robust algorithm for clustering documents

C3M is well developed for initial clustering, but the clusters must also be maintained as the collection changes

Like C3M, the C2ICM algorithm is based on the cover coefficient concept.

C2ICM is suitable for dynamic environments where documents are added and deleted

With C2ICM, reclustering the whole collection after each update is avoided.


C2ICM Algorithm Details


First, we compute the number of clusters and the cluster seed powers in the updated database

Then we determine the newly added documents and the documents of falsified clusters
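
The quantities named in the first step can be sketched in code. This is a minimal sketch, not the original implementation. It assumes the definitions usually given for C3M (Can and Ozkarahan, 1990): for a document-by-term matrix D, the cover coefficient is c_ij = alpha_i * sum_k(d_ik * beta_k * d_jk), where alpha_i and beta_k are the reciprocals of the row and column sums; the estimated number of clusters is the sum of the diagonal entries c_ii; and the seed power of document i in a binary matrix is c_ii * (1 - c_ii) * sum_k(d_ik). The formulas as written here, the binary matrix, and all function names are assumptions for illustration and should be checked against the paper.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def cover_coefficients(D: Matrix) -> Matrix:
    """Return the m x m matrix C with c[i][j] = alpha_i * sum_k d[i][k] * beta_k * d[j][k].

    Assumes every document (row) and every term (column) has a nonzero sum.
    """
    m, n = len(D), len(D[0])
    alpha = [1.0 / sum(row) for row in D]                             # reciprocal row sums
    beta = [1.0 / sum(D[i][k] for i in range(m)) for k in range(n)]   # reciprocal column sums
    return [
        [alpha[i] * sum(D[i][k] * beta[k] * D[j][k] for k in range(n)) for j in range(m)]
        for i in range(m)
    ]


def cluster_count_and_seed_powers(D: Matrix) -> Tuple[float, List[float]]:
    """Estimate the number of clusters and the seed power of each document."""
    C = cover_coefficients(D)
    delta = [C[i][i] for i in range(len(D))]   # how much each document covers itself
    n_c = sum(delta)                           # estimated number of clusters
    powers = [delta[i] * (1.0 - delta[i]) * sum(D[i]) for i in range(len(D))]
    return n_c, powers


# Toy collection (hypothetical): documents 0 and 1 share both of their terms,
# document 2 has a term of its own.
toy_D = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
toy_n_c, toy_powers = cluster_count_and_seed_powers(toy_D)
```

For the toy matrix, the estimated cluster count is 2 and the third document has zero seed power, because a document that shares no terms with any other covers only itself.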

How do clusters become false?

When the seed document of a cluster becomes a non-seed or is deleted

When one or more non-seed documents of that cluster become seeds
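
The two conditions above amount to a comparison between the existing clusters and the new seed list. The following is a minimal sketch, assuming clusters are stored as a mapping from each seed to its member set. The function name, parameter names, and data layout are hypothetical, and sending the ragbag members back for reclustering is an assumption suggested by the worked example in this deck rather than something the slides state as a rule. The document IDs mirror that example.

```python
from typing import Dict, Iterable, Set, Tuple


def documents_to_recluster(
    clusters: Dict[str, Set[str]],
    new_seeds: Set[str],
    new_docs: Iterable[str],
    ragbag: Iterable[str],
    deleted: Iterable[str] = (),
) -> Tuple[Set[str], Set[str]]:
    """Return (seeds of falsified clusters, set of documents to be clustered)."""
    deleted = set(deleted)
    falsified = set()
    for seed, members in clusters.items():
        # Condition 1: the seed became a non-seed or was deleted.
        seed_lost = seed not in new_seeds or seed in deleted
        # Condition 2: a non-seed member of the cluster became a seed.
        member_promoted = any(d in new_seeds for d in members if d != seed)
        if seed_lost or member_promoted:
            falsified.add(seed)
    to_cluster = set(new_docs) | set(ragbag)
    for seed in falsified:
        to_cluster |= clusters[seed]
    return falsified, to_cluster - deleted


# State taken from the example slides (each cluster includes its seed).
example_clusters = {
    "d1": {"d1", "d2", "d3", "d4", "d5", "d7"},
    "d6": {"d6", "d8", "d9", "d10", "d11", "d15"},
    "d12": {"d12", "d13", "d14", "d16", "d17", "d18"},
}
falsified, to_cluster = documents_to_recluster(
    example_clusters,
    new_seeds={"d1", "d6", "d13", "d19"},   # d12 is no longer a seed
    new_docs={"d20", "d21", "d22"},
    ragbag={"d19"},
)
```

Only the cluster of d12 is falsified here, so the clusters of d1 and d6 are left untouched.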

We cluster these documents by assigning each one to the cluster of the seed that covers it most

Documents that do not belong to any cluster are grouped into the ragbag cluster
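
The assignment step can be sketched as follows. This is a minimal sketch under stated assumptions, not the original implementation: `cover(d, s)` is a hypothetical callable returning the extent to which seed `s` covers document `d`, and a document whose best coverage is zero is sent to the ragbag cluster, which is an assumed reading of "does not belong to any cluster".

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def assign_to_seeds(
    docs: Iterable[str],
    seeds: List[str],
    cover: Callable[[str, str], float],
) -> Tuple[Dict[str, Set[str]], Set[str]]:
    """Assign each document to the seed that covers it most; uncovered ones go to the ragbag."""
    clusters = {s: {s} for s in seeds}   # every seed heads its own cluster
    ragbag: Set[str] = set()
    for d in docs:
        if d in clusters:                # seeds are already placed
            continue
        best = max(seeds, key=lambda s: cover(d, s))   # first seed wins on ties
        if cover(d, best) > 0:
            clusters[best].add(d)
        else:
            ragbag.add(d)
    return clusters, ragbag


# Hypothetical coverage values for three documents and two seeds.
coverage = {("a", "s1"): 0.4, ("a", "s2"): 0.1, ("c", "s1"): 0.2, ("c", "s2"): 0.3}
demo_clusters, demo_ragbag = assign_to_seeds(
    ["a", "b", "c", "s1"],
    ["s1", "s2"],
    lambda d, s: coverage.get((d, s), 0.0),
)
```

In this toy run, `a` joins the cluster of `s1`, `c` joins the cluster of `s2`, and `b`, which no seed covers, ends up in the ragbag.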


C2ICM: An Example

Current state of the clusters

Cluster of seed d1: d2, d3, d4, d5, d7
Cluster of seed d6: d8, d9, d10, d11, d15
Cluster of seed d12: d13, d14, d16, d17, d18
Ragbag cluster: d19

Seed list: d1, d6, d12




C2ICM: CASE 1

When a seed document becomes a nonseed

Old seed list: d1, d6, d12
New documents arrived: d20, d21, d22
New seed list: d1, d6, d13, d19

Seed document d12 becomes a nonseed

The set of documents to be clustered: d12, d13, d14, d16, d17, d18, d19, d20, d21, d22

Final clusters

Cluster of seed d1: d2, d3, d4, d5, d7
Cluster of seed d6: d8, d9, d10, d11, d15
Cluster of seed d13: d12, d16, d18, d20
Cluster of seed d19: d14, d17, d21, d22
No elements remaining in the ragbag cluster

C2ICM: CASE 2

When a nonseed document in a cluster becomes a seed

Old seed list: d1, d6, d12
New documents arrived: d20, d21, d22
New seed list: d1, d6, d12, d14

Nonseed document d14 becomes a seed.

The set of documents to be clustered: d12, d13, d14, d16, d17, d18, d19, d20, d21, d22

Final clusters

Cluster of seed d1: d2, d3, d4, d5, d7
Cluster of seed d6: d8, d9, d10, d11, d15
Cluster of seed d12: d13, d16, d18, d20, d22
Cluster of seed d14 (the new seed): d17, d19, d21
No elements remaining in the ragbag cluster

A Former Implementation of C2ICM for Very Large Datasets

C2ICM was implemented as two programs (VS Pascal)

Program I selects the seeds

Program II clusters the documents by using the C2ICM algorithm.

These programs communicate by exchanging files.

[Diagram: Program I (Seed Selector) and Program II (C2ICM), connected by text files; labels: documents, clusters]


Former Experiments



C2ICM was tested in 1995 with a subset of the MARIAN database (~43K documents).

Six experiments were run. Each incremental update added ~6K documents to an initially clustered collection of a different size


Results for the Former Experiments



C2ICM provides time savings over reclustering the whole collection

The clusters formed with C2ICM were very similar to the clusters formed with C3M


Conclusion


The cluster maintenance problem is challenging

Our aim is to conduct experiments with C2ICM on a very large number of documents (i.e., millions of documents)

The HARD dataset will be used for evaluation. Information retrieval performance will be measured.

The implementation of C2ICM must be time and memory efficient.


References


Can, F., Ozkarahan, E. A. "Concepts and effectiveness of the cover coefficient-based clustering methodology for text databases." ACM Transactions on Database Systems. Vol. 15, No. 4 (December, 1990), pp. 483-517.

Can, F. "Incremental clustering for dynamic information processing." ACM Transactions on Information Systems. Vol. 11, No. 2 (April, 1993), pp. 143-164.

Can, F., Fox, E. A., Snavely, C. D., France, R. K. "Incremental clustering for very large document databases: initial MARIAN experience." Information Sciences. Vol. 84 (1995), pp. 101-114.

Jain, A. K., Murty, M. N., Flynn, P. J. "Data clustering: a review." ACM Computing Surveys. Vol. 31, No. 3 (September, 1999), pp. 264-323.








Questions?