1
Very Large-Scale Incremental Clustering
Berk Berker
Mumin Cebe
Ismet Zeki Yalniz
27 March 2007
2
Table of Contents
Why Clustering?
Why Incremental Clustering?
Related Work
Incremental C3M (C2ICM)
A Former Implementation of C2ICM for Very Large Datasets
Conclusion
3
Why Clustering?
It is an effective tool to manage information overload
To browse large document collections quickly
To easily grasp the distinct topics and subtopics (concept hierarchies)
To allow search engines to efficiently query large document collections
4
Types of Clustering
Hierarchical vs. Non-hierarchical
Partitional vs. Agglomerative
Deterministic vs. Probabilistic algorithms
Incremental vs. Batch algorithms
5
Why Incremental Clustering?
The current information explosion
Popular sources of informational text documents such as Newswire and Blogs
Delay would be unacceptable in several important areas
6
Related Work
The cluster-splitting approach
Adaptive clustering based on user queries
Cobweb algorithm
Hierarchical clustering in an incremental manner
7
C2ICM Algorithm
C3M is known as an efficient, effective and robust algorithm for clustering documents
C3M is well developed for initial clustering, but the clusters must also be maintained as the collection changes
8
C2ICM Algorithm
The C2ICM algorithm is based on the same cover coefficient concept as C3M.
C2ICM is suitable for dynamic environments where documents are added and deleted.
With C2ICM, reclustering the whole collection for each update is avoided.
9
C2ICM Algorithm Details
First we compute the number of clusters and the cluster seed powers in the updated database
Then we determine the newly added documents and the documents of falsified clusters
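The slides do not spell out the formulas behind the first step. The sketch below follows the cover coefficient definitions in Can and Ozkarahan (1990) for a binary document-term matrix: the decoupling coefficient of a document is its self-coverage, the number of clusters is the sum of the decoupling coefficients, and the seed power is the product of decoupling, coupling, and the number of terms in the document. The function names, the toy matrix, and the rounding of the cluster count are assumptions made for illustration.

```python
def cover_coefficient(D, i, j):
    """c_ij: extent to which document i is covered by document j."""
    alpha_i = 1.0 / sum(D[i])               # reciprocal of row i's sum
    total = 0.0
    for k in range(len(D[i])):
        col_sum = sum(row[k] for row in D)  # reciprocal of beta_k
        if col_sum:
            total += D[i][k] * D[j][k] / col_sum
    return alpha_i * total


def seed_statistics(D):
    """Return decoupling coefficients, number of clusters, and seed powers."""
    m = len(D)
    delta = [cover_coefficient(D, i, i) for i in range(m)]
    # Assumption: the real-valued sum is rounded to get a cluster count.
    n_c = max(1, round(sum(delta)))
    # Seed power for a binary matrix: delta_i * (1 - delta_i) * (terms in d_i).
    power = [delta[i] * (1.0 - delta[i]) * sum(D[i]) for i in range(m)]
    return delta, n_c, power


# Hypothetical toy collection: 4 documents over 4 terms.
D = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
delta, n_c, power = seed_statistics(D)
print(delta, n_c, power)
```

In this reading, the documents with the highest seed powers would be taken as the cluster seeds of the updated database.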
10
C2ICM Algorithm Details
How do the clusters become false?
When a seed document becomes non-seed or is deleted
When one or more non-seed documents of that cluster become seeds
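A minimal sketch of the falsification rule, assuming clusters are stored as a mapping from each seed to its member documents. The function name and data layout are hypothetical. The data is taken from the example slides (Case 1 and Case 2).

```python
def documents_to_cluster(clusters, new_seeds, new_docs, deleted=frozenset()):
    """Split old clusters into kept ones and the set of documents to recluster.

    clusters maps each old seed to the set of documents in its cluster.
    """
    to_cluster = set(new_docs)
    kept = {}
    for seed, members in clusters.items():
        members = set(members) - set(deleted)
        # Rule 1: the seed was deleted or is no longer a seed.
        seed_lost = seed in deleted or seed not in new_seeds
        # Rule 2: some non-seed member of the cluster has become a seed.
        member_promoted = any(d in new_seeds for d in members if d != seed)
        if seed_lost or member_promoted:
            to_cluster |= members      # the cluster is false: release its documents
        else:
            kept[seed] = members
    return kept, to_cluster


old_clusters = {
    "d1": {"d1", "d2", "d3", "d4", "d5", "d7"},
    "d6": {"d6", "d8", "d9", "d10", "d11", "d15"},
    "d12": {"d12", "d13", "d14", "d16", "d17", "d18"},
}
new_docs = {"d19", "d20", "d21", "d22"}

# Case 1: seed d12 becomes nonseed.
kept1, set1 = documents_to_cluster(old_clusters, {"d1", "d6", "d13", "d19"}, new_docs)
# Case 2: nonseed d14 becomes seed.
kept2, set2 = documents_to_cluster(old_clusters, {"d1", "d6", "d12", "d14"}, new_docs)
print(sorted(kept1), sorted(set1))
```

In both cases only the cluster of d12 is released, so the clusters of d1 and d6 are left untouched.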
11
C2ICM Algorithm Details
We cluster these documents by assigning each one to the cluster of the seed that covers it most
Documents that do not belong to any cluster are grouped into the ragbag cluster
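The assignment step can be sketched as follows, assuming the cover coefficient of Can and Ozkarahan (1990) as the coverage measure and treating a document with zero coverage by every seed as uncovered. The function names and the toy matrix are hypothetical.

```python
def cover(D, i, j):
    """c_ij: extent to which document i is covered by document j."""
    total = 0.0
    for k in range(len(D[i])):
        col_sum = sum(row[k] for row in D)
        if col_sum:
            total += D[i][k] * D[j][k] / col_sum
    return total / sum(D[i])


def assign_to_seeds(D, seeds, docs):
    """Assign each document to the seed that covers it most, else to the ragbag."""
    clusters = {s: [s] for s in seeds}
    ragbag = []
    for i in docs:
        if i in clusters:              # seeds stay in their own clusters
            continue
        best = max(seeds, key=lambda s: cover(D, i, s))
        if cover(D, i, best) > 0:
            clusters[best].append(i)
        else:
            ragbag.append(i)           # no seed shares a term with this document
    return clusters, ragbag


# Hypothetical toy collection: 5 documents over 5 terms, seeds are documents 0 and 2.
D = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
]
clusters, ragbag = assign_to_seeds(D, seeds=[0, 2], docs=range(5))
print(clusters, ragbag)
```

Document 4 shares no term with either seed, so it ends up in the ragbag cluster.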
12
C2ICM: An example
Current state of the clusters
Cluster with seed d1: d1, d2, d3, d4, d5, d7
Cluster with seed d6: d6, d8, d9, d10, d11, d15
Cluster with seed d12: d12, d13, d14, d16, d17, d18
Ragbag cluster
Seed List: d1, d6, d12
d19
13
C2ICM: CASE 1
When a seed document becomes nonseed
Old Seed List: d1, d6, d12
New Seed List: d1, d6, d13, d19
Cluster with seed d1: d1, d2, d3, d4, d5, d7
Cluster with seed d6: d6, d8, d9, d10, d11, d15
Cluster with seed d12: d12, d13, d14, d16, d17, d18
New documents arrived: d19, d20, d21, d22
The set of documents to be clustered
14
C2ICM: CASE 1
Seed document d12 becomes nonseed
New Seed List: d1, d6, d13, d19
Cluster with seed d1: d1, d2, d3, d4, d5, d7
Cluster with seed d6: d6, d8, d9, d10, d11, d15
The set of documents to be clustered: d12, d13, d14, d16, d17, d18, d19, d20, d21, d22
15
C2ICM: CASE 1
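New Seed List: d1, d6, d13, d19
Final clusters
Cluster with seed d1: d1, d2, d3, d4, d5, d7
Cluster with seed d6: d6, d8, d9, d10, d11, d15
Cluster with seed d13: d12, d13, d16, d18, d20
Cluster with seed d19: d14, d17, d19, d21, d22
No elements remaining in the ragbag cluster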
16
C2ICM: CASE 2
When a nonseed document in a cluster becomes seed
Old Seed List: d1, d6, d12
New Seed List: d1, d6, d12, d14
Cluster with seed d1: d1, d2, d3, d4, d5, d7
Cluster with seed d6: d6, d8, d9, d10, d11, d15
Cluster with seed d12: d12, d13, d14, d16, d17, d18
New documents arrived: d19, d20, d21, d22
The set of documents to be clustered
17
C2ICM: CASE 2
Nonseed document d14 becomes seed.
New Seed List: d1, d6, d12, d14 (d14 becomes new seed)
Cluster with seed d1: d1, d2, d3, d4, d5, d7
Cluster with seed d6: d6, d8, d9, d10, d11, d15
The set of documents to be clustered: d12, d13, d14, d16, d17, d18, d19, d20, d21, d22
18
C2ICM: CASE 2
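New Seed List: d1, d6, d12, d14 (d14 becomes new seed)
Final clusters
Cluster with seed d1: d1, d2, d3, d4, d5, d7
Cluster with seed d6: d6, d8, d9, d10, d11, d15
Cluster with seed d12: d12, d13, d16, d18, d20, d22
Cluster with seed d14: d14, d17, d19, d21
No elements remaining in the ragbag cluster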
19
A Former Implementation of C2ICM
for Very Large Datasets
C2ICM was implemented as two programs (VS Pascal)
Program I selects the seeds
Program II clusters the documents using the C2ICM algorithm
These programs communicate by exchanging files.
Diagram: Program I (Seed Selector) and Program II (C2ICM) exchange text files; input: documents; output: clusters
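The two-program design can be illustrated with a small sketch in which one stage writes its result to a text file and the next stage reads it back. This is not the original VS Pascal code. The file name, the one-identifier-per-line format, and the function names are all assumptions.

```python
import os
import tempfile


def write_seeds(path, seeds):
    """Stage 1 (seed selector): write one seed identifier per line."""
    with open(path, "w", encoding="utf-8") as f:
        for seed in seeds:
            f.write(f"{seed}\n")


def read_seeds(path):
    """Stage 2 (clustering): read the seed list produced by stage 1."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def run_pipeline(seeds):
    """Run both stages, using a temporary text file as the only channel."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "seeds.txt")   # hypothetical file name
        write_seeds(path, seeds)
        return read_seeds(path)


print(run_pipeline(["d1", "d6", "d13", "d19"]))
```

Keeping the stages in separate programs means each one only has to hold its own working data in memory, at the cost of extra file I/O between them.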
20
Former Experiments
C2ICM was tested in 1995 with a subset of the MARIAN database (~43K documents).
Six experiments were done. Each incremental update added ~6K documents to an initially clustered collection of a different size.
21
Results for the Former Experiments
• C2ICM provides time savings
• Clusters formed with C2ICM were very similar to the clusters formed with C3M
22
Conclusion
The cluster maintenance problem is challenging
Our aim is to conduct experiments for C2ICM with a very large number of documents (i.e., millions of documents)
The HARD dataset will be used for evaluation. Information retrieval performance will be measured.
The implementation of C2ICM must be time and memory efficient.
23
References
Can, F., Ozkarahan, E. A. "Concepts and effectiveness of the cover coefficient-based clustering methodology for text databases." ACM Transactions on Database Systems. Vol. 15, No. 4 (December, 1990), pp. 483-517.
Can, F. "Incremental clustering for dynamic information processing." ACM Transactions on Information Systems. Vol. 11, No. 2 (April, 1993), pp. 143-164.
Can, F., Fox, E. A., Snavely, C. D., France, R. K. "Incremental clustering for very large document databases: initial MARIAN experience." Information Sciences. Vol. 84 (1995), pp. 101-114.
Jain, A. K., Murty, M. N., Flynn, P. J. "Data clustering: a review." ACM Computing Surveys. Vol. 31, No. 3 (September, 1999), pp. 264-323.
24
Questions?