
Evaluation of Clustering Techniques on DMOZ Data

Alper Rıfat Uluçınar
Rıfat Özcan
Mustafa Canım

Outline

What is DMOZ and why do we use it?
What is our aim?
Evaluation of partitioning clustering algorithms
Evaluation of hierarchical clustering algorithms
Conclusion



What is DMOZ and why do we use it?

www.dmoz.org
Another name for ODP, the Open Directory Project
The largest human-edited directory on the Internet:
5,300,000 sites
72,000 editors
590,000 categories

What is our aim?

Evaluating clustering algorithms is not easy.
1) Use DMOZ as the reference point (the ideal cluster structure).
2) Run our own clustering algorithms on the same data.
3) Finally, compare the results.

[Diagram: all DMOZ documents (websites) follow two paths: clustering algorithms such as C3M and K-Means produce computerized clusters, while human evaluation produces the DMOZ clusters; the question is how the two compare.]

A) Evaluation of Partitioning Clustering Algorithms

20,000 documents from DMOZ
Flat-partitioned data (214 folders)
We applied HTML parsing, stemming, and stop-word list elimination (a preprocessing sketch follows the figure note below)
We will apply two clustering algorithms:
C3M
K-Means

[Figures: the data before and after applying HTML parsing, stemming, and stop-word list elimination]
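
A minimal sketch of such a preprocessing step is shown below, assuming BeautifulSoup and NLTK (with its downloaded stop-word list) stand in for whatever parser and stemmer were actually used:

```python
# Hedged preprocessing sketch: HTML parsing, stop-word elimination,
# and stemming for a single document. Requires bs4 and nltk, plus
# nltk.download("stopwords") having been run once.
import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(html: str) -> list[str]:
    """Strip HTML tags, lowercase, drop stop words, and stem the rest."""
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    tokens = re.findall(r"[a-z]+", text.lower())
    return [STEMMER.stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("<html><body><h1>Open Directory</h1> of clustered websites</body></html>"))
# ['open', 'directori', 'cluster', 'websit']
```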

[Diagram: the 20,000 DMOZ documents are clustered two ways: applying C3M produces 214 and 642 clusters, while human evaluation (DMOZ) also yields 214 and 642 clusters.]

How to compare the DMOZ clusters with the C3M clusters?
Answer: the Corrected Rand coefficient.
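
For the computerized path of the diagram, K-Means is the more standard of the two algorithms; the sketch below clusters a toy corpus with scikit-learn. The corpus, k = 2, and random_state are illustrative assumptions, and C3M itself is not shown since no common library implements it.

```python
# Hedged sketch of the computerized clustering path using K-Means
# on TF-IDF vectors (C3M is not shown).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [  # toy stand-ins for preprocessed DMOZ pages
    "python programming language tutorial",
    "java programming software development",
    "football match score league",
    "basketball team season playoffs",
]

tfidf = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
print(labels)  # e.g. [0 0 1 1]: programming pages vs. sports pages
```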

Validation of Partitioning Clustering

Comparison of two clustering structures over N documents:
Clustering structure 1: R clusters
Clustering structure 2: C clusters
Metrics [1]:
Rand Index
Jaccard Coefficient
Corrected Rand Coefficient

Validation of Partitioning Clustering

For each pair of documents (d1, d2), count which of four cases holds:
Type I (frequency a): d1 and d2 are in the same cluster in both structures
Type II (frequency b): d1 and d2 are in the same cluster in structure 1 but in different clusters in structure 2
Type III (frequency c): d1 and d2 are in different clusters in structure 1 but in the same cluster in structure 2
Type IV (frequency d): d1 and d2 are in different clusters in both structures

Validation of Partitioning Clustering

Rand Index = (a + d) / (a + b + c + d)
Jaccard Coefficient = a / (a + b + c)
Corrected Rand Coefficient:
Accounts for randomness
Normalizes the Rand index so that it is 0 when the partitions are selected by chance and 1 when a perfect match is achieved
CR = (R − E(R)) / (1 − E(R))
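
A minimal sketch of these pair-counting definitions, with helper names of our own choosing:

```python
# Pair-counting indices for two flat clusterings, given as label vectors.
from itertools import combinations

def pair_counts(labels1, labels2):
    """Count the Type I-IV pair frequencies a, b, c, d."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            a += 1
        elif same1:
            b += 1
        elif same2:
            c += 1
        else:
            d += 1
    return a, b, c, d

def rand_index(labels1, labels2):
    a, b, c, d = pair_counts(labels1, labels2)
    return (a + d) / (a + b + c + d)

def jaccard(labels1, labels2):
    a, b, c, _ = pair_counts(labels1, labels2)
    return a / (a + b + c)
```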


Validation of Partitioning Clustering

Example:
Docs: d1, d2, d3, d4, d5, d6
Clustering Structure 1:
C1: d1, d2, d3
C2: d4, d5, d6
Clustering Structure 2:
D1: d1, d2
D2: d3, d4
D3: d5, d6

Validation of Partitioning Clustering

Contingency Table:

       D1   D2   D3 | Sum
C1      2    1    0 |   3
C2      0    1    2 |   3
Sum     2    2    2 |   6

a: (d1, d2), (d5, d6)
b: (d1, d3), (d2, d3), (d4, d5), (d4, d6)
c: (d3, d4)
d: the remaining 8 pairs (15 − 7)

Rand Index = (2 + 8) / 15 ≈ 0.67
Jaccard Coeff. = 2 / (2 + 4 + 1) ≈ 0.29
Corrected Rand = 0.24
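
As a sanity check, the corrected Rand coefficient is what scikit-learn ships as the adjusted Rand index; encoding the two structures as label vectors reproduces the 0.24 above:

```python
# Verify the worked example with scikit-learn's adjusted Rand index.
from sklearn.metrics import adjusted_rand_score

structure1 = [1, 1, 1, 2, 2, 2]  # C1: d1-d3, C2: d4-d6
structure2 = [1, 1, 2, 2, 3, 3]  # D1: d1,d2  D2: d3,d4  D3: d5,d6
print(round(adjusted_rand_score(structure1, structure2), 2))  # 0.24
```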




Results

Low Corrected Rand and Jaccard values (≈ 0.01)
Rand index ≈ 0.77
Possible reasons:
Noise in the data, e.g. 300 "Document Not Found" pages
The problem is inherently difficult, e.g. the Homepages category

B) Evaluation of Hierarchical Clustering Algorithms

Obtain a partitioning of DMOZ:
Determine a depth (by experiment?)
Collect the documents at that depth or deeper
Documents at shallower depths?
Ignore them…

Hierarchical Clustering: Steps

Obtain the hierarchical clusters using:
Single linkage
Average linkage
Complete linkage
Obtain a partitioning of the hierarchical clustering (see the SciPy sketch below)…
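
A hedged sketch of this step with SciPy, where random 2-D points stand in for real document vectors:

```python
# Sketch: building the three hierarchical clusterings with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
points = rng.random((8, 2))  # 8 toy objects in 2-D, Euclidean distance

Z_single = linkage(points, method="single")      # nearest-neighbour merges
Z_average = linkage(points, method="average")    # mean pairwise distance
Z_complete = linkage(points, method="complete")  # farthest-neighbour merges
print(Z_single.shape)  # (7, 4): one row per fusion level
```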

Hierarchical Clustering: Steps

One way: treat the DMOZ clusters as "queries":
For each selected cluster of DMOZ:
Find the number of "target clusters" in the computerized partitioning
Take the average
Check whether N_t < N_tr
If not, either the choice of partitioning or the hierarchical clustering did not perform well…

Hierarchical Clustering: Steps

Another way:
Compare the two partitions using an index, e.g. C-RAND…

Choice of Partition: Outline

Obtain the dendrogram using:
Single linkage
Complete linkage
Group average linkage
Ward's method

Choice of Partition: Outline

How to convert a hierarchical cluster structure into a partition?
Visually inspect the dendrogram?
Use tools from statistics?

Choice of Partition: Inconsistency Coefficient

At each fusion level:
Calculate the "inconsistency coefficient"
Utilize statistics from the previous fusion levels
Choose the fusion level at which the inconsistency coefficient is at its maximum.

Choice of Partition: Inconsistency Coefficient

Inconsistency coefficient (I.C.) at fusion level i:

c_i = (h_i − μ_z(i)) / σ_z(i),  i = 1, 2, …, N − 1

where
N: the number of objects
h_i: the height of the i-th fusion level
μ_z(i): the mean of the heights of level i and the z highest fusion levels before it
σ_z(i): the standard deviation of the heights of level i and the z highest fusion levels before it
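
SciPy computes a closely related statistic (it considers the links up to a fixed depth d below each fusion rather than the sequentially previous levels); a sketch on toy data:

```python
# Inconsistency coefficients for every fusion level of a dendrogram.
import numpy as np
from scipy.cluster.hierarchy import inconsistent, linkage

rng = np.random.default_rng(0)
Z = linkage(rng.random((8, 2)), method="single")

R = inconsistent(Z, d=2)
# Columns per level: mean height, std of heights, link count, I.C.
# (SciPy reports an I.C. of 0 where the std is 0.)
for level, row in enumerate(R, start=1):
    print(f"Level {level}: I.C. = {row[3]:.4f}")
```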



Choice of Partition: I.C. Hands on, Objects

Plot of the objects
Distance measure: Euclidean distance

[Figure: scatter plot of the eight objects]
Choice of Partition: I.C. Hands on, Single Linkage

[Figure: single-linkage dendrogram of the eight objects, fusion distances ≈ 1 to 3; levels 6 and 7 are the two highest fusions]
Choice of Partition: I.C. Single Linkage Results

Level 1: 0
Level 2: 0
Level 3: 0
Level 4: 0
Level 5: 0
Level 6: 1.1323
Level 7: 0.6434

=> Cut the dendrogram at a height between levels 5 and 6


Choice of Partition: I.C. Single Linkage Results

[Figure: the single-linkage dendrogram again, with the optimal partitioning interval marked between levels 4, 5 and level 6]
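
Cutting the dendrogram at such a height can be automated; a sketch using SciPy's fcluster, where the threshold t = 1.0 is an illustrative choice rather than a value from these slides:

```python
# Flatten a dendrogram by cutting where the inconsistency exceeds t.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
Z = linkage(rng.random((8, 2)), method="single")

labels = fcluster(Z, t=1.0, criterion="inconsistent", depth=2)
print(labels)  # flat cluster label for each of the 8 objects
```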
Choice of Partition: I.C. Hands on, Average Linkage

[Figure: average-linkage dendrogram of the eight objects, fusion distances ≈ 1 to 6; levels 6 and 7 are the two highest fusions]
Choice of Partition: I.C. Average Linkage Results

Level 1: 0
Level 2: 0
Level 3: 0.7071
Level 4: 0
Level 5: 0.7071
Level 6: 1.0819
Level 7: 0.9467

=> Cut the dendrogram at a height between levels 5 and 6


Choice of Partition: I.C. Hands on, Complete Linkage

[Figure: complete-linkage dendrogram of the eight objects, fusion distances ≈ 1 to 9; levels 6 and 7 are the two highest fusions]
Choice of Partition: I.C. Complete Linkage Results

Level 1: 0
Level 2: 0
Level 3: 0.7071
Level 4: 0
Level 5: 0.7071
Level 6: 1.0340
Level 7: 1.0116

=> Cut the dendrogram at a height between levels 5 and 6


Conclusion

Our aim is to evaluate clustering techniques on DMOZ data.
We analyze both partitioning and hierarchical clustering algorithms.
If the experiments are successful, we will run the same experiments on a larger DMOZ data set once we have downloaded it.
Otherwise, we will try other methodologies to improve our experimental results.

References

[1] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[2] T. Korenius, J. Laurikkala, M. Juhola, and K. Järvelin. Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments. Information Retrieval, 9(1). Kluwer Academic Publishers, 2006.

www.dmoz.org