# Evaluation of Clustering techniques using DMOZ data

AI and Robotics

Nov 25, 2013

Evaluation of Clustering Techniques on
DMOZ Data

Alper Rıfat Uluçınar

Rıfat Özcan

Mustafa Canım

Outline

What is DMOZ and why do we use it?

What is our aim?

Evaluation of partitioning clustering algorithms

Evaluation of hierarchical clustering algorithms

Conclusion

What is DMOZ and why do we use it?

www.dmoz.org

Also known as ODP, the Open Directory Project

The largest human-edited directory on the Internet

5,300,000 sites

72,000 editors

590,000 categories

What is our aim?

Evaluating clustering algorithms is not easy

1) We will use DMOZ as a reference point (an ideal cluster structure)

2) Run our own clustering algorithms on the same data

3) Finally, compare the results.

[Diagram: all DMOZ documents (websites) are clustered both by algorithms such as C3M and K-Means, and by human evaluation (yielding the DMOZ clusters); the question is how the two resulting cluster structures compare.]

A) Evaluation of Partitioning Clustering Algorithms

20,000 documents from DMOZ

flat partitioned data (214 folders)

We applied HTML parsing, stemming, and stop-word elimination

We will apply two clustering algorithms:

C3M

K-Means
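For reference, a minimal pure-Python sketch of K-Means (Lloyd's algorithm) on 2-D points. This is not the experimental implementation: in our setting the points would be term-weight vectors of the parsed documents, and the Euclidean distance, `k`, iteration count, and seeding shown here are illustrative assumptions.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and center update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # naive initialization from the data
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # recompute each center as the mean of its members
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster emptied out
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    labels = [min(range(k),
                  key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
              for p in points]
    return labels, centers
```

On two well-separated blobs this converges to the expected split within a few iterations.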

Before applying HTML parsing, stemming, stop-word elimination

After applying HTML parsing, stemming, stop-word elimination

[Diagram: the 20,000 DMOZ documents are clustered by C3M on one side and grouped by human evaluation (the DMOZ hierarchy) on the other, yielding structures of 214 and 642 clusters on each side.]

How do we compare the DMOZ clusters and the C3M clusters?

Validation of Partitioning Clustering

Comparison of two clustering structures

N documents

Clustering structure 1:

R clusters

Clustering structure 2:

C clusters

Metrics [1]:

Rand Index

Jaccard Coefficient

Corrected Rand Coefficient

Validation of Partitioning Clustering

For each pair of documents (d1, d2), one of four cases holds:

Type I, frequency a: d1 and d2 fall in the same cluster in both structures

Type II, frequency b: same cluster in structure 1, different clusters in structure 2

Type III, frequency c: different clusters in structure 1, same cluster in structure 2

Type IV, frequency d: different clusters in both structures

Validation of Partitioning Clustering

Rand Index = (a + d) / (a + b + c + d)

Jaccard Coefficient = a / (a + b + c)

Corrected Rand Coefficient

Accounts for randomness

Normalizes the Rand index so that it is 0 when the partitions are selected by chance and 1 when a perfect match is achieved.

CR = (R - E(R)) / (1 - E(R))

Validation of Partitioning Clustering

Example:

Docs: d1, d2, d3, d4, d5, d6

Clustering Structure 1:

C1: d1, d2, d3

C2: d4, d5, d6

Clustering Structure 2:

D1: d1, d2

D2: d3, d4

D3: d5, d6

Validation of Partitioning Clustering

Contingency Table:

|    | D1 | D2 | D3 | Σ |
|----|----|----|----|---|
| C1 | 2  | 1  | 0  | 3 |
| C2 | 0  | 1  | 2  | 3 |
| Σ  | 2  | 2  | 2  | 6 |

a: (d1, d2), (d5, d6)

b: (d1, d3), (d2, d3), (d4, d5), (d4, d6)

c: (d3, d4)

d: remaining 8 pairs (15 - 7)

Rand Index = (2 + 8) / 15 ≈ 0.67

Jaccard Coeff. = 2 / (2 + 4 + 1) ≈ 0.29

Corrected Rand = 0.24
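All three metrics can be computed directly from the four pair counts. A minimal Python sketch, where `s1`/`s2` encode the worked example above as label lists, and the corrected Rand is written in the standard adjusted-Rand form expressed through a, b, c, d:

```python
from itertools import combinations

def pair_counts(l1, l2):
    """Count the four pair types: a (together in both structures),
    b (together only in structure 1), c (together only in structure 2),
    d (separate in both)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(l1)), 2):
        same1, same2 = l1[i] == l1[j], l2[i] == l2[j]
        if same1 and same2: a += 1
        elif same1:         b += 1
        elif same2:         c += 1
        else:               d += 1
    return a, b, c, d

def rand_index(l1, l2):
    a, b, c, d = pair_counts(l1, l2)
    return (a + d) / (a + b + c + d)

def jaccard(l1, l2):
    a, b, c, _ = pair_counts(l1, l2)
    return a / (a + b + c)

def corrected_rand(l1, l2):
    # adjusted Rand: (a+b) and (a+c) are the within-cluster pair totals
    # of the two structures; expectation assumes random label matching
    a, b, c, d = pair_counts(l1, l2)
    n_pairs = a + b + c + d
    expected = (a + b) * (a + c) / n_pairs
    maximum = ((a + b) + (a + c)) / 2
    return (a - expected) / (maximum - expected)

# the worked example: C1={d1,d2,d3}, C2={d4,d5,d6} vs D1={d1,d2}, D2={d3,d4}, D3={d5,d6}
s1 = [1, 1, 1, 2, 2, 2]
s2 = [1, 1, 2, 2, 3, 3]
print(rand_index(s1, s2), jaccard(s1, s2), corrected_rand(s1, s2))  # ≈ 0.67 0.29 0.24
```

Running it on the example reproduces the slide's values: Rand ≈ 0.67, Jaccard ≈ 0.29, corrected Rand ≈ 0.24.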

Results

Low corrected Rand and Jaccard values (≈ 0.01)

Rand index ≈ 0.77

A high Rand index together with a near-zero corrected Rand suggests the agreement is largely what chance alone would produce.

Possible Reasons:

Noise in the data

The problem is difficult, e.g. the Homepages category.

B) Evaluation of Hierarchical Clustering Algorithms

Obtain a partitioning of DMOZ

Determine a depth (experiment?)

Collect documents of higher (or equal) depth at that level

Documents of lower depths?

Ignore them…

Hierarchical Clustering: Steps

Obtain the hierarchical clusters using:

Obtain a partitioning on the hierarchical cluster…

Hierarchical Clustering: Steps

One way: treat DMOZ clusters as “queries”:

For each selected cluster of DMOZ

Find the number of “target clusters” on
computerized partitioning

Take the average

See if N_t < N_tr

If not, either the choice of partitioning or the hierarchical clustering did not perform well…

Hierarchical Clustering: Steps

Another way:

Compare the two partitions using an index, e.g. C-RAND…

Choice of Partition: Outline

Obtain the dendrogram

Ward’s method

Choice of Partition: Outline

How to convert a hierarchical cluster
structure into a partition?

Visually inspect the dendrogram?

Use tools from statistics?

Choice of Partition:

Inconsistency Coefficient

At each fusion level:

Calculate the “inconsistency coefficient”

Utilize statistics from the previous fusion
levels

Choose the fusion level for which
inconsistency coefficient is at
maximum.

Choice of Partition:

Inconsistency Coefficient

Inconsistency coefficient (I.C.) at fusion
level i:

c_i = (h_i - μ_i^z) / σ_i^z

where

h_i : the height of the i-th fusion level

μ_i^z : the mean of the heights of level i and the z highest fusion levels before it

σ_i^z : the standard deviation of the heights of level i and the z highest fusion levels before it

i = 1, 2, …, N - 1 (N being the number of objects)
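The inconsistency coefficient can be sketched in a few lines of Python. This is a sketch under two assumptions not stated on the slides: the window at level i consists of level i plus at most z preceding fusion levels, and the coefficient is defined as 0 when the window has no spread (the MATLAB-style convention).

```python
import statistics

def inconsistency(heights, z=2):
    """Inconsistency coefficient c_i = (h_i - mean) / std, where mean and
    std are taken over the heights of level i and up to z previous fusion
    levels. Returns 0.0 when the window has fewer than 2 levels or zero
    standard deviation (assumed convention)."""
    coeffs = []
    for i, h in enumerate(heights):
        window = heights[max(0, i - z):i + 1]
        if len(window) < 2:
            coeffs.append(0.0)
            continue
        sd = statistics.stdev(window)  # sample standard deviation
        coeffs.append(0.0 if sd == 0 else (h - statistics.mean(window)) / sd)
    return coeffs
```

Per the slides, one then cuts the dendrogram just below the fusion level whose coefficient is maximal.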

Choice of Partition:

I.C. Hands on, Objects

Plot of the objects

Distance measure: Euclidean Distance

Choice of Partition:

[Figure: dendrogram of the objects (fusion distance vs. objects), with fusion levels 6 and 7 marked]
Choice of Partition:

| Fusion level | Inconsistency coefficient |
|---|---|
| Level 1 | 0 |
| Level 2 | 0 |
| Level 3 | 0 |
| Level 4 | 0 |
| Level 5 | 0 |
| Level 6 | 1.1323 |
| Level 7 | 0.6434 |

=> Cut the dendrogram at a height between level 5 & 6
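Cutting the dendrogram at a chosen height can be sketched with a union-find over the merge list. The `(i, j, height)` merge encoding below is an assumption borrowed from SciPy's linkage format (the merge at step s creates cluster index n + s), not the authors' data structure.

```python
def cut_dendrogram(merges, n, height):
    """Flatten a dendrogram into a partition of n objects by keeping only
    the merges whose fusion height is <= the cut height.

    merges: list of (i, j, h) triples; the merge at step s joins clusters
    i and j at height h into a new cluster with index n + s."""
    parent = list(range(n + len(merges)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for step, (i, j, h) in enumerate(merges):
        if h <= height:  # merges above the cut are discarded
            parent[find(i)] = n + step
            parent[find(j)] = n + step

    # relabel the surviving roots as consecutive cluster ids 0, 1, ...
    labels, seen = [], {}
    for p in range(n):
        labels.append(seen.setdefault(find(p), len(seen)))
    return labels
```

For the example above, cutting between levels 5 and 6 means passing a height between those two fusion distances, which keeps every merge up to level 5 and discards the rest.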

Choice of Partition:

[Figure: the same dendrogram with the optimal partitioning interval marked between levels 4/5 and 6]
Choice of Partition:

[Figure: dendrogram for a second configuration of the objects (fusion distance vs. objects), with levels 6 and 7 marked]
Choice of Partition:

| Fusion level | Inconsistency coefficient |
|---|---|
| Level 1 | 0 |
| Level 2 | 0 |
| Level 3 | 0.7071 |
| Level 4 | 0 |
| Level 5 | 0.7071 |
| Level 6 | 1.0819 |
| Level 7 | 0.9467 |

=> Cut the dendrogram at a height between level 5 & 6

Choice of Partition:

[Figure: dendrogram for a third configuration of the objects (fusion distance vs. objects), with levels 6 and 7 marked]
Choice of Partition:

| Fusion level | Inconsistency coefficient |
|---|---|
| Level 1 | 0 |
| Level 2 | 0 |
| Level 3 | 0.7071 |
| Level 4 | 0 |
| Level 5 | 0.7071 |
| Level 6 | 1.0340 |
| Level 7 | 1.0116 |

=> Cut the dendrogram at a height between level 5 & 6

Conclusion

Our aim is to evaluate clustering techniques on DMOZ data.

Analysis of partitioning & hierarchical clustering algorithms.

If the experiments are successful, we will apply the same experiments to larger DMOZ data.

Else:

We will try other methodologies to improve our experimental results.

References

[1] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[2] Korenius T., Laurikkala J., Juhola M., Jarvelin K. Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments. Information Retrieval, 9(1). Kluwer Academic Publishers, 2006.

www.dmoz.org