Clustering for web documents

Nov 8, 2013

박흠


Contents

Cluto


Criterion Functions for Document Clustering: Experiments and Analysis (2002)

by Ying Zhao and George Karypis, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455

Feature selection for web documents (2004)




Cluto

Clustering Toolkit, version 2.1.1

Department of Computer Science, University of Minnesota, Minneapolis

http://www-users.cs.umn.edu/~karypis/

Platforms

Linux 2.4.18

Sun OS 5.7

Win32

Programs

CLUTO's user-callable library

vcluster

scluster

What is Cluto (1/2)

Clustering algorithms

partitional clustering

agglomerative clustering

graph-partitioning clustering

Clustering criterion functions

provides seven different criterion functions for both partitional and agglomerative clustering algorithms

provides some of the more traditional local criteria (e.g., single-link, complete-link, and UPGMA) for agglomerative clustering

What is Cluto (2/2)

Analyzes the discovered clusters

relations between the objects assigned to each cluster

relations between the different clusters

identifies the features that best describe and/or discriminate each cluster

relationships between the clusters, objects, and features

Operates on very large datasets, in terms of both

the number of objects

the number of dimensions

Programs

vcluster

operates in the object's feature space

scluster

operates in the object's similarity space

Interface

vcluster [optional parameters] MatrixFile Ncluster

MatrixFile: an n*m matrix; rows correspond to objects, columns to features

Ncluster: the number of clusters
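The slide does not show what MatrixFile looks like on disk. As a minimal sketch, assuming the sparse text format described in the CLUTO manual (a header line with the number of rows, columns, and non-zeros, then one line per row of 1-based "column value" pairs), a small helper could serialize document vectors as follows. The function name `to_cluto_sparse` and the sample data are hypothetical, and the format should be checked against the manual for your CLUTO version.

```python
def to_cluto_sparse(rows, n_cols):
    """Serialize rows in the (assumed) CLUTO sparse matrix text format.

    rows   : list of dicts mapping a 0-based column index to a non-zero value
    n_cols : total number of columns (features)
    """
    nnz = sum(len(row) for row in rows)
    # Header line: number of rows, number of columns, number of non-zeros.
    lines = [f"{len(rows)} {n_cols} {nnz}"]
    for row in rows:
        # Each row lists "column value" pairs; columns are written 1-based.
        pairs = (f"{col + 1} {val:g}" for col, val in sorted(row.items()))
        lines.append(" ".join(pairs))
    return "\n".join(lines) + "\n"


# Three toy documents over four features (hypothetical data).
docs = [{0: 2.0, 3: 1.0}, {1: 1.0}, {0: 1.0, 2: 3.0}]
matrix_text = to_cluto_sparse(docs, n_cols=4)
print(matrix_text)
```

Writing `matrix_text` to a file would give something that can be passed as the MatrixFile argument, provided the format assumption holds.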

Parameters of Algorithms

rb, rbr

k-1 repeated bisections (rbr: additionally optimizes the criterion function)

direct

computed by simultaneously finding all k clusters

agglo

the agglomerative paradigm

graph

using a nearest-neighbor graph

bagglo

Parameters of the similarity function

cos

the cosine function (default)

corr

the correlation coefficient

dist

the Euclidean distance; applicable when -clmethod=graph

jacc

the extended Jaccard coefficient; applicable when -clmethod=graph


Parameters of the criterion function

i1, i2, e1, g1, g1p, h1, h2

slink

single link

wslink

weighted single link

clink

complete link

wclink

weighted complete link

upgma

UPGMA

Other parameters

cstype

fulltree

rowmodel, colmodel

showfeatures
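To show how the parameter values above fit into the `vcluster [optional parameters] MatrixFile Ncluster` interface, here is a small sketch that builds (but does not run) an argument list. Only `-clmethod` appears literally on the slides; the option names `-sim` and `-crfun` are assumptions taken from the CLUTO manual, and the helper name `vcluster_args`, its defaults, and the file name `docs.mat` are hypothetical.

```python
# Values listed on the slides. The option names -sim and -crfun are assumed
# from the CLUTO manual and should be verified for your CLUTO version.
CLMETHODS = {"rb", "rbr", "direct", "agglo", "graph", "bagglo"}
SIMS = {"cos", "corr", "dist", "jacc"}
CRFUNS = {"i1", "i2", "e1", "g1", "g1p", "h1", "h2",
          "slink", "wslink", "clink", "wclink", "upgma"}


def vcluster_args(matrix_file, n_clusters, clmethod="rb", sim="cos", crfun="i2"):
    """Build an argument list for vcluster without executing anything."""
    if clmethod not in CLMETHODS:
        raise ValueError(f"unknown clustering method: {clmethod}")
    if sim not in SIMS:
        raise ValueError(f"unknown similarity function: {sim}")
    if crfun not in CRFUNS:
        raise ValueError(f"unknown criterion function: {crfun}")
    # The slides state that dist and jacc apply only with -clmethod=graph.
    if sim in {"dist", "jacc"} and clmethod != "graph":
        raise ValueError("dist and jacc apply only when -clmethod=graph")
    # Optional parameters come first, then MatrixFile and Ncluster.
    return ["vcluster", f"-clmethod={clmethod}", f"-sim={sim}",
            f"-crfun={crfun}", matrix_file, str(n_clusters)]


args = vcluster_args("docs.mat", 20, clmethod="rbr", crfun="h2")
print(" ".join(args))
```

Keeping the command as a list makes it easy to hand to `subprocess.run` later, once the option names have been confirmed.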


Criterion Functions for Document Clustering: Experiments and Analysis (2002)

by Ying Zhao and George Karypis, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455



Data Clustering

A.K. Jain (Michigan State University), M.N. Murty (Indian Institute of Science), and P.J. Flynn (The Ohio State University)

ACM Computing Surveys

Introduction (1/2)

Clustering algorithms

Agglomerative algorithms

UPGMA, single-link, complete-link, CURE, ROCK, Chameleon

Partitional algorithms

K-means, K-medoids, Autoclass, graph-partitioning-based, spectral-partitioning-based

well suited to large datasets because they are fast

Seven criterion functions

measure intra-cluster similarity, inter-cluster similarity, and two combinations of them: i1, i2, e1, g1, g1p, h1, h2

Introduction (2/2)

Datasets

15 different data sets

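Preliminaries (1/3)

Document Representation

uses the vector-space model for each document

$d_{tf} = (tf_1, tf_2, \ldots, tf_m)$

d: document; tf: term frequency; $tf_i$: frequency of the i-th term in the document

uses idf (tf*idf) weighting

$d_{tfidf} = (tf_1 \log(N/df_1), tf_2 \log(N/df_2), \ldots, tf_m \log(N/df_m))$

N: total number of documents; $df_i$: number of documents that contain the i-th term

Similarity Measures

The similarity between two documents $d_i$ and $d_j$

Cosine function

$\cos(d_i, d_j) = \frac{d_i^t d_j}{\|d_i\| \, \|d_j\|}$

$\|d\|$: the length of the document vector, used to normalize it

1: identical; 0: nothing in common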
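The representation and the cosine measure above can be tried on a toy corpus. The following is a minimal sketch in plain Python, assuming the common tf * log(N/df) weighting; the function names and the three sample documents are made up for illustration.

```python
import math


def tfidf_vectors(docs_tf):
    """Turn raw term-frequency vectors into tf*idf vectors (tf * log(N/df))."""
    n_docs = len(docs_tf)
    n_terms = len(docs_tf[0])
    # df[i]: number of documents in which term i occurs at least once.
    df = [sum(1 for d in docs_tf if d[i] > 0) for i in range(n_terms)]
    return [
        [tf * math.log(n_docs / df[i]) if df[i] else 0.0
         for i, tf in enumerate(d)]
        for d in docs_tf
    ]


def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical term-frequency vectors for three documents and three terms.
docs_tf = [[2, 1, 0], [1, 0, 3], [0, 0, 1]]
vectors = tfidf_vectors(docs_tf)

print(cosine(vectors[0], vectors[0]))  # a document compared with itself
print(cosine(vectors[0], vectors[1]))  # documents sharing one term
print(cosine(vectors[0], vectors[2]))  # documents sharing no term
```

The first and last comparisons correspond to the two extremes named on the slide: identical documents and documents with nothing in common.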

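Preliminaries (2/3)

Euclidean distance function

$dis(d_i, d_j) = \sqrt{(d_i - d_j)^t (d_i - d_j)} = \|d_i - d_j\|$

if dis = 0 the documents are identical; if dis = $\sqrt{2}$ (for unit-length vectors) they have nothing in common

Definitions

$S$: the set of documents

$S_1, S_2, \ldots, S_k$: the sets of documents in each of the k clusters

k: the number of clusters

$n_1, n_2, \ldots, n_k$: the sizes (numbers of documents) of the corresponding clusters

A: a set of documents

composite vector $D_A = \sum_{d \in A} d$: the sum of all document vectors in A

centroid vector $C_A = D_A / |A|$: the average of the term weights of the documents in A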
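As a quick illustration of the two definitions, the sketch below computes the composite vector (element-wise sum) and the centroid vector (element-wise mean) of a small set of document vectors. The function names and the sample set `A` are hypothetical.

```python
def composite(docs):
    """Composite vector: element-wise sum of all document vectors."""
    return [sum(column) for column in zip(*docs)]


def centroid(docs):
    """Centroid vector: the composite vector divided by the number of documents."""
    return [x / len(docs) for x in composite(docs)]


# A hypothetical set A of three documents over two terms.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
D_A = composite(A)
C_A = centroid(A)
print(D_A, C_A)
```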

Preliminaries (3/3)

Vector Properties

$S_i$, $S_j$: two sets of documents containing $n_i$ and $n_j$ documents

$D_i$, $D_j$: their composite vectors; $C_i$, $C_j$: their centroid vectors

The sum of the pairwise similarities between the documents in $S_i$ and the documents in $S_j$ is $D_i^t D_j$

The sum of the pairwise similarities between the documents in $S_i$ is $\|D_i\|^2$
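Both properties follow from the linearity of the dot product, provided the document vectors have unit length so that the cosine reduces to a dot product. The sketch below checks them numerically on made-up vectors; all names and data are illustrative, and the within-set sum counts every ordered pair, including each document with itself.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def composite(docs):
    """Element-wise sum of all document vectors."""
    return [sum(column) for column in zip(*docs)]


# Two hypothetical sets of unit-length document vectors.
S_i = [unit([1, 2, 0]), unit([0, 1, 1])]
S_j = [unit([3, 0, 1]), unit([1, 1, 1]), unit([0, 2, 5])]
D_i, D_j = composite(S_i), composite(S_j)

# Sum of cosines over all cross pairs versus the dot product of composites.
cross_pairs = sum(dot(a, b) for a in S_i for b in S_j)
# Sum of cosines over all ordered pairs inside S_i versus ||D_i||^2.
within_pairs = sum(dot(a, b) for a in S_i for b in S_i)

print(cross_pairs, dot(D_i, D_j))
print(within_pairs, dot(D_i, D_i))
```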


Criterion Functions (1/5)

Internal Criterion Functions

I1: maximizes the sum of the average pairwise similarities between the documents assigned to each cluster, using the cosine function

I1 is similar to the criterion used by hierarchical agglomerative clustering with the group-average heuristic to decide which clusters to merge

I2: uses the cosine function between each document and the centroid of its cluster; this is the criterion of the vector-space variant of the K-means algorithm

$C_r$: the centroid vector of cluster r
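The formulas for I1 and I2 did not survive on this slide. As I recall them from the Zhao and Karypis report, for unit-length document vectors they reduce to $I_1 = \sum_{r=1}^{k} \|D_r\|^2 / n_r$ and $I_2 = \sum_{r=1}^{k} \|D_r\|$, both to be maximized; treat these forms as an assumption to check against the report. The sketch below evaluates them for two hand-made clusterings of the same three documents; the function names and data are hypothetical.

```python
import math


def composite(docs):
    """Element-wise sum of all document vectors."""
    return [sum(column) for column in zip(*docs)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def i1(clusters):
    """Assumed form of I1: sum over clusters of ||D_r||^2 / n_r."""
    return sum(norm(composite(c)) ** 2 / len(c) for c in clusters)


def i2(clusters):
    """Assumed form of I2: sum over clusters of ||D_r||."""
    return sum(norm(composite(c)) for c in clusters)


# Three unit-length documents; two of them are identical.
a, b, c = (1.0, 0.0), (1.0, 0.0), (0.0, 1.0)
coherent = [[a, b], [c]]   # identical documents share a cluster
mixed = [[a, c], [b]]      # unrelated documents share a cluster

print(i1(coherent), i1(mixed))
print(i2(coherent), i2(mixed))
```

Under these assumed definitions both functions score the coherent clustering higher than the mixed one, which is the behavior a maximized internal criterion should show.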


Criterion Functions (2/5)

External Criterion Functions. E1, E2

optimize a function based on how the clusters differ from each other

an external function can be derived by requiring the centroid vectors of the different clusters to be as orthogonal as possible

C: the centroid vector of the entire document collection

D: the composite vector of the entire document collection; 1/||D|| is constant
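The E1 formula itself is missing from the slide. A reconstruction consistent with the symbols defined here, and with the remark that 1/||D|| is constant, is sketched below; it follows my reading of the Zhao and Karypis report and should be checked against it. The cosine between centroids equals the cosine between composites because the scaling factors cancel.

```latex
\begin{align*}
\mathcal{E}_1
  &= \sum_{r=1}^{k} n_r \cos(C_r, C)
   = \sum_{r=1}^{k} n_r \frac{D_r^t D}{\|D_r\| \, \|D\|} \\
  &\Rightarrow \quad \text{minimize} \quad
     \sum_{r=1}^{k} n_r \frac{D_r^t D}{\|D_r\|}
     \qquad \text{(dropping the constant factor } 1/\|D\| \text{)}
\end{align*}
```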


Criterion Functions (3/5)

defined with the Euclidean distance function

Hybrid Criterion Functions. H1, H2

maximize the similarity of the documents in each cluster while minimizing the similarity between each cluster's documents and the entire collection

H1

combines criterion functions I1 and E1

Criterion Functions (4/5)

H2

combines criterion functions I2 and E1
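The slides do not show how the functions are combined. In the Zhao and Karypis report, as I recall it, each hybrid is the ratio of an internal function to E1, so that maximizing it raises within-cluster similarity while lowering similarity to the whole collection. Assuming $I_1 = \sum_r \|D_r\|^2 / n_r$, $I_2 = \sum_r \|D_r\|$, and E1 without its constant factor, this would read:

```latex
\begin{align*}
\mathcal{H}_1 &= \frac{\mathcal{I}_1}{\mathcal{E}_1}
  = \frac{\sum_{r=1}^{k} \|D_r\|^2 / n_r}
         {\sum_{r=1}^{k} n_r \, D_r^t D / \|D_r\|}
  \qquad \text{(maximize)} \\
\mathcal{H}_2 &= \frac{\mathcal{I}_2}{\mathcal{E}_1}
  = \frac{\sum_{r=1}^{k} \|D_r\|}
         {\sum_{r=1}^{k} n_r \, D_r^t D / \|D_r\|}
  \qquad \text{(maximize)}
\end{align*}
```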




Graph-Based Criterion Functions

an alternative way to view the relations between documents is to use graphs

G1: based on a graph obtained by computing the pairwise similarities between the documents

G2: based on a graph obtained by computing the pairwise similarities between the documents and the terms

S: the given collection of n documents

$G_s$: the similarity graph



Criterion Functions (5/5)

G1.
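The G1 formula is missing here. As I recall it from the Zhao and Karypis report, G1 is a MinMaxCut-style objective on the similarity graph: the sum over clusters of the edge cut between a cluster and the rest, divided by the similarity inside the cluster, which for unit-length vectors and cosine similarity becomes $\sum_{r=1}^{k} D_r^t (D - D_r) / \|D_r\|^2$ and is minimized. The sketch below evaluates that assumed form on made-up data; names and vectors are hypothetical.

```python
def composite(docs):
    """Element-wise sum of all document vectors."""
    return [sum(column) for column in zip(*docs)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def g1(clusters):
    """Assumed form of G1: sum over clusters of D_r . (D - D_r) / ||D_r||^2."""
    all_docs = [d for cluster in clusters for d in cluster]
    D = composite(all_docs)
    total = 0.0
    for cluster in clusters:
        D_r = composite(cluster)
        rest = [x - y for x, y in zip(D, D_r)]   # composite of S - S_r
        total += dot(D_r, rest) / dot(D_r, D_r)  # cut / within-cluster similarity
    return total


# Three unit-length documents; two of them are identical.
a, b, c = (1.0, 0.0), (1.0, 0.0), (0.0, 1.0)
coherent = [[a, b], [c]]
mixed = [[a, c], [b]]

print(g1(coherent), g1(mixed))
```

Because G1 is minimized, the coherent clustering, whose clusters have no similarity to each other, gets the lower value.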





G2.




Experimental Results

Direct k-way clustering


Data Sets

the Natural Science category in the Naver directory (http://dir.naver.com)

6 subcategories in the corpus

1,215 documents, 17,223 terms, 20 clusters

5 features per document, idf weighting

Sub Category     No. of Docs.
Physics          102
Earth science    149
Biology          426
Astronomy        323
Mathematics      102
Chemistry        113
Total            1,215

Experimental parameters

Algorithms

rb, rbr

k-1 repeated bisections (rbr: additionally optimizes the criterion function)

direct

computed by simultaneously finding all k clusters

agglo

the agglomerative paradigm

graph

using a nearest-neighbor graph

Criterion Functions

i1, i2, e1, g1, g1p, h1, h2, clink, slink

Similarity Functions

cosine measure


Experimental results

Entropy

         rb     rbr    direct  agglo   graph
I1       .464   .452   .490    .642
I2       .379   .375   .374    .564
E1       .388   .398   .416    .540
G1       .389   .418   .398    .895
G1p      .326   .366   .391    .562
H1       .386   .392   .386    .541
H2       .348   .352   .367    .559
clink                          .761
slink                          .895
cut                                    .417
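The slides report entropy without defining it. A common definition in this literature, assumed here, scores each cluster by the normalized entropy of its class distribution and averages the scores weighted by cluster size, so 0 means every cluster contains a single class and lower values are better. The sketch below applies that definition to a made-up contingency table; the function name and data are hypothetical.

```python
import math


def clustering_entropy(table):
    """Size-weighted average of per-cluster class entropies.

    table[r][i] is the number of documents of class i assigned to cluster r.
    Each cluster's entropy is divided by log(q), q being the number of
    classes, so the result lies between 0 and 1. Requires q > 1.
    """
    n = sum(sum(row) for row in table)
    q = len(table[0])
    total = 0.0
    for row in table:
        n_r = sum(row)
        if n_r == 0:
            continue  # an empty cluster contributes nothing
        e_r = -sum((c / n_r) * math.log(c / n_r) for c in row if c > 0)
        total += (n_r / n) * (e_r / math.log(q))
    return total


# Two clusters, two classes: the first cluster is pure, the second is evenly mixed.
table = [[4, 0], [2, 2]]
print(clustering_entropy(table))
```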

[Figure: entropy chart. Recoverable labels: value axis from 0 to 0.9; clustering methods rb, rbr, direct, agglo, graph; criterion functions I1, I2, E1, G1, G1p, H1, H2, Clink, slink]

Experimental results

Purity

         rb     rbr    direct  agglo   graph
I1       .686   .690   .683    .548
I2       .772   .762   .761    .629
E1       .741   .737   .723    .647
G1       .768   .739   .752    .367
G1p      .780   .758   .758    .647
H1       .753   .744   .758    .634
H2       .780   .782   .751    .650
clink                          .458
slink                          .368
cut                                    .749
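Purity is also reported without a definition. Under the usual definition, assumed here, each cluster is credited with the size of its largest class and the credits are summed and divided by the total number of documents, so 1 means every cluster is pure and higher values are better. A minimal sketch on a made-up contingency table, with a hypothetical function name:

```python
def clustering_purity(table):
    """Fraction of documents that belong to the majority class of their cluster.

    table[r][i] is the number of documents of class i assigned to cluster r.
    """
    n = sum(sum(row) for row in table)
    return sum(max(row) for row in table) / n


# Two clusters, two classes: the first cluster is pure, the second is evenly mixed.
table = [[4, 0], [2, 2]]
print(clustering_purity(table))
```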

[Figure: purity chart. Recoverable labels: value axis from 0 to 0.8; clustering methods rb, rbr, direct, agglo, graph; criterion functions I1, I2, E1, G1, G1p, H1, H2, Clink, slink]

Best results

                     rb      rbr     direct  agglo   graph
criterion function   g1p     h2      h1      h1      cut
entropy              0.326   0.352   0.386   0.541   0.417
purity               0.780   0.782   0.758   0.634   0.749