CLUTO A Clustering Toolkit

25 Nov 2013




By Roseline Antai


What is CLUTO?



CLUTO is a software package for clustering high-dimensional datasets and for analyzing the characteristics of the resulting clusters.

Algorithms of CLUTO


vcluster

scluster


Major difference: Input format


vcluster: the actual multidimensional representation of the objects to be clustered.

scluster: the similarity matrix (or graph) between these objects.
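To make the difference between the two inputs concrete, here is a minimal sketch that starts from a feature matrix (the kind of input vcluster takes) and derives a pairwise similarity matrix (the kind of input scluster takes). The toy data and the function name are hypothetical, and cosine similarity is only one possible choice of similarity.

```python
import math

def cosine_similarity_matrix(rows):
    """Return the pairwise cosine similarities between the given row vectors."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in rows]
    sims = []
    for i, a in enumerate(rows):
        sims.append([])
        for j, b in enumerate(rows):
            dot = sum(x * y for x, y in zip(a, b))
            denom = norms[i] * norms[j]
            # Treat an all-zero row as having no similarity to anything.
            sims[i].append(dot / denom if denom else 0.0)
    return sims

# Hypothetical toy data: three objects described by two features each.
objects = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
similarities = cosine_similarity_matrix(objects)
```

The first two objects point in the same direction, so their similarity is 1, while the third is orthogonal to both.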





Calling Sequence


vcluster [optional parameters] MatrixFile NClusters

scluster [optional parameters] GraphFile NClusters
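The MatrixFile argument names a plain-text file. The sketch below formats a small dense matrix, assuming a layout with a header line holding the number of rows and columns, followed by one row of values per line. That layout is an assumption here and should be checked against the CLUTO manual. The function name and the data are hypothetical.

```python
def format_dense_matrix(rows):
    """Format rows as text: a 'nrows ncols' header, then one row per line."""
    ncols = len(rows[0])
    if any(len(row) != ncols for row in rows):
        raise ValueError("all rows must have the same number of columns")
    lines = [f"{len(rows)} {ncols}"]
    lines += [" ".join(f"{value:g}" for value in row) for row in rows]
    return "\n".join(lines) + "\n"

# Hypothetical data: three objects with two features each.
matrix_text = format_dense_matrix([[0.5, 0.25], [1.0, 0.0], [0.0, 2.0]])
```

Writing `matrix_text` to a file would give a candidate MatrixFile under the stated assumption.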


Optional Parameters


Standard specification


-paramname or -paramname=value
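The calling sequence and the parameter syntax can be combined into a small helper that assembles an argument list. This is only a sketch: the helper names and the file name `data.mat` are hypothetical, while the option names come from the examples in this deck.

```python
def format_option(name, value=None):
    """Format one optional parameter as -name or -name=value."""
    return f"-{name}" if value is None else f"-{name}={value}"

def build_command(program, options, input_file, nclusters):
    """Assemble the argument list: program, options, input file, cluster count."""
    args = [program]
    args += [format_option(name, value) for name, value in options]
    args += [input_file, str(nclusters)]
    return args

# data.mat is a hypothetical input file name.
command = build_command(
    "vcluster",
    [("clmethod", "rb"), ("sim", "cos"), ("fulltree", None)],
    "data.mat",
    4,
)
```

The resulting list can be passed to `subprocess.run(command)` if vcluster is installed and on the PATH.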



Three categories:


Clustering algorithm parameters


Reporting and Analysis parameters


Cluster Visualization parameters






Clustering algorithm parameters


Control how CLUTO computes the clustering
solution.


Examples


1. -clmethod=string (rb, agglo, direct, graph, etc.)

2. -sim=string (cos, corr, dist, jacc)

3. -crfun=string (i1, i2, etc.)

4. -fulltree
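To illustrate what a criterion function such as i2 measures, here is a minimal sketch. It assumes I2 is defined as the sum, over clusters, of the square root of the total pairwise cosine similarity within the cluster. For unit-length rows this equals the norm of each cluster's summed vector. The function names and data are hypothetical, and this is not CLUTO's implementation.

```python
import math

def normalize(vector):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def i2_criterion(rows, assignment):
    """Sum, over clusters, of the norm of the summed unit-length member vectors."""
    composites = {}
    for row, cluster in zip(rows, assignment):
        unit = normalize(row)
        total = composites.setdefault(cluster, [0.0] * len(unit))
        for k, x in enumerate(unit):
            total[k] += x
    return sum(math.sqrt(sum(x * x for x in total)) for total in composites.values())

# Hypothetical data: two objects along each axis.
rows = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [0.0, 1.0]]
good = i2_criterion(rows, [0, 0, 1, 1])   # groups similar objects together
bad = i2_criterion(rows, [0, 1, 0, 1])    # mixes dissimilar objects
```

The grouping that keeps similar objects together scores higher, which is why a clustering method would try to maximize this quantity.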






Reporting and Analysis Parameters


Control the amount of information that vcluster and scluster report about the clusters, as well as the analysis performed on the discovered clusters.



Examples

1. -clustfile=string (default is MatrixFile.clustering.NClusters, or GraphFile.clustering.NClusters for scluster)

2. -clabelfile=string (name of the file that stores the labels of the columns; used when -showfeatures, -showsummaries, or -labeltree is used)

3. -rlabelfile=string

4. -rclassfile=string (stores the class labels of the rows, i.e. the objects to be clustered)

5. -showtree

6. -showfeatures (descriptive and discriminating)

Cluster Visualization Parameters


Simple plots of the original input matrix which show
how the different objects (rows) and features
(columns) are clustered together.


Examples

1. -plottree=string: gives a graphic representation of the entire hierarchical tree.

2. -plotmatrix=string: shows how the rows of the original matrix are clustered together.

A practical example



../cluto/Linux/vcluster -clmethod=rb -sim=cos -fulltree \
    -rlabelfile=Final_Results/rlabelfile \
    -rclassfile=Final_Results/classfile \
    -showtree -plotformat=gif \
    -plottree=Final_Results/Images/PT-Final10d \
    -plotmatrix=Final_Results/Images/PM-Final10d \
    -plotclusters=Final_Results/Images/PC-Final10d \
    -showfeatures \
    Final_Results/FinalOutput10d-Vt.mat 4

roselineantai@ubuntu:~/JLSI/jlsi$ ./clusterscript.sh
********************************************************************************
vcluster (CLUTO 2.1.1) Copyright 2001-03, Regents of the University of Minnesota

Matrix Information -----------------------------------------------------------
  Name: Final_Results/FinalOutput5d-Vt.mat, #Rows: 59, #Columns: 5, #NonZeros: 295

Options ----------------------------------------------------------------------
  CLMethod=RB, CRfun=I2, SimFun=Cosine, #Clusters: 4
  RowModel=None, ColModel=None, GrModel=SY-DIR, NNbrs=40
  Colprune=1.00, EdgePrune=-1.00, VtxPrune=-1.00, MinComponent=5
  CSType=Best, AggloFrom=0, AggloCRFun=I2, NTrials=10, NIter=10


Solution ---------------------------------------------------------------------

------------------------------------------------------------------------
4-way clustering: [I2=5.41e+01] [59 of 59], Entropy: 0.473, Purity: 0.780
------------------------------------------------------------------------
cid  Size   ISim  ISdev   ESim  ESdev  Entpy  Purty | Sem Imp Deo Evo
------------------------------------------------------------------------
  0    17 +0.731 +0.207 +0.095 +0.158  0.661  0.706 |   1   2   2  12
  1    18 +0.931 +0.034 +0.327 +0.081  0.252  0.889 |   0  16   2   0
  2    13 +0.811 +0.175 +0.270 +0.145  0.570  0.692 |   9   1   3   0
  3    11 +0.902 +0.022 +0.441 +0.095  0.433  0.818 |   1   1   9   0
------------------------------------------------------------------------
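The Entropy and Purity figures in the solution table can be reproduced from the per-cluster class counts. The sketch below assumes the usual definitions: a cluster's entropy is the Shannon entropy of its class distribution divided by the logarithm of the number of classes, its purity is the fraction of its objects in its largest class, and the overall values are size-weighted averages. The function names are hypothetical.

```python
import math

def cluster_entropy(counts):
    """Entropy of one cluster's class counts, normalized by log(number of classes)."""
    n = sum(counts)
    h = -sum((c / n) * math.log(c / n) for c in counts if c > 0)
    return h / math.log(len(counts))

def cluster_purity(counts):
    """Fraction of the cluster's objects that belong to its largest class."""
    return max(counts) / sum(counts)

def weighted_average(per_cluster, sizes):
    """Average per-cluster values, weighting each cluster by its size."""
    return sum(v * s for v, s in zip(per_cluster, sizes)) / sum(sizes)

# Class counts (Sem, Imp, Deo, Evo) for clusters 0 to 3, copied from the table.
class_counts = [[1, 2, 2, 12], [0, 16, 2, 0], [9, 1, 3, 0], [1, 1, 9, 0]]
sizes = [sum(c) for c in class_counts]
entropy = weighted_average([cluster_entropy(c) for c in class_counts], sizes)
purity = weighted_average([cluster_purity(c) for c in class_counts], sizes)
```

Under these assumptions the computed values agree with the reported Entropy and Purity to within 0.001, both overall and per cluster.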

--------------------------------------------------------------------------------
4-way clustering solution - Descriptive & Discriminating Features...
--------------------------------------------------------------------------------
Cluster 0, Size: 17, ISim: 0.731, ESim: 0.095
  Descriptive: col00001 29.6%, col00005 26.6%, col00003 25.8%, col00004 12.5%, col00002 5.6%
  Discriminating: col00003 58.4%, col00004 21.0%, col00005 17.3%, col00001 2.8%, col00002 0.5%

Cluster 1, Size: 18, ISim: 0.931, ESim: 0.327
  Descriptive: col00003 44.6%, col00001 42.7%, col00005 10.5%, col00004 2.0%, col00002 0.3%
  Discriminating: col00003 62.4%, col00002 23.1%, col00005 9.1%, col00001 4.1%, col00004 1.4%

Cluster 2, Size: 13, ISim: 0.811, ESim: 0.270
  Descriptive: col00001 43.1%, col00005 31.2%, col00002 24.0%, col00004 1.5%, col00003 0.1%
  Discriminating: col00005 83.1%, col00003 10.3%, col00002 4.0%, col00001 2.2%, col00004 0.4%

Cluster 3, Size: 11, ISim: 0.902, ESim: 0.441
  Descriptive: col00001 38.6%, col00003 26.3%, col00004 17.7%, col00002 17.4%, col00005 0.0%
  Discriminating: col00004 42.7%, col00003 29.6%, col00001 15.9%, col00005 10.5%, col00002 1.2%
--------------------------------------------------------------------------------


------------------------------------------------------------------------------
Hierarchical Tree that optimizes the I2 criterion function...
------------------------------------------------------------------------------
                Sem Imp Deo Evo
------------------------------------
6
|-------0         1   2   2  12
|-5
  |-----2         9   1   3   0
  |-4
    |---3         1   1   9   0
    |---1         0  16   2   0
------------------------------------
------------------------------------------------------------------------------


Timing Information -----------------------------------------------------------
  I/O: 0.004 sec
  Clustering: 0.008 sec
  Reporting: 0.268 sec
********************************************************************************
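When many runs need to be compared, the per-cluster rows of the Solution section can be pulled out of the report with a few lines of code. The sketch below is a hypothetical parser, not part of CLUTO. It assumes the row layout shown in this report: eight numeric fields, a vertical bar, then the class counts that appear when a class file is supplied.

```python
def parse_cluster_row(line):
    """Parse one cluster row of the Solution table; return None for other lines."""
    left, sep, right = line.partition("|")
    fields = left.split()
    if not sep or len(fields) != 8:
        return None
    try:
        cid, size = int(fields[0]), int(fields[1])
        isim, isdev, esim, esdev, entropy, purity = (float(f) for f in fields[2:])
        class_counts = [int(c) for c in right.split()]
    except ValueError:
        # Header lines such as "cid Size ISim ..." end up here.
        return None
    return {
        "cid": cid, "size": size,
        "isim": isim, "isdev": isdev, "esim": esim, "esdev": esdev,
        "entropy": entropy, "purity": purity,
        "class_counts": class_counts,
    }

row = parse_cluster_row("1 18 +0.931 +0.034 +0.327 +0.081 0.252 0.889 | 0 16 2 0")
```

Lines that are not cluster rows, such as the table header or the tree section, yield `None` and can simply be skipped.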




Classfile and rlabelfile

Classfile:
Evo
Sem
Imp
Imp
Deo
Deo
Imp
Imp
Deo
Deo
Imp
Deo
Deo
Imp
Sem
Deo
Sem
Imp
Imp
Evo

rlabelfile:
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
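Both files can be generated from lists held in memory. The sketch below assumes each file holds one entry per line, in the same order as the rows of the matrix. The helper name and the temporary directory are hypothetical, and only the first five entries of each listing are used.

```python
import os
import tempfile

def write_lines(path, values):
    """Write one value per line to the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        for value in values:
            handle.write(f"{value}\n")

# First five entries of each listing in this section.
class_labels = ["Evo", "Sem", "Imp", "Imp", "Deo"]
row_labels = [0, 1, 2, 3, 4]

out_dir = tempfile.mkdtemp()
class_path = os.path.join(out_dir, "classfile")
label_path = os.path.join(out_dir, "rlabelfile")
write_lines(class_path, class_labels)
write_lines(label_path, row_labels)
```

The two paths would then be passed as the values of -rclassfile and -rlabelfile.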


Plotclusters output

The plot uses red to denote positive values and green to denote negative values. Bright red/green indicates large positive/negative values, whereas colors close to white indicate values close to zero.
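The color scheme described here can be mimicked with a simple mapping from a value to an RGB triple. This is a hypothetical sketch of the idea, not CLUTO's actual palette: the linear fade toward white and the `max_abs` scaling parameter are assumptions.

```python
def value_to_rgb(value, max_abs):
    """Map a value to (r, g, b): white at zero, red for positive, green for negative."""
    if max_abs <= 0:
        return (255, 255, 255)
    # Values beyond max_abs are clamped to the brightest color.
    intensity = min(abs(value) / max_abs, 1.0)
    fade = round(255 * (1.0 - intensity))
    if value > 0:
        return (255, fade, fade)
    if value < 0:
        return (fade, 255, fade)
    return (255, 255, 255)
```

For example, `value_to_rgb(1.0, 1.0)` gives pure red, `value_to_rgb(-1.0, 1.0)` gives pure green, and `value_to_rgb(0.0, 1.0)` gives white.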