Methods for Intelligent Systems

Lecture Notes on Clustering (III)
Davide Eynard
eynard@elet.polimi.it
Department of Electronics and Information
Politecnico di Milano

Davide Eynard - Lecture Notes on Clustering (III) - p. 1/32
Course Schedule

Date        Topic
28/03/2006  Clustering Introduction & Algorithms (I) (K-Means, Hierarchical)
11/04/2006  Clustering Algorithms (II) (Fuzzy, SOM, Gaussians, PDDP)
16/05/2006  How many clusters? (Evaluations and tuning)
20/06/2006  Monography on Text Clustering (I)
21/06/2006  Monography on Text Clustering (II) (14.15, AM2) (+ exercises)
Lecture outline

- K-Means limits
- Hierarchical algorithms limits
- DBSCAN
- Cluster Evaluation
  - Internal measures
  - External measures
- Finding the correct number of clusters
- Framework for cluster validity
K-Means limits

- Importance of choosing initial centroids
K-Means limits

- Differing sizes
- Differing density
- Non-globular shapes
K-Means: higher K

What if we tried to increase K to solve K-Means problems?
Hierarchical algorithms limits

Strengths of MIN
- Easily handles clusters of different sizes
- Can handle non-elliptical shapes

Limitations of MIN
- Sensitive to noise and outliers

Strengths of MAX
- Less sensitive to noise and outliers

Limitations of MAX
- Tends to break large clusters
- Biased toward globular clusters
Question

What if we had a dataset like this?
DBSCAN

- Density-Based Spatial Clustering of Applications with Noise
- Data points are connected through density
- Finds clusters of arbitrary shapes
- Handles noise in the dataset well
- Single scan over all the elements of the dataset
DBSCAN: background

- Two parameters define density:
  - Eps: radius
  - MinPts: minimum number of points within the specified radius
- Number of points within a specified radius:
  N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
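The Eps-neighborhood definition can be sketched in a few lines of Python (NumPy and Euclidean distance are my assumptions here; the slides leave dist unspecified, and the function name is mine):

```python
import numpy as np

def eps_neighborhood(D, p, eps):
    """N_Eps(p): indices of all points q in D with dist(p, q) <= eps.
    Note that the point p itself is always included."""
    dists = np.linalg.norm(D - D[p], axis=1)  # Euclidean distance to every point
    return np.flatnonzero(dists <= eps)
```

The size of this index set is what MinPts is compared against when deciding whether p is a core point.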
DBSCAN: background

- A point p is directly density-reachable from q with respect to (Eps, MinPts) if:
  1. p ∈ N_Eps(q)
  2. q is a core point
- A point p is density-reachable from q if there is a sequence p_1, ..., p_n (where p_1 = q and p_n = p) such that p_{i+1} is directly density-reachable from p_i for every i
- A point p is density-connected to q if there is a point o such that both p and q are density-reachable from o
DBSCAN: background

- A point is a core point if it has more than MinPts points within Eps
- A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point
- A noise point is any point that is not a core point or a border point
DBSCAN: core, border and noise points

Eps = 10, MinPts = 4
DBSCAN algorithm

- Eliminate noise points
- Perform clustering on the remaining points
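The two steps above can be sketched as a compact, self-contained implementation (a toy Python/NumPy version for illustration, not the original formulation; I use the common |N_Eps(p)| ≥ MinPts core-point convention, and I precompute all pairwise distances, which only suits small datasets):

```python
import numpy as np

def dbscan(D, eps, min_pts):
    """Toy DBSCAN sketch: returns an array of cluster ids, -1 marking noise."""
    n = len(D)
    labels = np.full(n, -2)                # -2 = unvisited, -1 = noise
    # Pairwise Euclidean distances and Eps-neighborhoods (including the point itself).
    dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]

    cluster = 0
    for i in range(n):
        if labels[i] != -2:
            continue
        if len(neighbors[i]) < min_pts:    # not a core point: noise (for now)
            labels[i] = -1
            continue
        # Grow a new cluster from the core point i.
        labels[i] = cluster
        seeds = list(neighbors[i])
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:            # previously marked noise: a border point
                labels[j] = cluster
            if labels[j] != -2:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:   # j is also a core point: expand
                seeds.extend(neighbors[j])
        cluster += 1
    return labels
```

On a dataset with two dense blobs and one far outlier, the blobs come back as two clusters and the outlier as -1.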
When DBSCAN works well

- Resistant to noise
- Can handle clusters of different shapes and sizes
Cluster Evaluation

- Every algorithm has its pros and cons
  - (Not only about cluster quality: complexity, # clusters in advance, etc.)
- As far as cluster quality is concerned, we can evaluate (or, better, validate) clusters
- For supervised classification we have a variety of measures to evaluate how good our model is
  - Accuracy, precision, recall
- For cluster analysis, the analogous question is: how can we evaluate the "goodness" of the resulting clusters?
- But most of all... why should we evaluate it?
Clusters found in random data

"Clusters are in the eye of the beholder"
Why evaluate?

- To determine the clustering tendency of the dataset, that is, to distinguish whether non-random structure actually exists in the data
- To determine the correct number of clusters
- To evaluate how well the results of a cluster analysis fit the data without reference to external information
- To compare the results of a cluster analysis to externally known results, such as externally provided class labels
- To compare two sets of clusters to determine which is better

Note:
- the first three are unsupervised techniques, while the last two require external info
- the last three can be applied to the entire clustering or just to individual clusters
Open challenges

Cluster evaluation has a number of challenges:

- a measure of cluster validity may be quite limited in the scope of its applicability
  - e.g. the dimensionality of the problem: most work has been done only on 2- or 3-dimensional data
- we need a framework to interpret any measure
  - How good is "10"?
- if a measure is too complicated to apply or to understand, nobody will use it
Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:

- Internal (unsupervised) indices: used to measure the goodness of a clustering structure without respect to external information
  - cluster cohesion vs cluster separation
  - e.g. Sum of Squared Errors (SSE)
- External (supervised) indices: used to measure the extent to which cluster labels match externally supplied class labels
  - Entropy
- Relative indices: used to compare two different clusterings or clusters
  - Often an external or internal index is used for this function, e.g. SSE or entropy
External Measures

- Entropy
  - The degree to which each cluster consists of objects of a single class
  - For cluster i we compute p_ij, the probability that a member of cluster i belongs to class j, as p_ij = m_ij / m_i, where m_i is the number of objects in cluster i and m_ij is the number of objects of class j in cluster i
  - The entropy of each cluster i is e_i = −Σ_{j=1..L} p_ij log2 p_ij, where L is the number of classes
  - The total entropy is e = Σ_{i=1..K} (m_i / m) e_i, where K is the number of clusters and m is the total number of data points
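The two entropy formulas can be checked with a short NumPy sketch. The count-matrix representation of m_ij follows the definitions above; the function name is mine:

```python
import numpy as np

def cluster_entropy(m_ij):
    """m_ij: K x L matrix of counts, m_ij[i][j] = objects of class j in cluster i.
    Returns (per-cluster entropies e_i, total entropy e)."""
    m_ij = np.asarray(m_ij, dtype=float)
    m_i = m_ij.sum(axis=1)                       # cluster sizes
    p_ij = m_ij / m_i[:, None]                   # p_ij = m_ij / m_i
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p_ij > 0, p_ij * np.log2(p_ij), 0.0)
    e_i = -terms.sum(axis=1)                     # e_i = -sum_j p_ij log2 p_ij
    e = (m_i / m_i.sum()) @ e_i                  # e = sum_i (m_i / m) e_i
    return e_i, e
```

Perfectly pure clusters give total entropy 0; a cluster split evenly between two classes contributes 1 bit.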
External Measures

- Purity
  - Another measure of the extent to which a cluster contains objects of a single class
  - Using the previous terminology, the purity of cluster i is p_i = max_j p_ij
  - The overall purity is purity = Σ_{i=1..K} (m_i / m) p_i
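Purity uses the same count matrix as entropy; a minimal sketch (function name mine):

```python
import numpy as np

def purity(m_ij):
    """Overall purity from a K x L count matrix m_ij (see the entropy slide)."""
    m_ij = np.asarray(m_ij, dtype=float)
    m_i = m_ij.sum(axis=1)                 # cluster sizes
    p_i = m_ij.max(axis=1) / m_i           # per-cluster purity: max_j p_ij
    return (m_i / m_i.sum()) @ p_i         # size-weighted average over clusters
```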
External Measures

- Precision
  - The fraction of a cluster that consists of objects of a specified class
  - The precision of cluster i with respect to class j is precision(i, j) = p_ij
- Recall
  - The extent to which a cluster contains all objects of a specified class
  - The recall of cluster i with respect to class j is recall(i, j) = m_ij / m_j, where m_j is the number of objects in class j
External Measures

- F-measure
  - A combination of both precision and recall that measures the extent to which a cluster contains only objects of a particular class and all objects of that class
  - The F-measure of cluster i with respect to class j is
    F(i, j) = 2 × precision(i, j) × recall(i, j) / (precision(i, j) + recall(i, j))
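The three definitions translate directly into code, again using the K x L count matrix m_ij (function name mine):

```python
import numpy as np

def precision_recall_f(m_ij, i, j):
    """Precision, recall and F-measure of cluster i w.r.t. class j."""
    m_ij = np.asarray(m_ij, dtype=float)
    precision = m_ij[i, j] / m_ij[i].sum()     # p_ij = m_ij / m_i
    recall = m_ij[i, j] / m_ij[:, j].sum()     # m_ij / m_j
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f
```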
External Measures: example
Internal measures: Cohesion and Separation

- Graph-based view
- Prototype-based view
Internal measures: Cohesion and Separation

- Cluster Cohesion: measures how closely related the objects in a cluster are

  cohesion(C_i) = Σ_{x∈C_i, y∈C_i} proximity(x, y)        (graph-based)
  cohesion(C_i) = Σ_{x∈C_i} proximity(x, c_i)             (prototype-based, c_i = centroid of C_i)

- Cluster Separation: measures how distinct or well-separated a cluster is from other clusters

  separation(C_i, C_j) = Σ_{x∈C_i, y∈C_j} proximity(x, y)  (graph-based)
  separation(C_i, C_j) = proximity(c_i, c_j)               (prototype-based)
  separation(C_i) = proximity(c_i, c)                      (against the overall centroid c)
Cohesion and separation example

- Cohesion is measured by the within-cluster sum of squares (SSE):
  WSS = Σ_i Σ_{x∈C_i} (x − m_i)²
- Separation is measured by the between-cluster sum of squares:
  BSS = Σ_i |C_i| (m − m_i)²
  where |C_i| is the size of cluster i, m_i its centroid, and m the overall mean
Cohesion and separation example

Consider the points 1, 2, 4, 5 (overall mean m = 3).

- K = 1 cluster:
  WSS = (1−3)² + (2−3)² + (4−3)² + (5−3)² = 10
  BSS = 4 × (3−3)² = 0
  Total = 10 + 0 = 10
- K = 2 clusters ({1, 2} and {4, 5}):
  WSS = (1−1.5)² + (2−1.5)² + (4−4.5)² + (5−4.5)² = 1
  BSS = 2 × (3−1.5)² + 2 × (4.5−3)² = 9
  Total = 1 + 9 = 10
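The worked numbers above can be reproduced with a short sketch (1-D data for simplicity; function name mine):

```python
import numpy as np

def wss_bss(points, clusters):
    """Within- and between-cluster sums of squares for 1-D data.
    clusters: list of index lists, one per cluster."""
    points = np.asarray(points, dtype=float)
    m = points.mean()                                             # overall mean
    wss = sum(((points[c] - points[c].mean()) ** 2).sum() for c in clusters)
    bss = sum(len(c) * (m - points[c].mean()) ** 2 for c in clusters)
    return wss, bss
```

With points [1, 2, 4, 5], one cluster gives (WSS, BSS) = (10, 0) and the split {1, 2} / {4, 5} gives (1, 9); the total is 10 in both cases.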
Evaluating individual clusters and objects

- So far, we have focused on the evaluation of a group of clusters
- Many of these measures, however, can also be used to evaluate individual clusters and objects
  - For example, a cluster with high cohesion may be considered better than a cluster with a lower one
- This information can often be used to improve the quality of the clustering
  - Split clusters that are not very cohesive
  - Merge clusters that are not well separated
- We can also evaluate the objects within a cluster in terms of their contribution to the overall cohesion or separation of the cluster
TheSilhouetteCoefcient

SilhouetteCoefcientcombineideasofbothcohesionand
separation,butforindividualpoints,aswellasclustersand
clusterings

Foranindividualpoint,i

Calculatea
i
=averagedistanceofitothepointsinits
cluster

Calculateb
i
=min(averagedistanceofitopointsinanother
cluster)

Thesilhouettecoefcientforapointisthengivenby
s
i
=(b
i
−a
i
)/max(a
i
,b
i
)
DavideEynard-LectureNotesonClustering(III)p.23/32
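The per-point recipe above can be sketched directly (Euclidean distance assumed, as is the requirement that every cluster contain at least two points; function name mine):

```python
import numpy as np

def silhouette(points, labels):
    """Silhouette coefficient s_i for every point.
    Assumes Euclidean distance and clusters with at least two points."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    s = np.empty(len(points))
    for i in range(len(points)):
        same = (labels == labels[i])
        same[i] = False                        # exclude i itself from a_i
        a_i = dists[i, same].mean()
        b_i = min(dists[i, labels == c].mean() # nearest other cluster, on average
                  for c in set(labels) if c != labels[i])
        s[i] = (b_i - a_i) / max(a_i, b_i)
    return s
```

Values near 1 indicate tight, well-separated clusters; values near 0 or below suggest points sitting between clusters.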
TheSilhouetteCoefcient

SilhouetteCoefcientcombineideasofbothcohesionand
separation,butforindividualpoints,aswellasclustersand
clusterings
DavideEynard-LectureNotesonClustering(III)p.23/32
Measuring Cluster Validity via Correlation

If we are given the similarity matrix for a dataset and the cluster labels from a cluster analysis of the dataset, then we can evaluate the "goodness" of the clustering by looking at the correlation between the similarity matrix and an ideal version of the similarity matrix based on the cluster labels

- Similarity/Proximity matrix
- Ideal matrix
  - One row and one column for each data point
  - An entry is 1 if the associated pair of points belongs to the same cluster
  - An entry is 0 if the associated pair of points belongs to different clusters
Measuring Cluster Validity via Correlation

- Compute the correlation between the two matrices
  - Since the matrices are symmetric, only the correlation between n(n−1)/2 entries needs to be calculated
- High correlation indicates that points that belong to the same cluster are close to each other
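The procedure above fits in a few lines: build the ideal matrix from the labels, then correlate the n(n−1)/2 upper-triangle entries (function name mine):

```python
import numpy as np

def validity_correlation(sim, labels):
    """Correlation between a similarity matrix and the ideal matrix
    induced by the cluster labels (upper-triangle entries only)."""
    sim = np.asarray(sim, dtype=float)
    labels = np.asarray(labels)
    ideal = (labels[:, None] == labels[None, :]).astype(float)  # 1 iff same cluster
    iu = np.triu_indices(len(labels), k=1)      # the n(n-1)/2 off-diagonal entries
    return np.corrcoef(sim[iu], ideal[iu])[0, 1]
```

A labeling that matches the similarity structure yields a correlation near 1; a mismatched labeling yields a much lower (or negative) value.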
Using Similarity Matrix for Cluster Validation

- Order the similarity matrix with respect to cluster labels and inspect visually
- Clusters in random data are not so crisp
Finding the Correct Number of Clusters

- Look for the number of clusters for which there is a knee, peak, or dip in the plot of the evaluation measure when it is plotted against the number of clusters
Finding the Correct Number of Clusters

- Of course, this isn't always easy...
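The knee-hunting recipe can be sketched with a hand-rolled k-means that reports SSE for each K (a toy NumPy version, purely illustrative; function name and defaults are mine):

```python
import numpy as np

def kmeans_sse(X, k, iters=30, seed=0):
    """Run plain Lloyd's k-means and return the final SSE (toy sketch)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]        # random initial centroids
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute centroids.
        assign = np.linalg.norm(X[:, None] - C[None, :], axis=2).argmin(axis=1)
        C = np.array([X[assign == j].mean(axis=0) if (assign == j).any() else C[j]
                      for j in range(k)])
    return ((X - C[assign]) ** 2).sum()

# Plot this curve against K and look for the knee:
# sse_curve = [kmeans_sse(data, k) for k in range(1, 10)]
```

On data with two well-separated blobs, SSE drops sharply from K = 1 to K = 2 and then flattens, which is the knee the slide describes.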
Framework for Cluster Validity

- We need a framework to interpret any measure
  - For example, if our measure of evaluation has the value "10", is that good, fair, or poor?
- Statistics provide a framework for cluster validity
  - The more atypical a clustering result is, the more likely it represents valid structure in the data
  - We can compare the values of an index that result from random data or clusterings to those of a clustering result: if the value of the index is unlikely, then the cluster results are valid
  - These approaches are more complicated and harder to understand
- For comparing the results of two different sets of cluster analyses, a framework is less necessary
  - However, there is the question of whether the difference between two index values is significant
Statistical Framework for SSE

- Example
  - Compare an SSE of 0.005 against three clusters in random data
  - Histogram shows the SSE of three clusters in 500 sets of random data points of size 100 distributed over the range 0.2–0.8 for x and y values
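The reference histogram described above can be regenerated with a short Monte Carlo sketch (hand-rolled k-means in NumPy, illustrative only; function names are mine):

```python
import numpy as np

def kmeans_sse(X, k, iters=30, rng=None):
    """Plain Lloyd's k-means, returning the final SSE (toy helper)."""
    rng = np.random.default_rng() if rng is None else rng
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        assign = np.linalg.norm(X[:, None] - C[None, :], axis=2).argmin(axis=1)
        C = np.array([X[assign == j].mean(axis=0) if (assign == j).any() else C[j]
                      for j in range(k)])
    return ((X - C[assign]) ** 2).sum()

def sse_null(n=100, k=3, trials=500, seed=0):
    """SSE of k clusters on `trials` random datasets of n points uniformly
    distributed over 0.2-0.8 in x and y (the slide's setup)."""
    rng = np.random.default_rng(seed)
    return np.array([kmeans_sse(rng.uniform(0.2, 0.8, (n, 2)), k, rng=rng)
                     for _ in range(trials)])

# An observed SSE of 0.005 falls far below this reference distribution,
# so a clustering with that SSE is very unlikely to come from random data.
```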
Statistical Framework for Correlation

- Correlation of incidence and proximity matrices for the K-means clusterings of the following two datasets
Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes
Bibliography

- Slides about clustering for the Data Mining course, prof. Salvatore Orlando (link)
- Tan, Steinbach, Kumar: "Introduction to Data Mining", Ch. 8
  http://www-users.cs.umn.edu/kumar/dmbook/index.php
- As usual, more info on del.icio.us