Data Warehousing and Data Mining - dEI

sentencehuddle · Data Management

20 Nov 2013

A.A. 04-05 Data Warehousing & Data Mining
Data Warehousing and Data Mining
Outline
1. Introduction and Terminology
2. Data Warehousing
3. Data Mining
   • Association rules
   • Sequential patterns
   • Classification
   • Clustering
Introduction and Terminology
Evolution of database technology:
• File processing (60s)
• Relational DBMS (70s)
• Advanced data models, e.g., object-oriented, application-oriented (80s)
• Web-based repositories (90s)
• Data warehousing and data mining (90s)
• Global/integrated information systems (2000s)
Introduction and Terminology
Major types of information systems within an organization
[Figure: pyramid of system types, ranging from many unsophisticated users to few sophisticated users]
• TRANSACTION PROCESSING SYSTEMS: Enterprise Resource Planning (ERP), Customer Relationship Management (CRM) (many unsophisticated users)
• KNOWLEDGE LEVEL SYSTEMS: office automation, groupware, content distribution systems, workflow management systems
• MANAGEMENT LEVEL SYSTEMS: Decision Support Systems (DSS), Management Information Systems (MIS)
• EXECUTIVE SUPPORT SYSTEMS (few sophisticated users)
Introduction and Terminology
Transaction processing systems:
• Support the operational level of the organization, possibly integrating the needs of different functional areas (ERP)
• Perform and record the daily transactions necessary to the conduct of the business
• Execute simple read/update operations on traditional databases, aiming at maximizing transaction throughput
• Their activity is described as OLTP (On-Line Transaction Processing)
Introduction and Terminology
Knowledge level systems: provide digital support for managing documents (office automation), user cooperation and communication (groupware), storing and retrieving information (content distribution), and automation of business procedures (workflow management)
Management level systems: support planning, controlling and semi-structured decision making at management level by providing reports and analyses of current and historical data
Executive support systems: support unstructured decision making at the strategic level of the organization
Introduction and Terminology
Multi-tier architecture for management level and executive support systems
[Figure: stack of layers, from top to bottom]
• Decision making
• Data presentation: reporting/visualization engines
• Data analysis: OLAP, data mining engines
• Data warehouses / data marts
• Data sources: transactional DB, ERP, CRM, legacy systems
The layers are grouped into three tiers: PRESENTATION, BUSINESS LOGIC, DATA
Introduction and Terminology
OLAP (On-Line Analytical Processing):
• Reporting based on (multidimensional) data analysis
• Read-only access on repositories of moderate to large size (typically, data warehouses), aiming at minimizing response time
Data Mining:
• Discovery of novel, implicit patterns from, possibly heterogeneous, data sources
• Uses a mix of sophisticated statistical and high-performance computing techniques
Data Warehousing
DATA WAREHOUSE
Database with the following distinctive characteristics:
• Separate from operational databases
• Subject oriented: provides a simple, concise view on one or more selected areas, in support of the decision process
• Constructed by integrating multiple, heterogeneous data sources
• Contains historical data: spans a much longer time horizon than operational databases
• (Mostly) read-only access: periodic, infrequent updates
Data Warehousing
Types of Data Warehouses
• Enterprise Warehouse: covers all areas of interest for an organization
• Data Mart: covers a subset of corporate-wide data that is of interest for a specific user group (e.g., marketing)
• Virtual Warehouse: offers a set of views constructed on demand on operational databases. Some of the views could be materialized (precomputed)
Data Warehousing
Multi-Tier Architecture
[Figure: data sources (operational DBs) → cleaning/extraction → data storage (data warehouse server) → OLAP engine → front-end tools (analysis, reporting, data mining)]
Data Warehousing
Multidimensional (logical) Model
Data are organized around one or more FACT TABLEs. Each fact table collects a set of homogeneous events (facts) characterized by dimensions and dependent attributes
Example: Sales at a chain of stores

Product | Supplier | Store | Period | Sales  | Units
P1      | S1       | St1   | 1qtr   | 1500 € | 30
P2      | S1       | St3   | 2qtr   | 9000 € | 100

Dimensions: Product, Supplier, Store, Period
Dependent attributes: Sales, Units
Data Warehousing
Multidimensional (logical) Model (cont'd)
Each dimension can in turn consist of a number of attributes. In this case the value in the fact table is a foreign key referring to an appropriate dimension table

STAR SCHEMA
Fact table: (Product, Supplier, Store, Period, Sales, Units)
Dimension table product: (Code, Description)
Dimension table supplier: (Code, Name, Address)
Dimension table store: (Code, Name, Manager, Address)
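The foreign-key relationship between a fact table and its dimension tables can be illustrated with a minimal Python sketch. The two fact rows reuse the sample sales of the example; every dimension attribute value (descriptions, names, managers, addresses) and the helper `resolve` are invented placeholders, not part of the original slides.

```python
# Illustrative star schema: one fact table plus dimension tables keyed by code.
# Dimension attribute values below are made-up placeholders.
product_dim = {
    "P1": {"description": "product one"},
    "P2": {"description": "product two"},
}
store_dim = {
    "St1": {"name": "store one", "manager": "manager one", "address": "address one"},
    "St3": {"name": "store three", "manager": "manager three", "address": "address three"},
}
fact_table = [
    {"product": "P1", "supplier": "S1", "store": "St1", "period": "1qtr", "sales": 1500, "units": 30},
    {"product": "P2", "supplier": "S1", "store": "St3", "period": "2qtr", "sales": 9000, "units": 100},
]


def resolve(fact):
    """Follow the foreign keys of one fact row into the dimension tables."""
    row = dict(fact)
    row["product_description"] = product_dim[fact["product"]]["description"]
    row["store_manager"] = store_dim[fact["store"]]["manager"]
    return row


joined = [resolve(f) for f in fact_table]
```

Each fact row stores only the codes; descriptive attributes live once in the dimension tables and are looked up on demand, which is what gives the schema its star shape.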
Data Warehousing
Conceptual Star Schema (E-R Model)
[Figure: E-R diagram in which the entity FACTs is linked to each of the entities PRODUCT, STORE, SUPPLIER and PERIOD; each relationship carries the cardinalities (1,1) on one side and (0,N) on the other]
Data Warehousing
OLAP Server Architectures
They are classified based on the underlying storage layouts
ROLAP (Relational OLAP): uses a relational DBMS to store and manage warehouse data (i.e., table-oriented organization), and specific middleware to support OLAP queries.
MOLAP (Multidimensional OLAP): uses array-based data structures and pre-computed aggregated data. It shows higher performance than ROLAP but may not scale well if not properly implemented
HOLAP (Hybrid OLAP): ROLAP approach for low-level raw data, MOLAP approach for higher-level data (aggregations).
Data Warehousing
MOLAP Approach
[Figure: DATA CUBE representing a fact table. Axes: Product (CD, TV, DVD, PC), Period (1Qtr, 2Qtr, 3Qtr, 4Qtr), Region (Europe, North America, Middle East, Far East). Highlighted cell: total sales of PCs in Europe in the 4th quarter of the year]
Data Warehousing
OLAP Operations: SLICE
Fix values for one or more dimensions
E.g., Product = CD
Data Warehousing
OLAP Operations: DICE
Fix ranges for one or more dimensions
E.g., Product = CD or DVD, Period = 1Qtr or 2Qtr, Region = Europe or North America
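SLICE and DICE can be sketched in a few lines of Python, assuming a cube stored as a dictionary from (product, period, region) coordinates to total sales. The sales figures and the function names `dice` and `slice_` are invented for illustration.

```python
# Hypothetical cube: (product, period, region) -> total sales. Figures are invented.
cube = {
    ("CD", "1Qtr", "Europe"): 10,
    ("CD", "2Qtr", "Europe"): 12,
    ("DVD", "1Qtr", "North America"): 7,
    ("PC", "4Qtr", "Europe"): 30,
    ("TV", "3Qtr", "Far East"): 5,
}
DIMS = ("product", "period", "region")


def dice(cube, **allowed):
    """Keep the cells whose coordinates fall in the allowed set of each named dimension."""
    idx = {name: i for i, name in enumerate(DIMS)}
    return {
        cell: value
        for cell, value in cube.items()
        if all(cell[idx[name]] in values for name, values in allowed.items())
    }


def slice_(cube, **fixed):
    """A slice is a dice with exactly one allowed value per fixed dimension."""
    return dice(cube, **{name: {value} for name, value in fixed.items()})


cd_slice = slice_(cube, product="CD")
diced = dice(cube, product={"CD", "DVD"}, period={"1Qtr", "2Qtr"},
             region={"Europe", "North America"})
```

In this sketch a slice keeps the fixed coordinate in each cell key; an OLAP engine would typically drop the fixed dimension from the result.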
Data Warehousing
OLAP Operations: Roll-Up
Aggregate data by grouping along one (or more) dimensions
E.g.: group quarters into a year
Drill-Down = (Roll-Up)^-1, i.e., the inverse operation
[Figure: the data cube with axes Product (CD, TV, DVD, PC) and Region (Europe, North America, Middle East, Far East), with the Period axis collapsed to the single value "year"]
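As a rough illustration of roll-up along the Period dimension, the Python sketch below sums quarterly cells into yearly ones. The cell values and the function name `roll_up_period` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical quarterly cells: (product, quarter, region) -> sales. Figures are invented.
quarterly = {
    ("PC", "1Qtr", "Europe"): 20,
    ("PC", "2Qtr", "Europe"): 25,
    ("PC", "4Qtr", "Europe"): 30,
    ("CD", "1Qtr", "Far East"): 4,
    ("CD", "3Qtr", "Far East"): 6,
}


def roll_up_period(cells):
    """Group the quarters into one 'year' value by summing over the period axis."""
    yearly = defaultdict(int)
    for (product, _quarter, region), sales in cells.items():
        yearly[(product, "year", region)] += sales
    return dict(yearly)


yearly = roll_up_period(quarterly)
```

Drill-down cannot be computed from `yearly` alone: going back to quarters requires the finer-grained cells, which is why warehouses keep the detailed data.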
Data Warehousing
Cube Operator: summaries for each subset of dimensions
[Figure: the data cube (Product × Period × Region) extended with SUM cells along each dimension. Highlighted cells: yearly sales of PCs in the Middle East; yearly sales of electronics in the Middle East]
Data Warehousing
Cube Operator: it is equivalent to computing the following lattice of cuboids
(product, period, region) = FACT TABLE
(product, period)   (product, region)   (period, region)
(product)   (period)   (region)
() = no grouping (grand total)
Data Warehousing
Cube Operator in SQL:
SELECT Product, Period, Region, SUM(TotalSales)
FROM FACT_TABLE
GROUP BY Product, Period, Region
WITH CUBE
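What the cube operator computes can be emulated in plain Python, as a sketch only: one group-by per subset of the three dimensions, with dropped dimensions reported as ALL. The fact rows are invented, and the function below is not how a DBMS implements the operator.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("product", "period", "region")
# Hypothetical fact rows: (product, period, region, total_sales). Figures are invented.
facts = [
    ("CD", "1qtr", "Europe", 10),
    ("CD", "2qtr", "Europe", 5),
    ("PC", "1qtr", "Far East", 20),
]


def cube(facts, n_dims=len(DIMS)):
    """Sum the measure for every subset of dimensions; dropped dimensions show as 'ALL'."""
    totals = defaultdict(int)
    for size in range(n_dims + 1):
        for kept in combinations(range(n_dims), size):
            for row in facts:
                key = tuple(row[i] if i in kept else "ALL" for i in range(n_dims))
                totals[key] += row[-1]
    return dict(totals)


result = cube(facts)
```

With three dimensions the loop visits 2^3 = 8 subsets, one per cuboid of the lattice; `result[("ALL", "ALL", "ALL")]` holds the grand total.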
Data Warehousing
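Result of the cube operator (* = aggregated value, … = further rows or values):

Product | Period | Region | Tot. Sales
CD      | 1qtr   | Europe | *
…       | …      | …      | *
CD      | 1qtr   | ALL    | *
…       | …      | ALL    | *
ALL     | 1qtr   | Europe | *
ALL     | …      | …      | *
CD      | ALL    | Europe | *
…       | ALL    | …      | *
CD      | ALL    | ALL    | *
…       | …      | …      | *
ALL     | 1qtr   | ALL    | *
…       | …      | …      | *
ALL     | ALL    | Europe | *
ALL     | ALL    | …      | *
ALL     | ALL    | ALL    | *

Rows without ALL: all combinations of Product, Period and Region
Rows with Region = ALL only: all combinations of Product and Period
Rows with Period = Region = ALL: all combinations of Product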
Data Warehousing
Roll-up (= partial cube) Operator in SQL:
SELECT Product, Period, Region, SUM(TotalSales)
FROM FACT_TABLE
GROUP BY Product, Period, Region
WITH ROLLUP
Data Warehousing
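Result of the roll-up operator (* = aggregated value, … = further rows or values):

Product | Period | Region | Tot. Sales
CD      | 1qtr   | Europe | *
…       | …      | …      | *
CD      | 1qtr   | ALL    | *
…       | …      | ALL    | *
CD      | ALL    | ALL    | *
…       | ALL    | ALL    | *
ALL     | ALL    | ALL    | *

Rows without ALL: all combinations of Product, Period and Region
Rows with Region = ALL only: all combinations of Product and Period
Rows with Period = Region = ALL: all combinations of Product
Reduces the number of group-bys from exponential ($2^n$ for the cube) to linear ($n + 1$) in the number $n$ of dimensions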
Data Warehousing
It is equivalent to computing the following subset of the lattice of cuboids:
(product, period, region) → (product, period) → (product) → ()
[Figure: the lattice of cuboids of the cube operator, with the cuboids of this chain highlighted]
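The difference from the full cube can be seen in a short Python sketch that aggregates only over the prefixes of the GROUP BY list. The fact rows and the function name `rollup` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, period, region, total_sales). Figures are invented.
facts = [
    ("CD", "1qtr", "Europe", 10),
    ("CD", "2qtr", "Europe", 5),
    ("PC", "1qtr", "Far East", 20),
]


def rollup(facts, n_dims=3):
    """Sum the measure for each prefix of the GROUP BY list: n_dims + 1 group-bys in total."""
    totals = defaultdict(int)
    for kept in range(n_dims, -1, -1):
        for row in facts:
            # Keep the first `kept` dimensions, replace the remaining ones with 'ALL'.
            key = row[:kept] + ("ALL",) * (n_dims - kept)
            totals[key] += row[-1]
    return dict(totals)


result = rollup(facts)
group_bys_cube = 2 ** 3    # one per subset of the three dimensions
group_bys_rollup = 3 + 1   # one per prefix, including the empty one
```

Combinations such as (ALL, 1qtr, ALL) never appear in `result`, because Period is only ever grouped together with Product.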
Data Mining
Data Explosion: tremendous amount of data accumulated in digital repositories around the world (e.g., databases, data warehouses, web, etc.)
Production of digital data per year:
• 3-5 exabytes ($10^{18}$ bytes) in 2002
• 30% increase per year (1999-2002)
We are drowning in data, but starving for knowledge
See: www.sims.berkeley.edu/how-much-info
Data Mining
Knowledge Discovery in Databases
[Figure: the KDD process. Information repositories (DBs) → data cleaning & integration → data warehouse → selection & transformation → training data → data mining → evaluation & presentation → KNOWLEDGE. Domain knowledge is an additional input to the process]
Data Mining
Types of input data:
• Unaggregated data (e.g., records, transactions)
• Aggregated data (e.g., summaries)
• Spatial, geographic data
• Data from time-series databases
• Text, video, audio, web data
Data Mining
DATA MINING
Process of discovering interesting patterns or knowledge from a (typically) large amount of data stored either in databases, data warehouses, or other information repositories
Alternative names: knowledge discovery/extraction, information harvesting, business intelligence
In fact, data mining is a step of the more general process of knowledge discovery in databases (KDD)
Interesting: non-trivial, implicit, previously unknown, potentially useful
Data Mining
Interestingness measures
Purpose: filter irrelevant patterns to convey concise and useful knowledge. Certain data mining tasks can produce thousands or millions of patterns, most of which are redundant, trivial or irrelevant.
Objective measures: based on statistics and structure of patterns (e.g., frequency counts)
Subjective measures: based on the user's belief about the data. Patterns may become interesting if they confirm or contradict a user's hypothesis, depending on the context.
Interestingness measures can be employed both after and during the pattern discovery. In the latter case, they improve the search efficiency
Data Mining
Multidisciplinarity of Data Mining
[Figure: data mining algorithms at the center, drawing on statistics, high-performance computing, machine learning, DB technology and visualization]
Data Mining
Data Mining Problems
Association Rules: discovery of rules $X \Rightarrow Y$ ("objects that satisfy condition X are also likely to satisfy condition Y"). The problem first found application in market basket or transaction data analysis, where "objects" are transactions and "conditions" are containment of certain itemsets
Association Rules
Statement of the Problem
$I$ = set of items
$D$ = set of transactions: $t \in D \Rightarrow t \subseteq I$
For an itemset $X \subseteq I$,
$\mathrm{support}(X)$ = fraction or number of transactions containing $X$
ASSOCIATION RULE: $X - Y \Rightarrow Y$, with $Y \subset X \subseteq I$
$\mathrm{support} = \mathrm{support}(X)$
$\mathrm{confidence} = \mathrm{support}(X) / \mathrm{support}(X - Y)$
PROBLEM: find all association rules with $\mathrm{support} \ge$ minsup and $\mathrm{confidence} \ge$ minconfidence
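The two measures translate almost directly into Python. The sketch below assumes support is taken as a fraction of the transactions; the baskets and the function names are invented for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule (X - Y) => Y, with Y a proper subset of X."""
    x, y = frozenset(x), frozenset(y)
    assert y < x, "Y must be a proper subset of X"
    return support(x, transactions) / support(x - y, transactions)


# Hypothetical baskets, invented for illustration.
baskets = [frozenset(b) for b in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

# Rule {bread} => {milk}, i.e. X = {bread, milk} and Y = {milk}.
rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread", "milk"}, {"milk"}, baskets)
```

Here the rule holds in 2 of 4 baskets (support 0.5), and in 2 of the 3 baskets that contain bread (confidence 2/3). The sketch does not guard against an antecedent with zero support.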
Association Rules
Market Basket Analysis:
Items = products sold in a store or chain of stores
Transactions = customers' shopping baskets
Rule $X - Y \Rightarrow Y$ = customers who buy items in $X - Y$ are likely to buy items in $Y$
Of diapers and beer...
Analysis of customer behaviour in a supermarket chain has revealed that males who buy diapers on Thursdays and Saturdays are likely to also buy beer
...That's why these two items are found close to each other in most stores
Association Rules
Applications:
• Cross/Up-selling (especially in e-commerce, e.g., Amazon)
  Cross-selling: push complementary products
  Up-selling: push similar products
• Catalog design
• Store layout (e.g., diapers and beer close to each other)
• Financial forecast
• Medical diagnosis
Association Rules
Def.: Frequent itemset = itemset with $\mathrm{support} \ge$ minsup
General strategy to discover all association rules:
1. Find all frequent itemsets
2. $\forall$ frequent itemset $X$, output all rules $X - Y \Rightarrow Y$, with $Y \subset X$, which satisfy the minconfidence constraint
Observation:
Minsup and minconfidence are objective measures of interestingness. Their proper setting, however, requires the user's domain knowledge. Low values may yield exponentially (in $|I|$) many rules, high values may cut off interesting rules
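The two-step strategy can be sketched in Python as follows. Step 1 is done by brute-force enumeration of all itemsets, which is only feasible for a handful of items and is not one of the efficient mining algorithms; the transactions, thresholds and function names are invented, and rules with an empty consequent are skipped by assumption.

```python
from itertools import combinations


def frequent_itemsets(transactions, minsup):
    """Step 1, by brute force: map each itemset with support >= minsup to its support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    found = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = sum(set(candidate) <= t for t in transactions) / n
            if s >= minsup:
                found[frozenset(candidate)] = s
    return found


def rules(frequent, minconf):
    """Step 2: for each frequent X, emit (X - Y, Y, confidence) for proper non-empty Y."""
    out = []
    for x, sup_x in frequent.items():
        for size in range(1, len(x)):
            for y in combinations(sorted(x), size):
                antecedent = x - frozenset(y)
                # Every subset of a frequent itemset is frequent, so the lookup succeeds.
                conf = sup_x / frequent[antecedent]
                if conf >= minconf:
                    out.append((antecedent, frozenset(y), conf))
    return out


# Hypothetical transactions, invented for illustration.
db = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
freq = frequent_itemsets(db, minsup=0.5)
found_rules = rules(freq, minconf=0.6)
```

Step 2 never rescans the transactions: all the supports it needs were already collected in step 1.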
Association Rules
Example
Min. sup 50%, min. confidence 50%

Transaction-id | Items bought
10             | A, B, C
20             | A, C
30             | A, D
40             | B, E, F

Frequent Itemsets | Support
{A}               | 75%
{B}               | 50%
{C}               | 50%
{A,C}             | 50%

For rule $\{A\} \Rightarrow \{C\}$:
support = support({A,C}) = 50%
confidence = support({A,C}) / support({A}) = 66.6%
For rule $\{C\} \Rightarrow \{A\}$ (same support as $\{A\} \Rightarrow \{C\}$):
confidence = support({A,C}) / support({C}) = 100%
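The figures of this example can be re-derived with a few lines of Python. The transactions are those of the example; the helper `support` and the variable names are assumptions of this sketch.

```python
# The four transactions of the example (ids 10, 20, 30, 40).
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}


def support(itemset):
    """Fraction of the example transactions containing all items of `itemset`."""
    return sum(set(itemset) <= t for t in transactions.values()) / len(transactions)


sup_ac = support("AC")             # support of both {A} => {C} and {C} => {A}
conf_a_c = sup_ac / support("A")   # confidence of {A} => {C}
conf_c_a = sup_ac / support("C")   # confidence of {C} => {A}
```

The two rules share the same support but not the same confidence, because the denominators support({A}) = 75% and support({C}) = 50% differ.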
Association Rules
Dealing with Large Outputs
Observation: depending on the value of minsup and on the dataset, the number of frequent itemsets can be exponentially large but may contain a lot of redundancy
Goal: determine a subset of frequent itemsets of considerably smaller size, which provides the same information content, i.e., from which the complete set of frequent itemsets can be derived without further information.
Association Rules
Notion of Closedness
$I$ (items), $D$ (transactions), minsup (support threshold)
Closed Frequent Itemsets =
$\{X \subseteq I : \mathrm{supp}(X) \ge \mathit{minsup} \;\&\; \mathrm{supp}(Y) < \mathrm{supp}(X) \;\; \forall\, I \supseteq Y \supset X\}$
• It's a subset of all frequent itemsets
• For every frequent itemset $X$ there exists a closed frequent itemset $Y \supseteq X$ such that $\mathrm{supp}(Y) = \mathrm{supp}(X)$, i.e., $Y$ and $X$ occur in exactly the same transactions
• All frequent itemsets and their frequencies can be derived from the closed frequent itemsets without further information
Association Rules
Notion of Maximality
Maximal Frequent Itemsets =
$\{X \subseteq I : \mathrm{supp}(X) \ge \mathit{minsup} \;\&\; \mathrm{supp}(Y) < \mathit{minsup} \;\; \forall\, I \supseteq Y \supset X\}$
• It's a subset of the closed frequent itemsets, hence of all frequent itemsets
• For every (closed) frequent itemset $X$ there exists a maximal frequent itemset $Y \supseteq X$
• All frequent itemsets can be derived from the maximal frequent itemsets without further information; however, their frequencies must be determined from $D$ $\Rightarrow$ information loss
Association Rules
Example of closed and maximal frequent itemsets
DATASET (minsup = 3)

Tid | Items
10  | B, A, D, F, G, H
20  | B, L
30  | B, A, D, F, L, H
40  | B, L, G, H
50  | B, A, D, F
60  | B, A, D, F, L, G

Closed Frequent Itemsets | Supporting Transactions | Support | Maximal
B                        | all                     | 6       | NO
B, A, D, F               | 10, 30, 50, 60          | 4       | YES
B, L                     | 20, 30, 40, 60          | 4       | YES
B, H                     | 10, 30, 40              | 3       | YES
B, G                     | 10, 40, 60              | 3       | YES
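The closed and maximal itemsets of this dataset can be recomputed with a brute-force Python sketch. It enumerates every itemset over the seven items, which is fine at this size but is not a practical mining algorithm; the helper names are assumptions.

```python
from itertools import combinations

# The example dataset (Tid -> items), with minsup = 3 counted in transactions.
dataset = {
    10: "BADFGH", 20: "BL", 30: "BADFLH",
    40: "BLGH", 50: "BADF", 60: "BADFLG",
}
MINSUP = 3


def supp(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(set(itemset) <= set(items) for items in dataset.values())


items = sorted(set("".join(dataset.values())))
frequent = {
    frozenset(c): supp(c)
    for size in range(1, len(items) + 1)
    for c in combinations(items, size)
    if supp(c) >= MINSUP
}

# Closed: every proper superset has strictly smaller support. Checking frequent
# supersets is enough, since an infrequent superset has support < minsup <= supp(X).
closed = {x for x, s in frequent.items()
          if all(frequent[y] < s for y in frequent if x < y)}
# Maximal: no proper superset is frequent.
maximal = {x for x in frequent if not any(x < y for y in frequent)}
```

On this dataset 21 itemsets are frequent, 5 of them are closed and 4 are maximal, which shows how much smaller the condensed representations can be.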
Association Rules
(All frequent) VS (closed frequent) VS (maximal frequent)
[Figure: lattice of the frequent itemsets of the example dataset, each with its support]
(∅,6)
(B,6) (A,4) (D,4) (F,4) (L,4) (G,3) (H,3)
(AB,4) (AD,4) (AF,4) (BD,4) (BF,4) (DF,4) (BL,4) (BG,3) (BH,3)
(ABD,4) (ABF,4) (ADF,4) (BDF,4)
(ABDF,4)
Closed & maximal: ABDF, BL, BG, BH
Closed but not maximal: B
Sequential Patterns
Data Mining Problems
Sequential Patterns: discovery of frequent subsequences in a collection of sequences (sequence database), each representing a set of events occurring at subsequent times. The ordering of the events in the subsequences is relevant.
Sequential Patterns
PROBLEM: Given a set of sequences, find the complete set of frequent subsequences

Sequence database:

SID | sequence
10  | <(a)(abc)(ac)(d)(cf)>
20  | <(ad)(c)(bc)(ae)>
30  | <(ef)(ab)(df)(c)(b)>
40  | <(e)(g)(af)(c)(b)(c)>

A sequence: <(ef)(ab)(df)(c)(b)>
An element may contain a set of items. Items within an element are unordered and are listed alphabetically.
<(a)(bc)(d)(c)> is a subsequence of <(a)(abc)(ac)(d)(cf)>
For minsup = 2, <(ab)(c)> is a frequent subsequence
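The subsequence relation and the support of a pattern can be checked with the following Python sketch. It encodes each element as a string of items and uses a greedy left-to-right match; the function names are assumptions, and the sketch only tests given patterns, it does not mine them.

```python
def is_subsequence(pattern, sequence):
    """True if the elements of `pattern` map, in order, into supersets among the elements of `sequence`."""
    pos = 0
    for element in pattern:
        # Greedily take the earliest remaining element that contains this one.
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


# The sequence database of the example; each element is written as a string of items.
database = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def support(pattern):
    """Number of database sequences that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())
```

For instance, `support(["ab", "c"])` counts sequences 10 and 30 only: sequences 20 and 40 contain a and b, but never inside the same element.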
Sequential Patterns
Applications:
• Marketing
• Natural disaster forecast
• Analysis of web log data
• DNA analysis
Classification
Data Mining Problems
Classification/Regression: discovery of a model or function that maps objects into predefined classes (classification) or into suitable values (regression). The model/function is computed on a training set (supervised learning)
Classification
Statement of the Problem
Training Set: $T = \{t_1, \ldots, t_n\}$, a set of $n$ examples
Each example $t_i$
• is characterized by $m$ features $(t_i(A_1), \ldots, t_i(A_m))$
• belongs to one of $k$ classes $(C_i : 1 \le i \le k)$
GOAL
From the training data find a model to describe the classes accurately and concisely using the data's features. The model will then be used to assign class labels to unknown (previously unseen) records
Classification
Applications:
• Classification of (potential) customers for: credit approval, risk prediction, selective marketing
• Performance prediction based on selected indicators
• Medical diagnosis based on symptoms or reactions to therapy
Classification
Classification process
[Figure: BUILD the model from the training set; VALIDATE the model on test data; use the model to PREDICT class labels for unseen data]
Classification
Observations:
• Features can be either categorical, if belonging to unordered domains (e.g., Car Type), or continuous, if belonging to ordered domains (e.g., Age)
• The class could be regarded as an additional attribute of the examples, which we want to predict
• Classification vs Regression:
  Classification: builds models for categorical classes
  Regression: builds models for continuous classes
• Several types of models exist: decision trees, neural networks, Bayesian (statistical) classifiers.
Classification
Classification using decision trees
Definition: Decision Tree for a training set T
• Labeled tree
• Each internal node v represents a test on a feature. The edges from v to its children are labeled with mutually exclusive results of the test
• Each leaf w represents the subset of examples of T whose feature values are consistent with the test results found along the path from the root to w. The leaf is labeled with the majority class of the examples it contains
Classification
Example:
examples = car insurance applicants, class = insurance risk

Age | Car Type | Risk
23  | family   | high
18  | sports   | high
43  | sports   | high
68  | family   | low
32  | truck    | low
20  | family   | high

Features: Age, Car Type. Class: Risk

Model (decision tree):
• Age < 25 → High
• otherwise, Car Type ∈ {sports} → High
• otherwise → Low
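The tree of this example can be written as nested tests and checked against the training set, as in the Python sketch below. The tree is hand-coded from the example rather than learned by an induction algorithm, and the function and variable names are assumptions.

```python
# Training examples of the example: (age, car_type, risk).
training_set = [
    (23, "family", "high"), (18, "sports", "high"), (43, "sports", "high"),
    (68, "family", "low"), (32, "truck", "low"), (20, "family", "high"),
]


def predict_risk(age, car_type):
    """Walk the two-level tree: test Age < 25 first, then Car Type in {sports}."""
    if age < 25:
        return "high"
    if car_type in {"sports"}:
        return "high"
    return "low"


# Fraction of training examples whose class the tree reproduces.
training_accuracy = sum(
    predict_risk(age, car) == risk for age, car, risk in training_set
) / len(training_set)
```

The tree classifies all six training examples correctly; accuracy on the training set says nothing about unseen applicants, which is why a separate test set is used for validation.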
Clustering
Data Mining Problems
Clustering: grouping objects into classes with the objective of maximizing intra-class similarity and minimizing inter-class similarity (unsupervised learning)
Clustering
Statement of the Problem
GIVEN: N objects, each characterized by p attributes (a.k.a. variables)
GROUP: the objects into K clusters featuring
• High intra-cluster similarity
• Low inter-cluster similarity
Remark: Clustering is an instance of unsupervised learning or learning by observations, as opposed to supervised learning or learning by examples (classification)
Clustering
Example
[Figure: two scatter plots of objects in the plane (attribute X, attribute Y); one object is marked as an outlier]
Clustering
Several types of clustering problems exist, depending on the specific input/output requirements and on the notion of similarity:
• The number of clusters K may be provided in input or not
• As output, for each cluster one may want a representative object, or a set of aggregate measurements, or the complete set of objects belonging to the cluster
• Distance-based clustering: similarity of objects is related to some kind of geometric distance
• Conceptual clustering: a group of objects forms a cluster if they define a certain concept
Clustering
Applications:
• Marketing: identify groups of customers based on their purchasing patterns
• Biology: categorize genes with similar functionalities
• Image processing: clustering of pixels
• Web: clustering of documents in meta-search engines
• GIS: identification of areas of similar land use
Clustering
Challenges:
• Scalability: strategies to deal with very large datasets
• Variety of attribute types: defining a good notion of similarity is hard in the presence of different types of attributes and/or different scales of values
• Variety of cluster shapes: common distance measures tend to yield only spherical clusters
• Noisy data: outliers may affect the quality of the clustering
• Sensitivity to input ordering: the ordering of the input data should not affect the output (or its quality)
Clustering
Main Distance-Based Clustering Methods
Partitioning Methods: create an initial partition of the objects into K clusters, and refine the clustering using iterative relocation techniques. A cluster is represented either by the mean value of its component objects or by a centrally located component object (medoid)
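One well-known partitioning method that represents each cluster by its mean is k-means. The Python sketch below is a minimal version of it, assuming squared Euclidean distance, caller-supplied initial centers and a fixed number of iterations; the sample points are invented, and the slides do not name a specific algorithm.

```python
def k_means(points, centers, iterations=10):
    """Iterative relocation: assign each point to its nearest center, then move each center to its cluster mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical 2-D objects forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, clusters = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the initial centers, and the mean is sensitive to outliers; a medoid-based variant would pick an actual object of the cluster as its representative instead.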
Hierarchical Methods: start with all objects belonging to distinct clusters and then successively merge the pair of closest clusters/objects into one single cluster (agglomerative approach); or start with all objects belonging to one cluster and successively split up a cluster into smaller ones (divisive approach)
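The agglomerative approach can be sketched in Python as below. The distance between two clusters is taken to be that of their closest members (single linkage), which is an assumption of this sketch, as are the stopping criterion of K remaining clusters, the sample points and the function name.

```python
def agglomerative(points, k):
    """Start from singleton clusters and repeatedly merge the closest pair until k clusters remain."""
    def dist(c1, c2):
        # Single linkage (an assumption): squared distance between the closest members.
        return min(sum((a - b) ** 2 for a, b in zip(p, q)) for p in c1 for q in c2)

    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: dist(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters.pop(j)  # j > i, so index i is unaffected by the pop
        clusters[i].extend(merged)
    return clusters


# Hypothetical 2-D objects: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
result = agglomerative(points, k=3)
```

Each merge scans all pairs of clusters, so this naive version is far too slow for large datasets; the isolated point stays in its own cluster, which is how an outlier typically shows up in a hierarchical clustering.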