slides - University of Nottingham


19 Oct 2013



Data Mining Techniques and Applications

The University of Nottingham

Clustering


Alvaro Garcia-Piquer

Research Group in Intelligent Systems (GRSI)

La Salle - Ramon Llull University

alvarog@salle.url.edu

Outline

1. Introduction
2. Clustering Taxonomy
3. Some Algorithms
4. Validation of Clustering Solutions
5. Summary

1. Introduction

Grouping Data

Data mining [Han, 2006]
- We have rich data, but poor information
- Data mining: searching for knowledge (interesting patterns) in your data

Clustering [Kaufman, 2005]
- Grouping data according to a set of criteria, providing experts with a possible classification or categorization of the elements

Clustering example

[Figure: elements grouped by family and grouped by age]

Applications

- Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records
- Biology: classification of living organisms according to their DNA
- Image segmentation: identifying objects in images according to the features of each pixel (position, color)







2. Clustering Taxonomy

Clustering steps

1. Choose the number of clusters (number of clusters)
2. Choose the type of clustering (relationship of the clusters, cases distribution into the clusters, search typology)
3. Clustering process (clustering algorithm)
4. Convergence? (optimization of clusters) If no, repeat the clustering process; if yes, continue
5. Validation of the clustering solution


Number of clusters

The number of clusters to find can be determined:

- Manually: the search space of the algorithm is reduced
- Automatically: the search space is not delimited, so it is more difficult for the algorithm to converge

Example (number of clusters not given): can you group these data according to the colour? How many groups can you identify?

Example (number of clusters given): can you group these data in two clusters according to the colour?
- Cluster 1: blue data
- Cluster 2: green data

Relationships of the clusters

- Partitional: there are no relationships between the clusters
- Hierarchical: all the clusters have some relationship between them. Two types:
  o Agglomerative
  o Divisive

[Gan, 2007; Duda, 2000]


[Figure: partitional clustering]

[Figure: hierarchical agglomerative clustering]

[Figure: hierarchical divisive clustering]

Cases distribution into the clusters

- Hard: each data element belongs to exactly one cluster
- Fuzzy (soft): data elements can belong to more than one cluster
  o Each object has associated membership grades, which indicate the degree to which the object belongs to the different clusters
  o The sum of all the membership grades of each object has to be the same (normally 1)

[Gan, 2007; Duda, 2000]


Hard clustering

[Figure: hard clustering]

Fuzzy clustering

[Figure: four elements with their membership grades]
- Red cluster: 0.7, Green cluster: 0.3
- Red cluster: 0.6, Green cluster: 0.4
- Red cluster: 1, Green cluster: 0
- Red cluster: 0.2, Green cluster: 0.8
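The membership grades of the fuzzy clustering example can be checked and turned into a hard partition with a few lines of Python. This is only an illustrative sketch: the element names e1 to e4 and both helper functions are hypothetical, and taking the cluster with the highest grade is just one simple way to harden a fuzzy result.

```python
# Membership grades copied from the fuzzy clustering example
# (the element names e1..e4 are hypothetical labels).
memberships = {
    "e1": {"red": 0.7, "green": 0.3},
    "e2": {"red": 0.6, "green": 0.4},
    "e3": {"red": 1.0, "green": 0.0},
    "e4": {"red": 0.2, "green": 0.8},
}

def is_valid_fuzzy_partition(grades, total=1.0, tol=1e-9):
    """Check that the grades of every element add up to the same total."""
    return all(abs(sum(g.values()) - total) <= tol for g in grades.values())

def harden(grades):
    """Turn a fuzzy partition into a hard one: keep the cluster with the highest grade."""
    return {name: max(g, key=g.get) for name, g in grades.items()}

hard = harden(memberships)
```

In the hard version every element belongs to exactly one cluster, so the information carried by grades such as 0.6 / 0.4 is lost.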

Search typology (1)

Centre-based algorithms [Gan, 2007]

- Each cluster is defined by a prototype, and the instances are assigned to the closest prototype
- The clusters have convex shapes and each cluster is represented by a centre
- They cannot find clusters of arbitrary shapes
- They are sensitive to the initialization and they may fall into a locally optimal solution

[Figure: two x-y panels showing the instances and the prototype of each cluster]
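The assignment rule of centre-based algorithms (each instance goes to the closest prototype) can be sketched as follows. The data, the prototypes and the use of Euclidean distance are assumptions made for the illustration; the slides do not fix a distance measure.

```python
import math

def nearest_prototype(instance, prototypes):
    """Return the index of the prototype closest to the instance (Euclidean distance)."""
    return min(range(len(prototypes)), key=lambda i: math.dist(instance, prototypes[i]))

# Hypothetical 2-D data and prototypes, chosen only for illustration.
instances = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
prototypes = [(1.0, 1.5), (8.5, 8.0)]

assignments = [nearest_prototype(p, prototypes) for p in instances]
```

Because every instance is compared only with the prototypes, the resulting regions are convex, which is why this family cannot recover clusters of arbitrary shape.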

Search typology (2)

Graph-based algorithms [Gan, 2007]

- They construct a graph or hypergraph and then apply some heuristic to partition it
- They can find arbitrarily shaped clusters
- They are sensitive to the initialization and they may fall into a locally optimal solution

[Figure: two x-y panels]
- Graph construction: each instance is linked to its nearest neighbour that has not been visited yet
- Eliminating edges: the edges that are longer than a threshold are eliminated
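The two steps of the graph-based example (graph construction, then edge elimination) can be sketched in Python. This is a minimal illustration under assumptions: the walk starts at the first instance, distances are Euclidean, and the function names, data and threshold are hypothetical.

```python
import math

def nearest_neighbour_chain(points):
    """Link each instance to its nearest neighbour that has not been visited yet."""
    unvisited = set(range(1, len(points)))
    current, edges = 0, []
    while unvisited:
        nxt = min(unvisited, key=lambda j: math.dist(points[current], points[j]))
        edges.append((current, nxt, math.dist(points[current], points[nxt])))
        unvisited.remove(nxt)
        current = nxt
    return edges

def clusters_from_edges(n, edges, threshold):
    """Drop edges longer than the threshold; the connected components are the clusters."""
    label = list(range(n))  # every instance starts in its own component
    for a, b, length in edges:
        if length <= threshold:
            old, new = label[b], label[a]
            label = [new if l == old else l for l in label]
    return label

points = [(0, 0), (0, 1), (1, 1), (5, 5), (5, 6), (6, 6)]
edges = nearest_neighbour_chain(points)
labels = clusters_from_edges(len(points), edges, threshold=2.0)
```

The result depends on where the walk starts, which is one way to see the sensitivity to initialization mentioned in the list.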

Search typology (3)

Model-based algorithms [Gan, 2007]

- It is assumed that the data are generated by a mixture of probability distributions, each of which represents a different cluster
- The distributions are estimated from the data and the data instances are assigned to them
- They are sensitive to the initialization and they may fall into a locally optimal solution

[Figure: x-y plot of the data and two Gaussian distributions with parameters µ1, σ1 and µ2, σ2]
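For a mixture of two one-dimensional Gaussians like the one in the figure, the assignment of an instance to the distributions can be sketched as below. The sketch covers only the assignment step and assumes the parameters are already known; the mixture weights and the values of µ and σ are hypothetical, and estimating them from the data (for example with the EM algorithm) is not shown.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def responsibilities(x, components):
    """Probability that x was generated by each component (weight, mu, sigma)."""
    weighted = [w * gaussian_pdf(x, mu, sigma) for w, mu, sigma in components]
    total = sum(weighted)
    return [v / total for v in weighted]

# Hypothetical mixture: equal weights, (mu1, sigma1) = (0, 1), (mu2, sigma2) = (5, 1).
components = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]
r = responsibilities(1.0, components)
cluster = r.index(max(r))  # hard assignment: the most probable component
```

The list `r` can also be read directly as fuzzy membership grades, since its values add up to 1.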

Search typology (4)

Search-based algorithms [Gan, 2007]

- They complement the previous strategies
- The previous strategies may not be able to find the globally optimal clustering that fits the data set
- This strategy tries to search the overall solution space and find a globally optimal clustering that fits the data set
  o Genetic algorithms
  o Ant colony optimization
  o Simulated annealing
- They are very time-consuming



Search typology (5)

Density-based algorithms [Gan, 2007]

- Clusters are defined as dense regions separated by low-density regions
- They need only one scan of the original data set and can handle noise
- The number of clusters is not required
- They can find arbitrarily shaped clusters

[Figure: two x-y panels; isolated instances are marked as noise]
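One way to make the idea of dense regions and noise concrete is a small procedure in the style of DBSCAN. The slides do not name a specific density-based algorithm, so the choice of technique, the parameters `eps` and `min_pts`, and the data are all assumptions of this sketch, which favours readability over efficiency.

```python
import math

NOISE = -1

def density_clustering(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE if it lies in a low-density region."""
    def neighbours(i):
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:      # not dense enough to start a cluster
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:    # border point reached from a dense point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is dense too, so keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11), (5, 30)]
labels = density_clustering(points, eps=1.5, min_pts=3)
```

The number of clusters is never passed in: it follows from how many dense regions the data contain for the chosen `eps` and `min_pts`.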

Search typology (6)

Subspace-based algorithms [Gan, 2007]

- They are applied to high-dimensional data sets
- They consist of finding clusters in each dimension by identifying dense units
- The final clusters are found by overlapping the clusters of each dimension

[Figure: two x-y panels]

Optimization of the clusters

Several clustering algorithms are iterative and consist of optimizing the evaluation of the clusters according to one or several objectives.

- Single objective: the clustering process consists of optimizing a single objective
- Several objectives: the clustering process consists of optimizing several objectives, obtaining a trade-off between them

[Law, 2004]

Single objective (1)

The clusters are obtained taking into account the attributes 'x' and 'y'.

Criterion to optimize:
1) Each cluster has to contain elements of the same shape

Criteria to optimize:
1) Each cluster has to contain elements of the same shape
2) The number of clusters has to be minimized

[Figure: two x-y panels, one per set of criteria]

These two criteria are considered a single objective because optimizing one criterion does not affect the other.

Single objective (2)

The clusters are obtained taking into account the attributes 'x' and 'y'.

Criteria to optimize:
1) Minimize intra-cluster variance
2) Maximize inter-cluster variance

[Figure: two x-y panels: intra-cluster variance optimized; inter-cluster variance optimized]

It is impossible to optimize both criteria at the same time.
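The two criteria can be computed for a given partition as sketched below. The slides give no formulas, so both definitions are assumptions of this example: the intra-cluster variance is taken as the mean squared distance from each instance to the centre of its cluster, and the inter-cluster variance as the mean squared distance between the centres of every pair of clusters.

```python
from itertools import combinations

def mean_point(points):
    """Component-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def intra_cluster_variance(clusters):
    """Mean squared distance from each instance to the centre of its own cluster."""
    n = sum(len(c) for c in clusters)
    return sum(sq_dist(p, mean_point(c)) for c in clusters for p in c) / n

def inter_cluster_variance(clusters):
    """Mean squared distance between the centres of every pair of clusters."""
    centres = [mean_point(c) for c in clusters]
    pairs = list(combinations(centres, 2))
    return sum(sq_dist(a, b) for a, b in pairs) / len(pairs)

# Hypothetical partition of four 2-D instances into two clusters.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
intra = intra_cluster_variance(clusters)
inter = inter_cluster_variance(clusters)
```

A partition is usually judged better when `intra` is low and `inter` is high; the second function needs at least two clusters.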

Single objective (3)

Validation indexes [Halkidi, 2002]

- They evaluate a clustering solution according to the quality of the clusters (shape) using the inter-cluster and intra-cluster variance simultaneously
- Some indexes:
  o Davies-Bouldin index
  o Dunn's index
  o Silhouette index
  o ...
- Example: Davies-Bouldin index

[Dunn, 1974]
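The Davies-Bouldin index named in the example can be sketched in a few lines. This follows the usual definition of the index (for each cluster, the worst ratio of summed scatter to centre distance, averaged over the clusters) with Euclidean distance; the data and the function names are hypothetical, and the formula on the original slide is not available to compare against.

```python
import math

def centroid(points):
    """Component-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def davies_bouldin(clusters):
    """Davies-Bouldin index: lower values mean compact, well separated clusters."""
    centres = [centroid(c) for c in clusters]
    # Scatter of a cluster: mean distance from its instances to its centre.
    scatter = [sum(math.dist(p, m) for p in c) / len(c) for c, m in zip(clusters, centres)]
    k = len(clusters)
    worst = []
    for i in range(k):
        worst.append(max((scatter[i] + scatter[j]) / math.dist(centres[i], centres[j])
                         for j in range(k) if j != i))
    return sum(worst) / k

# Two hypothetical partitions of the same four instances.
tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
loose = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
```

Comparing `davies_bouldin(tight)` with `davies_bouldin(loose)` shows how a single number combines the intra-cluster and inter-cluster information: the partition with the lower value is preferred.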


Several objectives (1)

Ensemble clustering [Law, 2004]

Criteria to optimize:
1) Minimize intra-cluster variance
2) Maximize inter-cluster variance

[Figure: x-y panels with the clustering obtained for each criterion and the combination of the results]

Several objectives (2)

Multi-objective clustering

Criteria to optimize:
1) Minimize intra-cluster variance
2) Maximize inter-cluster variance

[Figure: solutions plotted by intra-cluster variance against 1 - inter-cluster variance, with a dominated solution marked, and x-y panels of the corresponding clusterings]
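The notion of a dominated solution in the figure can be sketched with a Pareto dominance test. The sketch assumes, as on the axes of the figure, that both objectives are expressed so that lower is better (intra-cluster variance and 1 - inter-cluster variance); the solution values are hypothetical.

```python
def dominates(a, b):
    """True if solution a is at least as good as b on every objective and better on one.

    Every objective is minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions if o != s)]

# Hypothetical solutions as (intra-cluster variance, 1 - inter-cluster variance).
solutions = [(0.2, 0.8), (0.4, 0.5), (0.7, 0.3), (0.6, 0.6)]
front = pareto_front(solutions)
```

The solutions left in `front` are the trade-offs between the two criteria; choosing one of them is a separate decision.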

Taxonomy Summary

- Number of clusters
  o Manual
  o Automatic
- Relationships of the clusters
  o Partitional
  o Hierarchical
- Cases distribution into the clusters
  o Hard
  o Fuzzy (soft)
- Search typology
  o Centre-based
  o Search-based
  o Graph-based
  o Density-based
  o Model-based
  o Subspace-based
  o ...
- Optimization of the clusters
  o Single objective
  o Several objectives: ensemble clustering, multi-objective clustering

3. Some Algorithms

k-means

[MacQueen, 1967]

- Partitional
- Centre-based
- Hard clustering
- Number of clusters: manual
- Single objective

It groups the instances into k circular clusters according to the distance between them and the centre of the cluster, updating the centres with the new assignments. This process is repeated until convergence is reached.

Similar algorithms: x-means (number of clusters automatic), fuzzy C-means (fuzzy clustering)
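The loop described for k-means (assign each instance to the closest centre, update the centres, repeat until nothing changes) can be sketched as follows. Passing the initial centres explicitly, the use of Euclidean distance and the `max_iter` safeguard are assumptions of this sketch; in practice the initial centres are often chosen at random, which is where the sensitivity to initialization comes from.

```python
import math

def k_means(points, centres, max_iter=100):
    """Lloyd-style k-means: assign to the closest centre, then recompute the centres."""
    centres = list(centres)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
                          for p in points]
        if new_assignment == assignment:   # convergence: nothing changed
            break
        assignment = new_assignment
        for c in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # an empty cluster keeps its old centre
                centres[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centres

# Hypothetical data and initial centres (k = 2).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
assignment, centres = k_means(points, centres=[(0.0, 0.0), (10.0, 10.0)])
```

The number of clusters is fixed by the number of initial centres, which matches the "number of clusters: manual" entry in the list.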

Single-link

[Johnson, 1967]

- Hierarchical agglomerative
- Centre-based
- Hard clustering
- Number of clusters: automatic
- Single objective

In each step, the two clusters whose two closest members have the smallest distance are merged.

Similar algorithms: Complete-link, Average-link
- They follow other heuristics to merge the instances
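The merge rule of single-link can be sketched as below. The sketch is deliberately naive (it recomputes all distances at every step) and stops when `n_clusters` clusters remain; that stopping parameter is an assumption added for the example, since the full agglomerative hierarchy would continue until a single cluster is left.

```python
import math

def single_link(points, n_clusters):
    """Merge the two clusters with the closest pair of members until n_clusters remain."""
    clusters = [[p] for p in points]       # every instance starts as its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link distance: the two closest members of the two clusters.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i is still valid
    return clusters

# Hypothetical 2-D instances.
points = [(0, 0), (0, 1), (0, 2), (5, 0), (5, 1)]
result = single_link(points, n_clusters=2)
```

Replacing `min` with `max` in the distance between two clusters gives the complete-link heuristic, and using the mean of the pairwise distances gives average-link.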


4. Validation of Clustering Solutions

Clustering validation (1)

How to validate a clustering solution?
- The data is not labelled

External criteria [Halkidi, 2002]
- Expert in the domain of the problem as judge
- Comparing with an intuitive solution
  o F-Measure, Rand Index, Adjusted Rand Index...
- Explanations of each cluster to justify them
  o Main features (attributes) of the elements of each cluster
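As an illustration of comparing a result with an intuitive solution, the Rand Index can be sketched as the fraction of instance pairs on which the two partitions agree (both put the pair in the same cluster, or both separate it). The two label lists are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of instance pairs on which two partitions agree."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total

reference = [0, 0, 0, 1, 1, 1]   # hypothetical intuitive solution
obtained = [1, 1, 0, 0, 0, 0]    # hypothetical clustering result
score = rand_index(reference, obtained)
```

Only co-membership matters, so renaming the clusters does not change the score; a value of 1 means the two partitions are identical up to cluster names.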

Clustering validation (2)

Relative criteria [Halkidi, 2002]
- Comparing the clustering results according to a validation index (or a combination of them)
- Validation indexes use only the information of the data set
- Normally they are used to select the best solution from several clustering results obtained with different clustering algorithms
  o This does not mean that the solution is a good solution to the problem
  o The selected solution depends on the validation index used

5. Summary

Summary

How to solve a clustering problem?

- Data analysis
  o Pre-process the data if necessary (noise, unknown values...)
- Selection of the clustering algorithm
  o It is important to know the domain of the problem
  o Is there a known number of clusters?
  o Can the clusters overlap?
  o Is a hierarchical relationship between clusters necessary?
  o Is it important to detect arbitrary shapes?
  o What are the clustering criteria?
  o ...

References

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, vol. 39, pp. 1-38, 1977.

G. Corral, A. Garcia-Piquer, A. Orriols-Puig, A. Fornells, and E. Golobardes. Analysis of Vulnerability Assessment Results based on CAOS. Applied Softcomputing Journal, in press, 2010.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern classification. John Wiley & Sons, Inc, 2000.

J. C. Dunn. Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 95-104, 1974.

G. Gan, M. Chaoqun, and J. Wu. Data Clustering Theory, Algorithms, and Applications. ASA-SIAM, 2007.

M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Cluster validity methods: part I. ACM SIGMOD Record, 31(2):40-45, 2002.

J. Han and M. Kamber. Data Mining. Concepts and techniques. Morgan Kaufmann, 2006.

S. C. Johnson. Hierarchical Clustering Schemes. Psychometrika, 2:241-254, 1967.

L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, Inc, 2005.

M. Law, A. Topchy, and A. Jain. Multiobjective data clustering. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2:424-430, 2004.

M. Matteucci. A Tutorial on Clustering Algorithms, Politecnico di Milano. <http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html>

J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, 1:281-297, 1967.

I. H. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann Publishers, 2005.