# Data clustering


## A Novel Framework to Elucidate Core Classes in a Dataset

Daniele Soria (daniele.soria@nottingham.ac.uk)

G54DMT - Data Mining Techniques and Applications

Topic 4: Applications, Lecture 2

3rd May 2012

## Outline

- Aims and motivations
- Framework
- Clustering and consensus
- Supervised learning
- Validation on two clinical data sets
- Conclusions and questions

## Aims and motivations

- Develop an original framework with multiple steps to extract the most representative classes from any dataset
- Refine the phenotypic characterisation of breast cancer
- Move the medical decision-making process from a single-technique approach to a multi-technique one
- Guide clinicians in choosing the most appropriate and effective treatment, towards personalised healthcare
## Framework (1)

[Flowchart] Dataset → Pre-processing → Clustering (HCA, FCM, KM, PAM) → is the number of clusters *n* known? If no, compute validity indices first; if yes, proceed to characterisation & agreement → Classes

## Framework (2)

[Flowchart] Classes → Supervised learning (C4.5; MLP-ANN; if the data are ~ N(μ, σ), Naïve Bayes, otherwise NPBC) → Classes characterisation

## Data pre-processing

- Dealing with missing values
- Homogeneous variables
- Compute descriptive statistics

## Clustering

Four different algorithms:

1. Hierarchical (HCA)
2. Fuzzy c-means (FCM)
3. K-means (KM)
4. Partitioning Around Medoids (PAM)

Methods run with the number of clusters varying from 2 to 20.
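In code, this scan over candidate cluster counts might look like the following sketch (scikit-learn's K-means on synthetic stand-in data; the blobs, seed and the 2-20 range are illustrative, not the clinical dataset):

```python
# Sketch: run a clustering algorithm for every candidate number of
# clusters, as the framework does for HCA, FCM, KM and PAM.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

inertias = {}
for k in range(2, 21):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squares at this k
```

The per-k results (here, inertia) are what the validity indices of the later slides are computed from.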

## Hierarchical method

- Builds a hierarchy of clusters
- Represented with a tree (dendrogram)
- Clusters obtained by cutting the dendrogram at a specific height
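A minimal sketch of this step with SciPy (Ward linkage on illustrative toy data; asking for six clusters plays the role of cutting the dendrogram at a given height):

```python
# Sketch: build the merge tree (dendrogram) and cut it to obtain clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

Z = linkage(X, method="ward")                    # hierarchy of clusters
labels = fcluster(Z, t=6, criterion="maxclust")  # cut into at most 6 clusters
```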

## Hierarchical method (cont.)

[Dendrogram of the example data, cut at a height yielding 6 clusters]

## FCM method

Minimisation of the objective function

$$J(U,V) = \sum_{i=1}^{n}\sum_{j=1}^{c} \mu_{ij}^{m}\,\lVert x_i - v_j\rVert^{2}$$

where

- $X = \{x_1, x_2, \dots, x_n\}$: $n$ data points
- $V = \{v_1, v_2, \dots, v_c\}$: $c$ cluster centres
- $U = (\mu_{ij})_{n \times c}$: fuzzy partition matrix
- $\mu_{ij}$: membership of $x_i$ to $v_j$
- $m$: fuzziness index
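The alternating updates that minimise $J(U,V)$ can be sketched in NumPy (the random data, $c = 3$ and $m = 2$ are illustrative choices):

```python
# Sketch: a few fuzzy c-means iterations for the objective J(U, V).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))   # n = 50 data points
c, m = 3, 2.0                  # cluster count and fuzziness index

V = X[rng.choice(len(X), c, replace=False)]  # initial centres
for _ in range(10):
    d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
    # membership update: mu_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
    U = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
    # centre update: v_j = sum_i mu_ij^m x_i / sum_i mu_ij^m
    W = U ** m
    V = (W.T @ X) / W.sum(axis=0)[:, None]
```

Each row of the partition matrix sums to one, so every point spreads its membership across all centres.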

## FCM method (cont.)

[Scatter plots of the example data partitioned by FCM]

## KM method

Minimisation of the objective function

$$J(V) = \sum_{j=1}^{k}\sum_{i=1}^{c_j} \lVert x_i - v_j\rVert^{2}, \qquad v_j = \frac{1}{c_j}\sum_{i=1}^{c_j} x_i,\quad j = 1,\dots,k$$

where

- $\lVert x_i - v_j\rVert$: Euclidean distance between $x_i$ and $v_j$
- $c_j$: number of data points in cluster $j$
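A one-to-one sketch of these two formulas (nearest-centre assignment, then $v_j$ as the mean of the points in cluster $j$), in plain NumPy with illustrative data:

```python
# Sketch: a few Lloyd iterations of K-means, matching the formulas above.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
k = 3
V = X[:k].copy()                                   # initial centres

for _ in range(5):
    d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
    labels = d.argmin(axis=1)                      # nearest-centre assignment
    for j in range(k):
        members = X[labels == j]
        if len(members):                           # v_j = mean of cluster j
            V[j] = members.mean(axis=0)
```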

## KM method (cont.)

[Scatter plots of the example data partitioned by K-means]

## PAM method

- Search for k representative objects (medoids) among the observations
- Medoids give the minimum sum of dissimilarities
- k clusters are constructed by assigning each observation to the nearest medoid
## PAM method (cont.)

[Scatter plots of the example data partitioned by PAM]
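The medoid search can be sketched by brute force for tiny illustrative data (real PAM uses a smarter build/swap procedure; the blobs and k = 2 are assumptions for the example):

```python
# Sketch: pick the k medoids that minimise the total dissimilarity,
# then assign every observation to its nearest medoid.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(3, 0.3, (15, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # dissimilarities
k = 2

best_cost, best_medoids = np.inf, None
for medoids in combinations(range(len(X)), k):
    cost = D[:, list(medoids)].min(axis=1).sum()  # sum of nearest-medoid distances
    if cost < best_cost:
        best_cost, best_medoids = cost, medoids

labels = D[:, list(best_medoids)].argmin(axis=1)  # nearest-medoid assignment
```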

## If n is unknown…

- Validity indices computation
- Defined considering the data dispersion within and between clusters
- According to decision rules, the best number of clusters may be selected

## Validity Indices

| Index | Decision rule |
|---|---|
| Calinski | $\min_n \big[(i_{n+1} - i_n) - (i_n - i_{n-1})\big]$ |
| Hartigan | $\min_n \big[(i_{n+1} - i_n) - (i_n - i_{n-1})\big]$ |
| Scott | $\max_n \,(i_n - i_{n-1})$ |
| Marriot | $\max_n \big[(i_{n+1} - i_n) - (i_n - i_{n-1})\big]$ |
| Trace (W) | $\max_n \,(i_n - i_{n-1})$ |
| Friedman & Rubin | $\max_n \big[(i_{n+1} - i_n) - (i_n - i_{n-1})\big]$ |

Here $i_n$ is the index value at $n$ clusters; the rules compare successive differences of the index as $n$ varies.
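As a sketch, one such difference-based decision rule, applied to a toy index curve (the curve values are illustrative, not real validity-index output):

```python
# Sketch: pick the n that maximises the second difference
# (i_{n+1} - i_n) - (i_n - i_{n-1}) of an index curve i_n.
def best_n_by_second_difference(i_vals):
    ns = sorted(i_vals)
    best, best_score = None, float("-inf")
    for n in ns[1:-1]:  # need both neighbours n-1 and n+1
        score = (i_vals[n + 1] - i_vals[n]) - (i_vals[n] - i_vals[n - 1])
        if score > best_score:
            best, best_score = n, score
    return best

# Sharp "elbow" at n = 3: the curve flattens abruptly there.
curve = {2: 100.0, 3: 40.0, 4: 38.0, 5: 37.0, 6: 36.5}
```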

## Characterisation & agreement

- Visualisation techniques
  - Biplots
  - Boxplots
- Indices for assessing agreement
  - Cohen's kappa (κ)
  - Rand
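A sketch of both agreement measures on two illustrative label vectors (Cohen's kappa via scikit-learn, the Rand index by direct pair counting):

```python
# Sketch: agreement between two clustering label vectors.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

a = [0, 0, 1, 1, 2, 2, 2, 0]
b = [0, 0, 1, 1, 2, 2, 1, 0]

# Cohen's kappa assumes the two labellings already use aligned names.
kappa = cohen_kappa_score(a, b)

# Rand index: fraction of point pairs treated the same way by both
# labellings (paired together in both, or apart in both).
pairs = list(combinations(range(len(a)), 2))
agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
rand = agree / len(pairs)
```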

## Definition of classes

Consensus clustering:

- Align labels so that similar clusters are named in the same way by the different algorithms
- Take into account the points assigned to groups with the same label
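The label-alignment step can be sketched as a maximum-overlap assignment on the contingency table, using SciPy's Hungarian solver (the label vectors are illustrative):

```python
# Sketch: rename one algorithm's cluster labels so that matching
# clusters share a name with a reference labelling.
import numpy as np
from scipy.optimize import linear_sum_assignment

ref = np.array([0, 0, 1, 1, 2, 2])      # reference labelling
other = np.array([2, 2, 0, 0, 1, 1])    # same clusters, permuted names

k = 3
overlap = np.zeros((k, k), dtype=int)   # contingency table
for r, o in zip(ref, other):
    overlap[r, o] += 1

rows, cols = linear_sum_assignment(-overlap)  # maximise total overlap
mapping = dict(zip(cols, rows))               # other-label -> reference-label
aligned = np.array([mapping[o] for o in other])
```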

## Supervised learning (1)

- Model-based classification for prediction of future cases
- Aims:
  - High-quality prediction
  - Reduce the number of variables (biomarkers)
  - Prefer 'white-box' prediction models

## Supervised learning (2)

Different techniques:

- C4.5
- Multi-Layer Perceptron Artificial Neural Network (MLP-ANN)
- Naïve Bayes (NB) or Non-Parametric Bayesian Classifier (NPBC)

## C4.5 classifier

- Each attribute can be used to make a decision that splits the data into smaller subsets
- The information gain that results from choosing an attribute for splitting the data is computed
- The attribute with the highest information gain is the one used to make the decision
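The split criterion can be sketched directly (entropy and information gain on a toy categorical attribute; the weather-style rows are illustrative):

```python
# Sketch: the information-gain computation behind C4.5's split choice.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, labels):
    before = entropy(labels)          # entropy of the whole node
    groups = {}
    for row, y in zip(rows, labels):  # split on the attribute's values
        groups.setdefault(row[attr], []).append(y)
    after = sum(len(ys) / len(labels) * entropy(ys) for ys in groups.values())
    return before - after

rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
gain = information_gain(rows, "outlook", labels)  # a perfect split
```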

## Multi-Layer Perceptron

- Feed-forward ANN
- Nonlinear activation function used by each neuron
- Layers of hidden nodes, each node connected with every node in the following layer
- Learning carried out through back-propagation
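A minimal sketch with scikit-learn's MLP (one hidden layer; the layer size, synthetic data and seed are illustrative assumptions):

```python
# Sketch: a small feed-forward MLP trained by back-propagation.
from sklearn.datasets import make_blobs
from sklearn.neural_network import MLPClassifier

X, y = make_blobs(n_samples=120, centers=3, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                    max_iter=2000, random_state=0).fit(X, y)
acc = clf.score(X, y)  # training accuracy on separable blobs
```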

## Naïve Bayes classifier

- Probabilistic classifier based on Bayes' theorem
- Good for multi-dimensional data
- Common assumptions:
  - Independence of variables
  - Normality
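A sketch under exactly those two assumptions, using scikit-learn's GaussianNB on illustrative synthetic data:

```python
# Sketch: Naive Bayes with independent, per-class Gaussian variables.
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=150, centers=3, random_state=1)
nb = GaussianNB().fit(X, y)
probs = nb.predict_proba(X)  # per-class posterior probabilities
```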

## NPBC: ratio between areas (1)

- Similar to Naïve Bayes
- Useful for non-normal data
- Based on the ratio between areas under the histogram
- The closer a data point is to the median, the higher the probability that it belongs to that specific class

Soria et al. (2011): A 'Non-Parametric' Version of the Naïve Bayes Classifier. *Knowledge-Based Systems*, 24, 775-784.
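One plausible reading of the idea, sketched below; this is our illustrative interpretation, not the published NPBC formulation (see Soria et al., 2011): score a point by the histogram area on its smaller side, normalised by the half-area at the median, so that scores peak at the class median.

```python
# Sketch (illustrative interpretation, not the paper's formulation):
# a histogram-area score that is highest for points near the median.
import numpy as np

def npbc_style_score(x, class_values, bins=20):
    hist, edges = np.histogram(class_values, bins=bins, density=True)
    widths = np.diff(edges)
    below = edges[1:] <= x                    # bins entirely below x
    cdf_x = float(np.sum(hist[below] * widths[below]))
    total = float(np.sum(hist * widths))      # ~1.0 with density=True
    # area on the smaller side of x, over the half-area at the median
    return min(cdf_x, total - cdf_x) / (total / 2)

rng = np.random.default_rng(0)
vals = rng.exponential(scale=10, size=1000)   # non-normal class data
m = np.median(vals)
```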

## NPBC: ratio between areas (2)

[ER density histograms illustrating the median m and a datapoint x, for the cases x < m and x > m]

## Characterisation of classes

- Biplots
- Boxplots
- Relation with clinical information
- Survival analysis

## Validation of the framework (1)

- Set of markers involved in breast cancer cell cycle regulation
- 347 patients and 4 markers
- Survival and clinical information available
- K-means, PAM and Fuzzy C-means used
- Validity indices for the best number of clusters

## KM validity indices

[Six panels plotting index value against number of clusters (2-14): Calinski, Hartigan, Scott, Marriot, TraceW and TraceW⁻¹B]

## PAM validity indices

[The same six index-versus-clusters panels, computed for PAM]

## Optimum number of clusters

| Index | K-means | PAM |
|---|---|---|
| Calinski and Harabasz | 14 | 3 |
| Hartigan | 3 | 3 |
| Scott and Symons | 3 | 3 |
| Marriot | 14 | 3 |
| TraceW | 3 | 3 |
| TraceW⁻¹B | 3 | 3 |
| Minimum sum of ranks | 3 | 3 |

## FCM validity indices

[Four panels plotting index value against number of clusters (2-20): Fuzzy Hypervolume, Partition Density, Xie-Beni index and Partition Coefficient]

## Optimum number of clusters (cont.)

| Index | Fuzzy C-means |
|---|---|
| Fuzzy Hypervolume | 2 |
| Partition Density | 3 |
| Xie-Beni | 2 |
| Partition Coefficient | 2 |
| Minimum sum of ranks | 2 |

## Agreement and consensus

- Kappa index high for the 3-group classification
- 3 common classes found (8.9% not classified):
  - Intermediate expression (class 1)
  - High expression (class 2)
  - Low expression (class 3)

| Pair | κ | Rand |
|---|---|---|
| K-means vs PAM | 0.93 | 0.91 |
| K-means vs FCM | 0.89 | 0.88 |
| PAM vs FCM | 0.91 | 0.91 |

## Biplots of classes (1)

[PCA biplot (Comp.1 vs Comp.2) of SUV, hMOF, ACH4K16 and H3K9Me3, with points coloured by class: Class 1, Class 2, Class 3]

## Biplots of classes (2)

[The same biplot with the not-classified (N.C.) points shown alongside Classes 1-3]

## Boxplots of classes

[Boxplots of SUV, hMOF, ACH4K16 and H3K9Me3, for all patients in classes 1-3 and separately for each class]

## Kaplan-Meier curves

[Proportion surviving versus survival time in months, by class: Class 1 - 119 patients, 55 events; Class 2 - 87 patients, 34 events; Class 3 - 85 patients, 41 events]

## Clinical information

- Patients with poor prognosis are in classes 1 and 3, which show the worst survival
- Common classes group patients with similar outcome

## C4.5 decision tree

[Decision tree learnt by C4.5 on the three classes]

## Validation of the framework (2)

- Patients entered into the Nottingham Tenovus Primary Breast Carcinoma Series between 1986 and 1998
- 1076 cases informative for all 25 biological markers
- Clinical information (survival, follow-up, etc.) available

## Previous work: 4 groups

[Diagram] Breast Cancer splits into Luminal (further divided into Luminal A and Luminal B), Basal, and HER2

## Consensus Clustering

[Diagram] Breast Cancer splits by marker status:

- ER+, Luminal CKs+ (Luminal), subdivided by PgR and HER3/HER4 status: Luminal A - Class 1 (18.8%), Luminal N - Class 2 (14.2%), Luminal B - Class 3 (7.4%)
- ER-, Basal CKs- (Basal), subdivided by p53 status: p53 altered - Class 4 (7.6%), p53 normal - Class 5 (6.4%)
- ER-, HER2+: HER2 - Class 6 (7.2%)
- Mixed Class (38.4%)

## Conclusions

- Original framework for the identification of core classes, formed by different logical steps
- Validated on two clinical data sets
- 3 classes found (Low, Intermediate and High)
- High marker levels associated with better survival
- Discovery of novel cancer subtypes

## Main references

1. L. Kaufman, P.J. Rousseeuw. *Finding Groups in Data*. Wiley Series in Probability and Mathematical Statistics, 1990.
2. A. Weingessel, et al. *An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets*. Working Paper No. 29, 1999.
3. I.H. Witten and E. Frank. *Data Mining: Practical Machine Learning Tools and Techniques*. Morgan Kaufmann Publishers, 2005.
4. A.K. Jain and R.C. Dubes. *Algorithms for Clustering Data*. Prentice-Hall, Englewood Cliffs, NJ, USA, 1988.
5. A.K. Jain, M.N. Murty, and P.J. Flynn. *Data clustering: A review*. ACM Computing Surveys, 31(3):264-323, 1999.
6. P.F. Velleman and D.C. Hoaglin. *Applications, Basics and Computing of Exploratory Data Analysis*. Boston, Mass.: Duxbury Press, 1981.
7. J. Quinlan. *C4.5: Programs for Machine Learning*. Morgan Kaufmann, Los Altos, California, 1993.
8. S. Haykin. *Neural Networks: A Comprehensive Foundation*. Prentice Hall, 2nd edition, 1998.
9. G. John and P. Langley. *Estimating continuous distributions in Bayesian classifiers*. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995.

## Acknowledgements

- Dr JM Garibaldi, Dr J Bacardit
- Nottingham Breast Cancer Pathology RG: Prof IO Ellis, Dr AR Green, Dr D Powe, Prof G Ball

Thank You!

Contact: daniele.soria@nottingham.ac.uk