
A Novel Framework to Elucidate Core Classes in a Dataset

Daniele Soria
daniele.soria@nottingham.ac.uk

G54DMT - Data Mining Techniques and Applications
Topic 4: Applications
Lecture 2
3rd May 2012

Outline

- Aims and motivations
- Framework
- Clustering and consensus
- Supervised learning
- Validation on two clinical data sets
- Conclusions and questions

2

Aims and motivations

- Develop an original framework with multiple steps to extract the most representative classes from any dataset
- Refine the phenotypic characterisation of breast cancer
- Move the medical decision-making process from a single-technique approach to a multi-technique one
- Guide clinicians in the choice of the most appropriate and effective treatment, towards personalised healthcare

3
Framework (1)

[Flowchart: Dataset → Pre-processing → Clustering (HCA, FCM, KM, PAM) → is the number of clusters n known? no → Validity indices; yes → Characterisation & agreement → Classes]

4

Framework (2)

[Flowchart: Classes → Supervised learning with C4.5, MLP-ANN and a Bayesian classifier chosen by asking whether the data is ~ N(μ, σ): Naïve Bayes if yes, NPBC if no → Classes characterisation]

5

Data pre-processing

- Dealing with missing values
- Homogeneous variables
- Compute descriptive statistics

6
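A minimal sketch of this pre-processing step in Python (pandas/scikit-learn; the file name and numeric columns are hypothetical):

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("markers.csv")                    # hypothetical input file
    # Missing values: impute each variable with its median
    X = SimpleImputer(strategy="median").fit_transform(df)
    # Homogeneous variables: rescale to zero mean, unit variance
    X = StandardScaler().fit_transform(X)
    # Descriptive statistics of the cleaned data
    print(pd.DataFrame(X, columns=df.columns).describe())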

Clustering

- Four different algorithms:
  1) Hierarchical (HCA)
  2) Fuzzy c-means (FCM)
  3) K-means (KM)
  4) Partitioning Around Medoids (PAM)
- Methods run with the number of clusters varying from 2 to 20

7
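A sketch of this loop with scikit-learn, using k-means and hierarchical clustering as stand-ins (PAM and FCM would come from optional packages such as scikit-learn-extra and scikit-fuzzy):

    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering

    X = np.random.rand(100, 4)                         # stand-in for the dataset
    labels = {}
    for k in range(2, 21):                             # number of clusters from 2 to 20
        labels["KM", k] = KMeans(n_clusters=k, n_init=10,
                                 random_state=0).fit_predict(X)
        labels["HCA", k] = AgglomerativeClustering(n_clusters=k).fit_predict(X)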

8

Hierarchical method

- Hierarchy of clusters
- Represented with a tree (dendrogram)
- Clusters obtained by cutting the dendrogram at a specific height
9

Hierarchical method (cont.)

[Dendrogram figure; cutting at a chosen height gives 6 clusters]

10
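In SciPy the same idea looks roughly like this (ward linkage is an assumption; any linkage works):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(50, 4)                          # stand-in data
    Z = linkage(X, method="ward")                      # hierarchy of clusters
    by_height = fcluster(Z, t=1.0, criterion="distance")   # cut at height 1.0
    six = fcluster(Z, t=6, criterion="maxclust")           # or cut into 6 clusters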

FCM method

- Minimisation of the objective function

  J(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{c} \mu_{ij}^{m} \, \|x_i - v_j\|^2

- X = {x_1, x_2, ..., x_n}: n data points
- V = {v_1, v_2, ..., v_c}: c cluster centres
- U = (\mu_{ij})_{n \times c}: fuzzy partition matrix
- \mu_{ij}: membership of x_i to v_j
- m: fuzziness index

11

FCM method (cont.)

[Scatter plots: FCM clustering results on example data]
12
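A bare-bones NumPy sketch of the alternating updates that minimise J(U, V) (not a library implementation):

    import numpy as np

    def fcm(X, c, m=2.0, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        U = rng.dirichlet(np.ones(c), size=len(X))     # fuzzy partition matrix
        for _ in range(iters):
            Um = U ** m
            V = (Um.T @ X) / Um.sum(axis=0)[:, None]   # cluster centres
            d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
            U = d ** (-2.0 / (m - 1.0))                # membership update
            U /= U.sum(axis=1, keepdims=True)
        return U, V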

KM method

- Minimisation of the objective function

  J(V) = \sum_{j=1}^{k} \sum_{i=1}^{c_j} \|x_i - v_j\|^2

- ||x_i - v_j||: Euclidean distance between x_i and v_j
- c_j: number of data points in cluster j
- v_j: centre of cluster j, recomputed as

  v_j = \frac{1}{c_j} \sum_{i=1}^{c_j} x_i, \quad j = 1, \dots, k
13

KM method (cont.)

[Scatter plots: KM clustering results on example data]
14
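The two formulas translate almost directly into NumPy (a sketch, without the usual convergence test):

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        V = X[rng.choice(len(X), size=k, replace=False)]    # initial centres
        for _ in range(iters):
            d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
            labels = d.argmin(axis=1)                       # nearest centre v_j
            V = np.array([X[labels == j].mean(axis=0)       # v_j = cluster mean
                          if (labels == j).any() else V[j] for j in range(k)])
        return labels, V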

PAM method

- Search for k representative objects (medoids) among the observations
- Minimum sum of dissimilarities
- k clusters are constructed by assigning each observation to the nearest medoid

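A sketch using the optional scikit-learn-extra package (an assumption; its method="pam" option selects the classic build/swap search):

    import numpy as np
    from sklearn_extra.cluster import KMedoids

    X = np.random.rand(100, 4)                         # stand-in data
    pam = KMedoids(n_clusters=3, method="pam", random_state=0).fit(X)
    print(pam.medoid_indices_)                         # which observations are medoids
    print(pam.inertia_)                                # sum of dissimilarities to medoids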
15

PAM method (cont.)

[Scatter plots: PAM clustering results on example data]

If n is unknown…

- Validity indices computation
- Defined considering the data dispersion within and between clusters
- According to decision rules, the best number of clusters may be selected

16

17

Validity Indices

Index               Decision rule
Calinski            min_n [(I_{n+1} - I_n) - (I_n - I_{n-1})]
Hartigan            min_n [(I_{n+1} - I_n) - (I_n - I_{n-1})]
Scott               max_n (I_n - I_{n-1})
Marriot             max_n [(I_{n+1} - I_n) - (I_n - I_{n-1})]
Trace (W)           max_n [(I_{n+1} - I_n) - (I_n - I_{n-1})]
Friedman & Rubin    max_n (I_n - I_{n-1})

(I_n denotes the index value for n clusters.)
Characterisation & agreement

- Visualisation techniques
  - Biplots
  - Boxplots
- Indices for assessing agreement
  - Cohen's kappa (κ)
  - Rand
18
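Both agreement indices are available in scikit-learn; note that Cohen's kappa compares labels directly, so the cluster labels must be aligned first, while the Rand index is label-invariant:

    from sklearn.metrics import cohen_kappa_score, rand_score

    a = [0, 0, 1, 1, 2, 2]                             # toy labellings of six points
    b = [0, 0, 1, 2, 2, 2]
    print(cohen_kappa_score(a, b))                     # chance-corrected agreement
    print(rand_score(a, b))                            # pair-counting agreement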

Definition of classes

- Consensus clustering:
  - Align labels so that similar clusters are named in the same way by different algorithms
  - Take into account points assigned to groups with the same label


19
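A sketch of the alignment step, matching cluster names with the Hungarian algorithm on the confusion matrix and then keeping the points on which the labellings agree (labels assumed to be 0..k-1):

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.metrics import confusion_matrix

    def align(ref, other):
        C = confusion_matrix(ref, other)
        rows, cols = linear_sum_assignment(-C)         # maximise matched counts
        mapping = dict(zip(cols, rows))                # other label -> ref label
        return np.array([mapping[l] for l in other])

    ref = np.array([0, 0, 1, 1, 2, 2])
    other = np.array([2, 2, 0, 0, 1, 1])               # same clusters, permuted names
    agreed = ref == align(ref, other)                  # consensus points (all True here)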

Supervised learning (1)

- Model-based classification for prediction of future cases
- Aims:
  - High quality prediction
  - Reduce number of variables (biomarkers)
  - Prefer 'white-box' prediction models

20

Supervised learning (2)

- Different techniques:
  - C4.5
  - Multi-Layer Perceptron Artificial Neural Network (MLP-ANN)
  - Naïve Bayes (NB) or Non-Parametric Bayesian Classifier (NPBC)

21
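A sketch of this step with scikit-learn stand-ins: a decision tree with entropy splits in place of C4.5 (scikit-learn implements CART rather than C4.5), Gaussian Naïve Bayes, and an MLP; the NPBC is not in standard libraries:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)                  # stand-in for (data, core classes)
    for clf in (DecisionTreeClassifier(criterion="entropy"),
                GaussianNB(),
                MLPClassifier(max_iter=2000)):
        print(type(clf).__name__, cross_val_score(clf, X, y, cv=5).mean())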

22

C4.5 classifier

- Each attribute can be used to make a decision that splits the data into smaller subsets
- Information gain results from choosing an attribute for splitting the data
- The attribute with the highest information gain is the one used to make the decision

23
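The gain computation itself is small; a sketch for a binary split:

    import numpy as np

    def entropy(y):
        if len(y) == 0:
            return 0.0
        p = np.unique(y, return_counts=True)[1] / len(y)
        return -(p * np.log2(p)).sum()

    def information_gain(y, mask):
        left, right = y[mask], y[~mask]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        return entropy(y) - remainder                  # gain of splitting on `mask`

    y = np.array([0, 0, 0, 1, 1, 1])
    mask = np.array([True, True, True, False, False, False])
    print(information_gain(y, mask))                   # perfect split: 1 bit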

Multi-Layer Perceptron

- Feed-forward ANN
- Nonlinear activation function used by each neuron
- Layers of hidden nodes, each node connected with every node in the following layer
- Learning carried out through back-propagation

24

Naïve Bayes classifier

- Probabilistic classifier based on Bayes' theorem
- Good for multi-dimensional data
- Common assumptions:
  - Independence of variables
  - Normality


25

NPBC: ratio between areas (1)

- Similar to Naïve Bayes
- Useful for non-normal data
- Based on the ratio between areas under the histogram
- The closer a data point is to the median, the higher the probability of belonging to that specific class

Soria et al. (2011): A 'Non-Parametric' Version of the Naïve Bayes Classifier. Knowledge-Based Systems, 24, 775-784

26

NPBC: ratio between areas (2)

[Figure: ER expression density histograms showing the median m and a data point x, for the cases x < m and x > m]
26
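A rough sketch of the ratio-between-areas idea (one reading of it, not the published NPBC code): the area under the class histogram between a point x and the class median, taken relative to a half of the total area, gives a score that is 1 at the median and falls towards 0 in the tails:

    import numpy as np

    def area_score(x, class_values):
        F = (class_values <= x).mean()                 # empirical CDF at x
        return 1.0 - 2.0 * abs(F - 0.5)                # 1 at the median, ~0 in the tails

    values = np.random.gamma(2.0, 30.0, size=500)      # skewed, non-normal marker
    print(area_score(np.median(values), values))       # close to 1
    print(area_score(values.max(), values))            # close to 0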

Characterisation of classes

- Biplots
- Boxplots
- Relation with clinical information
- Survival analysis

27
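For the survival-analysis part, a sketch assuming the optional lifelines package (one Kaplan-Meier curve per class; the arrays are hypothetical):

    import numpy as np
    from lifelines import KaplanMeierFitter

    months = np.array([12, 30, 45, 60, 80, 100])       # survival time in months
    event = np.array([1, 0, 1, 0, 1, 0])               # 1 = event observed, 0 = censored
    kmf = KaplanMeierFitter()
    kmf.fit(months, event_observed=event, label="Class 1")
    kmf.plot_survival_function()                       # proportion surviving vs. time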

Validation of the framework (1)

- Set of markers involved in breast cancer cell cycle regulation
- 347 patients and 4 markers
- Survival and grade available
- K-means, PAM and Fuzzy C-means used
- Validity indices for best number of clusters

28

KM validity indices

[Plots of the Calinski, Hartigan, Scott, Marriot, TraceW and TraceW^(-1)B indices against the number of clusters (2-14) for K-means]
29

PAM validity indices

[Plots of the Calinski, Hartigan, Scott, Marriot, TraceW and TraceW^(-1)B indices against the number of clusters (2-14) for PAM]
30

Optimum number of clusters

Index                   K-means   PAM
Calinski and Harabasz   14        3
Hartigan                3         3
Scott and Symons        3         3
Marriot                 14        3
TraceW                  3         3
TraceW^(-1)B            3         3
Minimum sum of ranks    3         3

31
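The "minimum sum of ranks" row combines the indices: each index ranks the candidate numbers of clusters, and the k with the smallest total rank is chosen. A sketch with hypothetical per-index scores (lower = better):

    import numpy as np
    from scipy.stats import rankdata

    ks = np.array([2, 3, 4, 5])
    scores = np.array([[3, 1, 2, 4],                   # index 1's preference per k
                       [2, 1, 3, 4],                   # index 2
                       [1, 2, 3, 4]])                  # index 3
    ranks = np.vstack([rankdata(s) for s in scores])
    best_k = ks[ranks.sum(axis=0).argmin()]            # k = 3 for these scores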

FCM validity indices

[Plots of the Fuzzy Hypervolume, Partition Density, Xie-Beni and Partition Coefficient indices against the number of clusters (2-20)]
32

Optimum number of clusters

(K-means and PAM results as on the previous slide.)

Index                   Fuzzy C-means
Fuzzy Hypervolume       2
Partition Density       3
Xie-Beni                2
Partition Coefficient   2
Minimum sum of ranks    2

33

Agreement and consensus

- Kappa index high for the 3-group classification
- 3 common classes found (8.9% not classified):
  - Intermediate expression (class 1)
  - High expression (class 2)
  - Low expression (class 3)

Pairwise agreement (Cohen's kappa / Rand):

           PAM           FCM
K-means    0.93 / 0.91   0.89 / 0.88
PAM        ---           0.91 / 0.91

34

Biplots of classes (1)

[PCA biplot (Comp.1 vs Comp.2) with marker loadings SUV, hMOF, ACH4K16, H3K9Me3; points coloured by Class 1, Class 2, Class 3]
35

Biplots of classes (2)

[PCA biplot (Comp.1 vs Comp.2) with marker loadings SUV, hMOF, ACH4K16, H3K9Me3; points coloured by Class 1, Class 2, Class 3 and N.C. (not classified)]
36

Boxplots of classes

[Boxplots of SUV, hMOF, ACH4K16 and H3K9Me3 expression (0-300) for all patients in classes 1-3 and for each class separately]
37

Kaplan-Meier curves

[Kaplan-Meier survival curves (proportion surviving vs. survival time in months): Class 1 - 119 patients - 55 events; Class 2 - 87 patients - 34 events; Class 3 - 85 patients - 41 events]
38

Clinical information

- High-grade patients (poor prognosis) in classes 1 and 3 (worst survival)
- Common classes group patients with similar outcome

39

C4.5 decision tree

[Decision tree figure]

40

41

Validation of the framework (2)

- Patients entered into the Nottingham Tenovus Primary Breast Carcinoma Series between 1986 and 1998
- 1076 cases informative for all 25 biological markers
- Clinical information (grade, size, age, survival, follow-up, etc.) available
42

Previous work: 4 groups

[Tree: Breast Cancer → Luminal (→ Luminal A, Luminal B), Basal, HER2]



43

Consensus Clustering

[Consensus tree:]
Breast Cancer
- ER+, Luminal CKs+ (Luminal)
  - PgR+, HER3+/HER4+ → Luminal A: Class 1 (18.8%)
  - PgR+, HER3-/HER4- → Luminal N: Class 2 (14.2%)
  - PgR-, HER3+/HER4+ → Luminal B: Class 3 (7.4%)
- ER-, Basal CKs (Basal)
  - p53+ → p53 altered: Class 4 (7.6%)
  - p53- → p53 normal: Class 5 (6.4%)
- ER-, HER2+ → HER2: Class 6 (7.2%)
- Mixed Class (38.4%)

Conclusions

- Original framework for the identification of core classes
- Formed by different logical steps
- Validated on novel data sets
- 3 classes (Low, Intermediate and High)
- High marker levels associated with better survival
- Discovery of novel cancer subtypes

44

45

Main references

1.
L
. Kaufman, P.J.
Rousseeuw
.
Finding groups in data
, Wiley series in probability and
mathematical statistics, 1990.

2.
A.
Weingessel
, et al.
An Examination Of Indexes For Determining The Number Of
Clusters In Binary Data Sets,
Working Paper No.29, 1999
.

3.
I. H. Witten and E. Frank.
DataMining
: Practical machine learning tools and
techniques.
Morgan Kaufmann Publishers, 2005.

4.
A.K. Jain and R.C.
Dubes
.
Algorithms for Clustering Data
. Prentice
-
Hall advanced
reference series, Prentice
-
Hall. Englewood Cliffs, NJ, USA, 1988.

5.
A.K. Jain, M.N.
Murty
, and P.J. Flynn.
Data clustering: A review
. ACM Computing
Surveys, 31(3):264

323, 1999.

6.
P.F.
Velleman

and D.C.
Hoaglin
.
Applications, Basics and Computing of Exploratory
Data Analysis
. Boston, Mass.: Duxbury Press, 1981.

7.
J
. Quinlan.
C4.5: Programs for Machine Learning.

Morgan Kaufmann, Los Altos,
California, 1993.

8.
S.
Haykin
.
Neural Networks: A Comprehensive Foundation.

Prentice Hall, 2 edition,
1998.

9.
G. John and P. Langley.
Estimating continuous distributions in
bayesian

classifiers.

Proceeding of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995.

46

Acknowledgements

- Dr JM Garibaldi, Dr J Bacardit
- Nottingham Breast Cancer Pathology RG
- Prof IO Ellis, Dr AR Green, Dr D Powe, Prof G Ball

Thank You!

Contact: daniele.soria@nottingham.ac.uk

47