
Nov 25, 2013


Generating Robust and Consensus Clusters from Gene Expression Data

Allan Tucker (a), Stephen Swift (a), Xiaohui Liu (a), Nigel Martin (b), Christine Orengo (c), Paul Kellam (c)

Introduction


Many different clustering algorithms used for gene expression analysis

Little work on inter-method consistency or cross-comparison

Important due to differing results (each algorithm implicitly forces a structure on data)

Obtaining a consensus across methods should improve confidence

The Talk


Compare a number of existing methods for clustering gene expression data

Algorithms for generating robust clusters and consensus clusters

Tested on a set of Amersham Scorecard data with known structure and experimentally obtained virus B-Cell data

Provides specific advantages in the analysis of array-based gene expression data

Clustering Methods


Hierarchical Clustering (R)


PAM (R)


CAST (C++)


Simulated Annealing (C++)

Datasets


Amersham Scorecard


597 genes, 24 blocks with 32 columns and 12 rows under 30 experimental conditions

Repeated experiments which we assume should cluster together


B Cell Data


1987 genes

Comparison of Methods

The Agreement Matrix
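
As a minimal sketch of how an agreement matrix could be built, the Python below assumes that each method's result is a list of cluster labels (one per gene) and that entry (i, j) counts the methods placing genes i and j in the same cluster. The function name `agreement_matrix`, the example labels, and the choice to put the number of methods on the diagonal are illustrative assumptions, not details taken from the talk.

```python
from itertools import combinations


def agreement_matrix(clusterings):
    """Count, for each pair of genes, how many clusterings co-cluster them.

    `clusterings` is a list of label lists, one per method; labels only need
    to be consistent within a method, not across methods.
    """
    n = len(clusterings[0])
    if any(len(labels) != n for labels in clusterings):
        raise ValueError("every clustering must label the same genes")
    matrix = [[0] * n for _ in range(n)]
    for labels in clusterings:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                matrix[i][j] += 1
                matrix[j][i] += 1
    # Assumed convention: a gene always agrees with itself.
    for i in range(n):
        matrix[i][i] = len(clusterings)
    return matrix


# Three hypothetical methods clustering five genes.
methods = [
    [0, 0, 1, 1, 2],
    [5, 5, 5, 7, 7],
    [1, 1, 2, 2, 2],
]
A = agreement_matrix(methods)
for row in A:
    print(row)
```

Here genes 0 and 1 are grouped together by all three hypothetical methods, while genes 0 and 4 are never grouped together.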

Robust Clustering


Takes agreement matrix as input


Place all genes into robust clusters that have full agreement

Deterministic algorithm

Should give higher degree of confidence in clusters


Not all genes will be assigned
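
The points above can be illustrated with a short Python sketch. It assumes the agreement matrix holds, for each gene pair, the number of methods that co-cluster the pair, and that "full agreement" means this count equals the number of methods. The function name `robust_clusters` and the small example matrix are hypothetical; this is one plausible deterministic procedure, not necessarily the one used in the talk.

```python
def robust_clusters(agreement, n_methods):
    """Group genes that every method places together.

    Returns (clusters, unassigned). Genes with no full-agreement partner
    are left unassigned, so not every gene ends up in a cluster.
    """
    n = len(agreement)
    assigned = [False] * n
    clusters = []
    for i in range(n):
        if assigned[i]:
            continue
        # Full agreement is transitive when the matrix comes from real
        # partitions, so one pass over the later genes is enough.
        members = [i] + [
            j for j in range(i + 1, n)
            if not assigned[j] and agreement[i][j] == n_methods
        ]
        if len(members) >= 2:
            for gene in members:
                assigned[gene] = True
            clusters.append(members)
    unassigned = [gene for gene in range(n) if not assigned[gene]]
    return clusters, unassigned


# Hypothetical agreement matrix for five genes and three methods.
A = [
    [3, 3, 1, 0, 0],
    [3, 3, 1, 0, 0],
    [1, 1, 3, 2, 1],
    [0, 0, 2, 3, 2],
    [0, 0, 1, 2, 3],
]
clusters, unassigned = robust_clusters(A, n_methods=3)
print(clusters)
print(unassigned)
```

The procedure has no random component, so the same matrix always yields the same clusters.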

Robust Clustering

Dataset                     ASC     B-cell
No. of Robust Clusters      24      154
% of variables assigned     79%     25%
Max. Robust Cluster size    44      14
Min. Robust Cluster size    2       2
Mean Robust Cluster size    10.2    3.2
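
For readers who want to reproduce this kind of summary, the sketch below shows how the five statistics could be computed from a list of robust clusters. The function name `robust_summary`, the dictionary keys, and the example clusters are hypothetical and are not the data behind the table.

```python
from statistics import mean


def robust_summary(clusters, n_genes):
    """Summarise robust clusters: count, coverage, and size statistics."""
    if not clusters:
        raise ValueError("at least one cluster is required")
    sizes = [len(members) for members in clusters]
    return {
        "n_clusters": len(clusters),
        "percent_assigned": 100.0 * sum(sizes) / n_genes,
        "max_size": max(sizes),
        "min_size": min(sizes),
        "mean_size": mean(sizes),
    }


# Hypothetical example: 12 genes, 9 of them placed in three robust clusters.
example_clusters = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]
summary = robust_summary(example_clusters, n_genes=12)
print(summary)
```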

Consensus Clustering


“Full agreement” requirement for robust clusters can be too restrictive

Algorithm for generating consensus clusters given minimum agreement parameter


Approximate stochastic algorithm
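
The slide does not spell out the objective or the search procedure, so the following Python is only a sketch under stated assumptions. It treats the minimum agreement parameter as a threshold, scores a clustering by summing (agreement - threshold) over all within-cluster gene pairs, and improves that score with a simple stochastic hill climb. The scoring rule, the hill climb (swapped in here for whatever stochastic search the authors used), and all names such as `consensus_clusters` are assumptions for illustration.

```python
import random
from itertools import combinations


def score(clusters, agreement, threshold):
    """Sum of (agreement - threshold) over every within-cluster gene pair."""
    return sum(
        agreement[i][j] - threshold
        for members in clusters
        for i, j in combinations(members, 2)
    )


def to_clusters(labels):
    """Convert a label-per-gene list into a list of gene-index clusters."""
    groups = {}
    for gene, label in enumerate(labels):
        groups.setdefault(label, []).append(gene)
    return list(groups.values())


def consensus_clusters(agreement, threshold, iterations=2000, seed=0):
    """Stochastic hill climb: move one random gene, keep non-worsening moves."""
    rng = random.Random(seed)
    n = len(agreement)
    labels = list(range(n))  # start with every gene in its own cluster
    current = score(to_clusters(labels), agreement, threshold)
    for _ in range(iterations):
        gene = rng.randrange(n)
        new_label = rng.randrange(n)
        old_label = labels[gene]
        if new_label == old_label:
            continue
        labels[gene] = new_label
        candidate = score(to_clusters(labels), agreement, threshold)
        if candidate >= current:
            current = candidate
        else:
            labels[gene] = old_label  # undo a move that lowers the score
    # Report only clusters with at least two genes; singletons score zero.
    clusters = [c for c in to_clusters(labels) if len(c) >= 2]
    return clusters, current


# Hypothetical agreement matrix for five genes and three methods.
A = [
    [3, 3, 1, 0, 0],
    [3, 3, 1, 0, 0],
    [1, 1, 3, 2, 1],
    [0, 0, 2, 3, 2],
    [0, 0, 1, 2, 3],
]
clusters, best_score = consensus_clusters(A, threshold=1.5)
print(clusters, best_score)
```

With a threshold below full agreement, pairs that most (but not all) methods co-cluster can still be grouped, which is the relaxation that distinguishes consensus clusters from robust clusters. Because the search is approximate, different seeds may return different clusterings of equal or lower score.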

Consensus Clustering

[Figure: diagram with panels "Input Cluster Results", "Agreement Matrix", and "Consensus Clusters"; four scatter plots of cmdscale(disthhv8)[,1] against cmdscale(disthhv8)[,2]]
Consensus Clustering

ASC Dataset

B-Cell Dataset


Summary


Clustering biological data is very useful

Biases in clustering algorithms can mean that success in identifying patterns varies

Consensus algorithms are used in protein secondary structure prediction

We apply a similar strategy with robust and consensus clustering

Conclusions


Robust clusters good for identifying common transcriptional modules

Also for identifying genes with common functional pathway

Useful for creating clusters of genes with high confidence

Can be restrictive in discarding genes that do not have full agreement.

Conclusions


Consensus clustering relaxes full agreement requirement

Resembles defined clusters in synthetic data very well

Reliably picks out features in the virus gene expression data

Fulfils desire not to rely on one clustering algorithm during gene expression analysis

Acknowledgements


The Biotechnology and Biological Sciences Research Council (BBSRC), UK

The Engineering and Physical Sciences Research Council (EPSRC), UK