Clustering analysis of microarray gene expression data


Ping Zhang


November 19th, 2008





Outline


Gene expression
Similarity between gene expression profiles
Concept of clustering
K-means clustering
Hierarchical clustering
Minimum spanning tree-based clustering


What is a DNA Microarray?

DNA microarray technology measures the expression levels of
tens of thousands of genes at a time

Scanning/Signal Detection

[Figure: scanned two-channel array. Panels: Cy3 channel, Cy5 channel. Legend: equal expression; higher expression in Cy3; higher expression in Cy5.]

Data-flow schema of microarray data analysis


Gene expression profiles

[Figure: gene expression profiles. x-axis: time/condition; y-axis: expression (levels relative to a reference point at 0).]

Similarity between Profiles

Similarity measure:


Euclidean distance


Correlation coefficient


Trend





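The correlation coefficient often works better, because it compares the shapes of scaled profiles rather than absolute expression levels.

[Figure: an expression profile. x-axis: time; y-axis: expression, with a reference level at 0.]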

Pearson Correlation Coefficient


Compares scaled profiles!
Can detect inverse relationships
Most commonly used

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

n = number of conditions
\bar{x} = average expression of gene x in all n conditions
\bar{y} = average expression of gene y in all n conditions
s_x = standard deviation of x
s_y = standard deviation of y
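
The Pearson formula above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `pearson` and the three example profiles are made up for this note, and the standard deviations use the n - 1 divisor so that they match the 1/(n - 1) factor in the formula.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    s_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    # Average product of the scaled (z-score) profiles.
    return sum(((a - mean_x) / s_x) * ((b - mean_y) / s_y)
               for a, b in zip(x, y)) / (n - 1)


# Hypothetical profiles, chosen only to illustrate the two bullet points.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_a, larger scale
gene_c = [4.0, 3.0, 2.0, 1.0]      # inverse of gene_a

r_ab = pearson(gene_a, gene_b)  # scaling does not matter
r_ac = pearson(gene_a, gene_c)  # inverse relationship shows up as a negative r
```

Because each profile is centred and divided by its own standard deviation, `gene_a` and `gene_b` correlate perfectly despite the tenfold difference in level, while the inverse profile gives a correlation at the opposite extreme.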

Correlation Pitfalls

[Figure: line charts of Gene A and Gene B across chips 1 to 7. Panels: Raw Data (y-axis 0 to 120) and Normalized Data (y-axis -1 to 2.5). Correlation = 0.97]

Euclidean Distance


Scaled versus unscaled: the distance changes when profiles are rescaled
Cannot detect inverse relationships

$$ d(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2} $$

for Gene X = (x_1, x_2, …, x_n) and Gene Y = (y_1, y_2, …, y_n)
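
A small Python sketch of the Euclidean distance may help show why scaling matters and why inverse relationships go unnoticed. The function name `euclidean` and the three profiles are hypothetical values chosen for illustration.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical profiles.
gene_x = [1.0, 2.0, 3.0]
gene_y = [2.0, 4.0, 6.0]  # same trend as gene_x at twice the level
gene_z = [3.0, 2.0, 1.0]  # inverse trend of gene_x

d_xy = euclidean(gene_x, gene_y)  # sqrt(1 + 4 + 9)
d_xz = euclidean(gene_x, gene_z)  # sqrt(4 + 0 + 4)
```

In this example the inversely regulated gene ends up closer to `gene_x` than the gene with the same trend, which is the behaviour the two bullet points warn about.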


Data-Mining through Clustering

[Figure labels: Degradation, Synthesis, Chromatin, Glycolysis]

Assumptions for clustering analysis:


The expression level of a gene reflects the gene’s activity.
Genes involved in the same biological process exhibit a statistical relationship in their expression profiles.


Idea of Clustering

Clustering: group objects into clusters so that
o objects in each cluster have “similar” features;
o objects of different clusters have “dissimilar” features

Methods of Clustering


discriminant analysis (Fisher, 1931)
K-means (Lloyd, 1948)
support vector machines (Vapnik, 1985)
self-organizing maps (Kohonen, 1980)
hierarchical clustering

Issues in Cluster Analysis


A lot of clustering algorithms


A lot of distance/similarity metrics


Which clustering algorithm runs faster
and uses less memory?


How many clusters after all?


Are the clusters stable?


Are the clusters meaningful?

Which Clustering Method
Should I Use?


What is the biological question?


Do I have a preconceived notion of
how many clusters there should be?


How strict do I want to be? Split or join?


Can a gene be in multiple clusters?


Hard or soft boundaries between
clusters


K-means clustering for expression profiles

Step 1: Transform the n (genes) * m (experiments) matrix into an n (genes) * n (genes) distance matrix

Step 2: Cluster genes based on a k-means clustering algorithm

Expression matrix (n * m):

          Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

Distance matrix (n * n):

          Gene A  Gene B  Gene C
Gene A    0
Gene B    ?       0
Gene C    ?       ?       0

To transform the n*m matrix into the n*n matrix, use a similarity (distance) metric.
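
Step 1 can be sketched as follows, assuming Euclidean distance as the metric. The helper name `distance_matrix` and the expression values are hypothetical; any other dissimilarity measure could be substituted inside the loop.

```python
import math


def distance_matrix(expression):
    """Turn an n x m expression matrix (list of rows) into an n x n distance matrix."""
    n = len(expression)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(expression[i], expression[j])))
            dist[i][j] = dist[j][i] = d  # the matrix is symmetric
    return dist


# Hypothetical values: 3 genes (rows) by 4 experiments (columns).
expression = [
    [1.0, 2.0, 3.0, 4.0],  # Gene A
    [1.0, 2.0, 3.0, 6.0],  # Gene B
    [4.0, 6.0, 3.0, 4.0],  # Gene C
]
dist = distance_matrix(expression)
```

The "?" entries in the lower triangle of the table correspond to `dist[1][0]`, `dist[2][0]` and `dist[2][1]`.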

K-means algorithm

The most popular algorithm for clustering

What is so attractive?


Simple


Mathematically correct


Fast


Invariant to dimension


Easy to implement

K-Means Clustering

Basic idea: use cluster centres (means) to represent clusters.

Assign each data element to the closest cluster (centre).

Goal: minimize the squared error (intra-class dissimilarity):

$$ \text{squared error} = \sum_i d^2(x_i, C(x_i)) $$

where C(x_i) is the centre of the cluster to which x_i is assigned.

There is no hierarchy.

Must supply the number of clusters (k) into which the data are to be grouped.

K-means Clustering: Procedure (1)

Initialization 1

Specify the number of clusters k, for example k = 4

[Figure: expression matrix (genes by conditions) plotted as points; each point represents a gene]

K-means Clustering: Procedure (2)

Initialization 2

Genes are randomly assigned to one of k clusters, or choose random starting centers

K-means Clustering: Procedure (3)

Calculate the mean of each cluster

$$ m_c = \frac{1}{N_c} \sum_{i=1}^{N_c} g_i $$

[Figure: scatter plot with example points (1,2), (3,2), (3,4), (6,7)]

$$ m_{BLUE} = \frac{1}{4} [(6,7) + (3,4) + \dots] $$



K-means Clustering: Procedure (4)

Each gene is reassigned to the nearest cluster

$$ c = \arg\min_j |g_i - m_j|^2 $$

Gene i is assigned to cluster c

K-means Clustering: Procedure (5)

Iterate until the means converge
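
The five procedure steps can be put together in a short Python sketch. It is an illustration under stated assumptions rather than a reference implementation: the function names are made up, the starting centers are passed in explicitly (instead of being drawn at random) so the run is reproducible, and the last two data points are hypothetical additions to the four points shown on the slide.

```python
def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def nearest(point, centers):
    """Index j of the center minimizing the squared distance |g_i - m_j|^2."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centers[j])))


def k_means(points, centers, max_iter=100):
    """Alternate reassignment and mean updates until the means stop changing."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]
        updated = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous center.
            updated.append(mean_point(members) if members else centers[j])
        if updated == centers:  # converged
            break
        centers = updated
    labels = [nearest(p, centers) for p in points]
    return labels, centers


# (1,2), (3,2), (3,4), (6,7) appear on the slide; (7,8) and (8,7) are hypothetical.
points = [(1, 2), (3, 2), (3, 4), (6, 7), (7, 8), (8, 7)]
labels, centers = k_means(points, centers=[(1, 2), (8, 7)])
```

With these starting centers the first three points form one cluster and the last three the other, and the means stop changing after the second pass.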


Hierarchical clustering (1)

Step 1: Transform the genes * experiments matrix into a genes * genes distance matrix

Step 2: Cluster genes based on the distance matrix and draw a dendrogram until a single node remains

Expression matrix:

          Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

Distance matrix:

          Gene A  Gene B  Gene C
Gene A    0
Gene B    ?       0
Gene C    ?       ?       0
Hierarchical clustering (2)

Initial distance matrix:

        G 1   G 2   G 3   G 4   G 5
G 1     0
G 2     2     0
G 3     6     5     0
G 4     10    9     4     0
G 5     9     8     5     3     0

After merging G 1 and G 2:

        G (12)  G 3   G 4   G 5
G (12)  0
G 3     6       0
G 4     10      4     0
G 5     9       5     3     0

After merging G 4 and G 5:

        G (12)  G 3   G (45)
G (12)  0
G 3     6       0
G (45)  10      5     0

Stage   Groups
P5      [1], [2], [3], [4], [5]
P4      [1 2], [3], [4], [5]
P3      [1 2], [3], [4 5]
P2      [1 2], [3 4 5]
P1      [1 2 3 4 5]

[Figure: dendrogram with leaves 1, 2, 3, 4, 5]
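
The merge sequence for the five-gene example can be reproduced with a short agglomerative sketch. One assumption is worth flagging: the slide does not name the linkage rule, but the merged distances in its tables equal the larger of the two pairwise distances (for example, 6 for G (12) to G 3), which is consistent with complete linkage, so that rule is used here. The function name `complete_linkage` is hypothetical.

```python
def complete_linkage(dist, names):
    """Merge the closest pair of clusters repeatedly; return the grouping at each stage."""
    clusters = [[i] for i in range(len(names))]
    stages = [[[names[i] for i in c] for c in clusters]]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance = largest pairwise distance.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        clusters.sort()
        stages.append([[names[i] for i in c] for c in clusters])
    return stages


# Symmetric version of the 5 x 5 distance matrix from the slide.
D = [
    [0, 2, 6, 10, 9],
    [2, 0, 5, 9, 8],
    [6, 5, 0, 4, 5],
    [10, 9, 4, 0, 3],
    [9, 8, 5, 3, 0],
]
stages = complete_linkage(D, names=[1, 2, 3, 4, 5])
```

`stages[0]` through `stages[4]` correspond to P5 through P1 in the stage table.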

Hierarchical Clustering Results


Graph Representation


Represent a set of n-dimensional points as a graph
o each data point (gene) is represented as a node
o each pair of genes is represented as an edge with a weight defined by the “dissimilarity” between the two genes

[Figure: n-D data points, distance matrix, graph representation]

Distance matrix (first rows shown):

0    1    1.5  2    5    6    7    9
1    0    2    1    6.5  6    8    8
1.5  2    0    1    4    4    6    5.5
.
.
.

Minimum Spanning Tree


Spanning tree: a sub-graph that connects all nodes and has no cycles

Minimum spanning tree (MST): a spanning tree with the minimum total distance

[Figure: panels (a), (b), (c)]

Construction algorithms: Prim’s algorithm and Kruskal’s algorithm

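How to Construct Minimum Spanning Tree

Kruskal’s algorithm
step 1: select the edge with the smallest distance from the graph
step 2: add it to the tree as long as no cycle is formed
step 3: remove the edge from the graph
step 4: repeat steps 1-3 until all nodes are connected in the tree

[Figure: panels (a) to (e). Panel (a) shows an example graph with edge weights including 3, 4, 5, 6, 7 and 8; panels (b) to (e) show the tree after it contains the edges 3; 3, 4; 3, 4, 5; and 3, 4, 5, 7.]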
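
Kruskal's steps can be sketched in Python with a union-find structure for the cycle check. The edge weights below reuse numbers visible in the slide's figure, but the graph topology (which nodes each edge joins) is an assumption made for this example, as is the function name `kruskal`.

```python
def kruskal(num_nodes, edges):
    """Return the MST edges of a connected graph given as (weight, u, v) tuples."""
    parent = list(range(num_nodes))

    def find(u):
        # Root of the component containing u (with path halving).
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for w, u, v in sorted(edges):      # steps 1 and 3: take edges by increasing weight
        root_u, root_v = find(u), find(v)
        if root_u != root_v:           # step 2: skip edges that would close a cycle
            parent[root_u] = root_v
            tree.append((w, u, v))
            if len(tree) == num_nodes - 1:  # step 4: all nodes are connected
                break
    return tree


# Hypothetical 5-node graph; only the weights are taken from the figure.
edges = [(4, 0, 1), (6, 0, 2), (5, 1, 2), (3, 2, 3), (7, 3, 4), (8, 1, 4)]
tree = kruskal(5, edges)
tree_weights = [w for w, _, _ in tree]
```

On this graph the edge of weight 6 is skipped because nodes 0 and 2 are already connected through node 1.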


Foundation of MST Approach

We have mathematically proved:

A multi-dimensional clustering problem is equivalent to a tree-partitioning problem!

The MST representation significantly simplifies the data clustering problem, while losing very little essential information for clustering.

Clustering by Cutting Long Edges

1. Hierarchical cutting
1st cut: longest edge
2nd cut: second longest edge
Works well for “easy” cases.
Produces many single-element clusters for some “difficult” cases.

2. Tree-Based Clustering
For each edge, calculate the assessment value
Find the edge that gives the minimum assessment value as the place to cut
Clustering using an iterative method
Guaranteed to find the global optimum using tree-based dynamic programming

Clustering through Removing Long MST-Edges

Objective: partition an MST into K subtrees so that the total edge-distance of all the K subtrees is minimized

Finding the K-1 longest MST-edges and cutting them => we get K clusters

This works as long as the inter-cluster edge-distances are clearly larger than the intra-cluster edge-distances
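
Cutting the K-1 longest edges amounts to dropping them from the tree and reading off the connected components. The sketch below assumes the MST is already available as a list of (weight, u, v) edges; the function name `cut_longest_edges` and the example tree are hypothetical.

```python
def cut_longest_edges(num_nodes, mst_edges, k):
    """Remove the k - 1 longest MST edges and return the resulting clusters."""
    if not 1 <= k <= num_nodes:
        raise ValueError("k must be between 1 and the number of nodes")
    ordered = sorted(mst_edges)
    kept = ordered[: len(ordered) - (k - 1)]  # everything except the k - 1 longest
    parent = list(range(num_nodes))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for _, u, v in kept:
        parent[find(u)] = find(v)  # nodes joined by a kept edge share a cluster
    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Hypothetical MST on 6 nodes: two tight groups joined by one long edge.
mst = [(1.0, 0, 1), (1.5, 1, 2), (9.0, 2, 3), (2.0, 3, 4), (1.0, 4, 5)]
two_clusters = cut_longest_edges(6, mst, k=2)
three_clusters = cut_longest_edges(6, mst, k=3)
```

With k = 2 only the long inter-cluster edge is cut. With k = 3 the next-longest edge goes too, which isolates node 3 and illustrates how this rule can produce single-element clusters.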

An Iterative Clustering Algorithm

Find K subtrees T_i of an MST that minimize:

$$ \sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, \mathrm{center}(T_i)) $$

Informally, the total distance between the center of each cluster and its data points is minimized

The center c of a cluster C is defined as the point for which the sum of the distances between c and all the data points in C is minimized

Does not work well if the cluster boundary is not convex

A Globally Optimal Clustering Algorithm

Given an MST T, partition T into K subtrees T_i and find a set of data points d_i, i = 1…K, d_i in D, that minimize:

$$ \sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, d_i) $$

Informally, group data points around the “best” representatives rather than around the “center”

Dynamic programming is used for this algorithm

Automated Selection of Number of Clusters

Select the “transition point” in the assessment value as the “correct” number of clusters.

Transition Profiles

indicator[n] = (A[n-1] - A[n]) / (A[n] - A[n+1])

A[k] is the assessment value for the partition with k clusters
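
The indicator can be computed directly from a table of assessment values. In the sketch below the values in `A` are hypothetical, and picking the n with the largest indicator as the transition point is an assumption made for illustration; the slide only says to select the transition point.

```python
def transition_indicator(assessment):
    """indicator[n] = (A[n-1] - A[n]) / (A[n] - A[n+1]) wherever both neighbours exist."""
    indicator = {}
    for n in sorted(assessment):
        if n - 1 in assessment and n + 1 in assessment:
            drop_before = assessment[n - 1] - assessment[n]
            drop_after = assessment[n] - assessment[n + 1]
            if drop_after != 0:  # skip undefined ratios
                indicator[n] = drop_before / drop_after
    return indicator


# Hypothetical assessment values A[k] for k = 1..6 clusters.
A = {1: 100.0, 2: 60.0, 3: 20.0, 4: 16.0, 5: 14.0, 6: 13.0}
indicator = transition_indicator(A)

# Assumption: treat the n with the largest indicator as the transition point.
best_k = max(indicator, key=indicator.get)
```

A large indicator at n means the assessment value dropped sharply on reaching n clusters and only slightly afterwards, which is what the hypothetical values show at n = 3.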


Our clustering of yeast data

Reference


[1] Ying Xu, Victor Olman, and Dong Xu. Clustering Gene Expression Data Using a Graph-Theoretic Approach: An Application of Minimum Spanning Trees. Bioinformatics. 18:526-535, 2002.

[2] Dong Xu, Victor Olman, Li Wang, and Ying Xu. EXCAVATOR: a computer program for gene expression data analysis. Nucleic Acids Research. 31:5582-5589, 2003.

Acknowledgement

Using slides from:
Michael Hongbo Xie, Temple University (in 2006)
Vipin Kumar, University of Minnesota
Dong Xu, University of Missouri