# Microarray Data Analysis

Nov 25, 2013

Data preprocessing and visualization

Supervised learning

Machine learning approaches

Unsupervised learning

Clustering and pattern detection

Gene regulatory region prediction based on co-regulated genes

Linkage between gene expression data and gene
sequence/function databases

Unsupervised learning

Supervised methods

Can only validate or reject hypotheses

Cannot lead to discovery of unexpected partitions

Unsupervised learning

No prior knowledge is used

Explore structure of data on the basis of correlations and similarities

DEFINITION OF THE CLUSTERING PROBLEM

Eytan Domany

CLUSTER ANALYSIS YIELDS DENDROGRAM

[Figure: dendrogram; axis label: T (resolution)]

Centroid methods

K-means

Data points at $X_i$, $i = 1, \dots, N$

Centroids at $Y_\nu$, $\nu = 1, \dots, K$

Assign data point $i$ to centroid $\nu$: $S_i = \nu$

Cost E:

$$E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} \sum_{\nu=1}^{K} \delta(S_i, \nu)\,(X_i - Y_\nu)^2$$

Minimize $E$ over $S_i$, $Y_\nu$
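As a rough illustration of the cost E defined above, the sketch below evaluates it for a given assignment of points to centroids. The function name `kmeans_cost` and the toy points are assumptions made for this example; they do not come from the slides.

```python
def kmeans_cost(points, assignments, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for x, s in zip(points, assignments):
        y = centroids[s]  # centroid that point x is assigned to
        total += sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return total


# Hypothetical toy data: four 2-D points and two centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

good = kmeans_cost(points, [0, 0, 1, 1], centroids)  # each point with its nearby centroid
bad = kmeans_cost(points, [1, 1, 0, 0], centroids)   # each point with the far centroid
print(good, bad)
```

A sensible assignment gives a much lower cost than a poor one, which is what minimizing E over the assignments exploits.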

K-means

“Guess” K = 3

Iteration 0: start from initial positions of the centroids.

Iteration 1: assign each data point to the closest centroid.

Iteration 2: move the centroids to the center of their assigned points.

Iteration 3: repeat the assignment and update steps; iterate until the cost is minimal.
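The iteration described above can be sketched in a few lines of pure Python. This is a minimal version under stated assumptions: the initial centroid positions are passed in explicitly (the parameter `initial_centroids` and the toy data are hypothetical), distances are squared Euclidean, and the loop stops once the assignments no longer change.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate the assignment and centroid-update steps until assignments stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the cost can no longer decrease
        assignments = new_assignments
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, s in zip(points, assignments) if s == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids


# Hypothetical toy data: three well-separated groups, so we "guess" K = 3.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
assignments, centroids = kmeans(points, [(0.0, 0.0), (10.0, 0.0), (20.0, 5.0)])
print(assignments)
```

With a different initial guess the same data can settle into a different partition, so in practice the procedure is often repeated from several starting positions.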

K-means: Summary

Fast algorithm: only computes distances from data points to centroids

Result depends on the initial positions of the centroids

Must preset K

Fails for “non-spherical” distributions

Agglomerative Hierarchical Clustering

[Figure: five data points, labeled 1 to 5, and their dendrogram; axis label: distance between joined clusters]

Initially, each point is its own cluster.

At each step, merge the pair of nearest clusters.

The dendrogram induces a linear ordering of the data points.

Need to define the distance between the new cluster and the other clusters:

Single linkage: distance between closest pair.

Complete linkage: distance between farthest pair.

Average linkage: average distance between all pairs, or distance between cluster centers.
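To make the merge procedure and the choice of cluster distance concrete, here is a naive sketch of agglomerative clustering in pure Python. The function names, the `linkage` parameter values, and the 1-D toy points are assumptions for illustration; the search over all cluster pairs is deliberately simple rather than efficient.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters, each given as a tuple of point indices."""
    pair_distances = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # closest pair
        return min(pair_distances)
    if linkage == "complete":    # farthest pair
        return max(pair_distances)
    if linkage == "average":     # average over all pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, linkage="single"):
    """Return the list of merges as (cluster_a, cluster_b, distance) tuples."""
    clusters = [(i,) for i in range(len(points))]  # initially each point is a cluster
    merges = []
    while len(clusters) > 1:
        # Greedy step: find and merge the pair of nearest clusters.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], points, linkage),
        )
        height = cluster_distance(a, b, points, linkage)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges


# Hypothetical 1-D toy data.
points = [(0.0,), (1.0,), (3.0,), (7.0,)]
single_heights = [h for _, _, h in agglomerate(points, "single")]
complete_heights = [h for _, _, h in agglomerate(points, "complete")]
print(single_heights, complete_heights)
```

The merge heights are the values a dendrogram would plot as the distance between joined clusters, and they differ between linkage rules even on the same data.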

Hierarchical Clustering: Summary

Results depend on distance update method

Greedy iterative process

NOT robust against noise

No inherent measure to identify stable clusters

The most widely used clustering method in gene expression analysis

[Figure: example from Nature, 2002, breast cancer]

Heat map

Cluster both genes and samples

Samples should cluster together based on experimental design

Often a way to catch labelling errors or heterogeneity in samples

Epinephrine Treated Rat Fibroblast Cell

| ID | Probe | 1h | 5h | 10h | 18h | 24h |
|---|---|---|---|---|---|---|
| 1 | D21869_s_at | 25.7 | 55.0 | 170.7 | 305.5 | 807.9 |
| 2 | D25233_at | 705.2 | 578.2 | 629.2 | 641.7 | 795.3 |
| 3 | D25543_at | 2148.7 | 1303.0 | 915.5 | 149.2 | 96.3 |
| 4 | L03294_g_at | 241.8 | 421.5 | 577.2 | 866.1 | 2107.3 |
| 5 | J03960_at | 774.5 | 439.8 | 314.3 | 256.1 | 44.4 |
| 6 | M81855_at | 1487.6 | 1283.7 | 1372.1 | 1469.1 | 1611.7 |
| 7 | L14936_at | 1212.6 | 1848.5 | 2436.2 | 3260.5 | 4650.9 |
| 8 | L19998_at | 767.9 | 290.8 | 300.2 | 129.4 | 51.5 |
| 9 | AB017912_at | 1813.7 | 3520.6 | 4404.3 | 6853.1 | 9039.4 |
| 10 | M32855_at | 234.1 | 23.1 | 789.4 | 312.7 | 67.8 |

Heat map

Correlation coefficient

Normalized across each gene
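The slides do not say which normalization is meant. One common reading, assumed here, is a z-score per gene: subtract the gene's mean across time points and divide by its standard deviation. The function name `normalize_gene` is hypothetical; the profile is probe D21869_s_at from the epinephrine table.

```python
from statistics import mean, pstdev


def normalize_gene(values):
    """Z-score one gene's expression profile across its samples (assumed normalization)."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    if s == 0:
        return [0.0 for _ in values]  # flat profile: nothing to scale
    return [(v - m) / s for v in values]


profile = [25.7, 55.0, 170.7, 305.5, 807.9]  # D21869_s_at at 1h, 5h, 10h, 18h, 24h
z = normalize_gene(profile)
print([round(v, 2) for v in z])
```

After this step every gene has mean 0 and unit spread, so a heat map shows the shape of each profile rather than its absolute expression level.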

Distance Issues

Euclidean distance

Pearson distance

[Figure: expression levels (0 to 400) of gene1 to gene4 (g1 to g4) at time0 to time3]
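The difference between the two distances can be seen with a small sketch. It assumes the common convention that Pearson distance is 1 minus the Pearson correlation coefficient, which the slides do not spell out, and it uses made-up profiles rather than the values from the figure.

```python
import math
from statistics import mean


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_distance(a, b):
    """1 - Pearson correlation coefficient (assumed definition of Pearson distance)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)


# Hypothetical profiles over four time points.
g1 = [10.0, 20.0, 30.0, 40.0]      # rising, low expression
g2 = [110.0, 220.0, 330.0, 440.0]  # same shape as g1, much higher expression
g3 = [40.0, 30.0, 20.0, 10.0]      # opposite trend, same range as g1

print(euclidean_distance(g1, g2), euclidean_distance(g1, g3))
print(pearson_distance(g1, g2), pearson_distance(g1, g3))
```

Euclidean distance places g1 closer to g3 because their magnitudes are similar, while Pearson distance places g1 closer to g2 because their shapes match. Which behavior is wanted depends on the biological question.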
Exercise

Algorithm and Manhattan distance.

| Gene ID | Exp1 | Exp2 |
|---|---|---|
| 1 | 45 | 55 |
| 2 | 55 | 78 |
| 3 | 148 | 1303 |
| 4 | 241 | 765 |
| 5 | 774 | 439 |
| 6 | 607 | 383 |
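As a starting point for the exercise, the sketch below computes the Manhattan (city-block) distance between every pair of genes in the table and reports the closest pair. The exercise statement is incomplete in this transcript, so this only shows the distance computation, not a full solution; the variable and function names are illustrative.

```python
from itertools import combinations

# Exercise data: gene ID -> (Exp1, Exp2)
genes = {
    1: (45, 55),
    2: (55, 78),
    3: (148, 1303),
    4: (241, 765),
    5: (774, 439),
    6: (607, 383),
}


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Distance for every unordered pair of genes.
pairwise = {
    (i, j): manhattan(genes[i], genes[j])
    for i, j in combinations(sorted(genes), 2)
}
closest_pair = min(pairwise, key=pairwise.get)
print(closest_pair, pairwise[closest_pair])
```

An agglomerative method would merge this closest pair first and then recompute distances according to the chosen linkage rule.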


Issues in Cluster Analysis

A lot of clustering algorithms

A lot of distance/similarity metrics

Which clustering algorithm runs faster and uses
less memory?

How many clusters after all?

Are the clusters stable?

Are the clusters meaningful?

Which Clustering Method
Should I Use?

What is the biological question?

Do I have a preconceived notion of how many
clusters there should be?

How strict do I want to be? Split or join?

Can a gene be in multiple clusters?

Hard or soft boundaries between clusters

The End

Thank you for taking this course. Bioinformatics is a very
diverse and fascinating subject. We hope you all decide to