Microarray Data Analysis

25 Nov 2013

Microarray Data Analysis


Data preprocessing and visualization


Supervised learning


Machine learning approaches


Unsupervised learning


Clustering and pattern detection


Gene regulatory region prediction based on co-regulated genes


Linkage between gene expression data and gene sequence/function databases




Unsupervised learning


Supervised methods


Can only validate or reject hypotheses


Cannot lead to discovery of unexpected partitions


Unsupervised learning


No prior knowledge is used


Explore structure of data on the basis of correlations and similarities

DEFINITION OF THE CLUSTERING PROBLEM

Eytan Domany

CLUSTER ANALYSIS YIELDS DENDROGRAM

[Figure: dendrogram; axis T (resolution)]

BUT WHAT ABOUT THE OKAPI?

Centroid methods

K-means

Data points at $X_i$, $i = 1, \ldots, N$

Centroids at $Y_\nu$, $\nu = 1, \ldots, K$

Assign data point $i$ to centroid $\nu$: $S_i = \nu$

Cost E:

$$E(S_1, S_2, \ldots, S_N; Y_1, \ldots, Y_K) = \sum_{i=1}^{N} \sum_{\nu=1}^{K} \delta(S_i, \nu)\,(X_i - Y_\nu)^2$$

Minimize $E$ over $S_i$, $Y_\nu$
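Minimizing the K-means cost over the assignments and the centroids in turn leads to the two alternating steps of the algorithm. The derivation below is a sketch under stated assumptions: $\delta$ is taken to be the Kronecker delta, $\nu$ indexes the centroids, and $N_\nu$ (the number of points assigned to centroid $\nu$) is notation introduced here rather than taken from the slides.

```latex
\begin{align*}
E &= \sum_{i=1}^{N}\sum_{\nu=1}^{K} \delta(S_i,\nu)\,(X_i - Y_\nu)^2 \\[4pt]
\text{fixed } Y:\quad S_i &= \arg\min_{\nu}\,(X_i - Y_\nu)^2 \\[4pt]
\text{fixed } S:\quad \frac{\partial E}{\partial Y_\nu}
  &= -2\sum_{i=1}^{N}\delta(S_i,\nu)\,(X_i - Y_\nu) = 0
  \;\Longrightarrow\;
  Y_\nu = \frac{1}{N_\nu}\sum_{i=1}^{N}\delta(S_i,\nu)\,X_i,
  \qquad N_\nu = \sum_{i=1}^{N}\delta(S_i,\nu)
\end{align*}
```

Each step can only lower $E$ or leave it unchanged, so alternating them converges, though possibly to a local rather than a global minimum.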

K-means

"Guess" K = 3

Iteration = 0: Start with random positions of centroids.

Iteration = 1: Assign each data point to closest centroid.

Iteration = 2: Move centroids to center of assigned points.

Iteration = 3: Iterate till minimal cost.
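The K-means steps on these slides (random start, assign, move, iterate) can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecturer's implementation: the function names, the choice of initial centroids as randomly picked data points, the stopping rule (assignments no longer change) and the toy data are all assumptions made here.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centroids, cost) after iterating assign/move steps."""
    rng = random.Random(seed)
    # Start with random positions of centroids (assumption: k random data points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each data point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: cost cannot decrease further
            break
        labels = new_labels
        # Move each centroid to the center (mean) of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    cost = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, cost


# Made-up toy data: three loose groups in the plane.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
          (9.0, 1.0), (9.1, 0.8), (8.9, 1.2)]
labels, centroids, cost = kmeans(points, k=3)
print(labels, cost)
```

Changing `seed` changes the starting centroids and may change the final partition.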


K-means - Summary

Fast algorithm: compute distances from data points to centroids

Result depends on initial centroids' position

Must preset K

Fails for "non-spherical" distributions
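The dependence on the initial centroid positions can be seen on a tiny made-up example: four points at the corners of a 4 x 1 rectangle, clustered with K = 2 from two different starting positions. The helper `run_kmeans` and the data are hypothetical, written only for this illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def run_kmeans(points, initial_centroids, max_iter=100):
    """K-means from explicitly given starting centroids; returns (labels, cost)."""
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: closest centroid for every point.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: centroid becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    cost = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, cost


points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start A: one centroid near each short side of the rectangle.
labels_a, cost_a = run_kmeans(points, [(0.0, 0.5), (4.0, 0.5)])
# Start B: one centroid on each long side of the rectangle.
labels_b, cost_b = run_kmeans(points, [(2.0, 0.0), (2.0, 1.0)])

print(labels_a, cost_a)  # [0, 0, 1, 1] 1.0  (left/right split)
print(labels_b, cost_b)  # [0, 1, 0, 1] 16.0 (bottom/top split, a poorer local minimum)
```

Both runs stop at a configuration where no assignment changes, yet the costs differ by a factor of 16. A common remedy is to run the algorithm from several starting positions and keep the lowest-cost result.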

Agglomerative Hierarchical Clustering

[Figure: data points 1-5 and their dendrogram; axis: distance between joined clusters]

Initially, each point = cluster.

At each step, merge the pair of nearest clusters.

The dendrogram induces a linear ordering of the data points.

Need to define the distance between the new cluster and the other clusters.

Single Linkage: distance between closest pair.

Complete Linkage: distance between farthest pair.

Average Linkage: average distance between all pairs, or distance between cluster centers.
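The three linkage rules differ only in how the pairwise point distances between two clusters are summarized. A minimal sketch, assuming Euclidean distance between points and using made-up clusters; the function names are chosen here for illustration.

```python
from itertools import product


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def single_linkage(a, b, dist=euclidean):
    """Distance between the closest pair, one point from each cluster."""
    return min(dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b, dist=euclidean):
    """Distance between the farthest pair, one point from each cluster."""
    return max(dist(p, q) for p, q in product(a, b))


def average_linkage(a, b, dist=euclidean):
    """Average distance over all pairs, one point from each cluster."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


# Made-up clusters for illustration.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

d_single = single_linkage(cluster_a, cluster_b)
d_complete = complete_linkage(cluster_a, cluster_b)
d_average = average_linkage(cluster_a, cluster_b)
print(d_single, d_average, d_complete)
```

For any pair of clusters the single-linkage distance is the smallest of the three and the complete-linkage distance the largest, which is one reason the resulting dendrograms can differ.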

Hierarchical Clustering - Summary


Results depend on distance update method


Greedy iterative process


NOT robust against noise


No inherent measure to identify stable clusters


Average Linkage: the most widely used clustering method in gene expression analysis


[Figure: Nature 2002, breast cancer]

Heat map

Cluster both genes and samples

Samples should cluster together based on experimental design

Often a way to catch labelling errors or heterogeneity in samples

Epinephrine Treated Rat Fibroblast Cell

ID  Probe        1h      5h      10h     18h     24h
1   D21869_s_at  25.7    55.0    170.7   305.5   807.9
2   D25233_at    705.2   578.2   629.2   641.7   795.3
3   D25543_at    2148.7  1303.0  915.5   149.2   96.3
4   L03294_g_at  241.8   421.5   577.2   866.1   2107.3
5   J03960_at    774.5   439.8   314.3   256.1   44.4
6   M81855_at    1487.6  1283.7  1372.1  1469.1  1611.7
7   L14936_at    1212.6  1848.5  2436.2  3260.5  4650.9
8   L19998_at    767.9   290.8   300.2   129.4   51.5
9   AB017912_at  1813.7  3520.6  4404.3  6853.1  9039.4
10  M32855_at    234.1   23.1    789.4   312.7   67.8

Heat map

Correlation coefficient

Normalized across each gene
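One common reading of "normalized across each gene" is to rescale every gene's time course to mean 0 and standard deviation 1, so that the heat map colors reflect the shape of the profile rather than its absolute level. The slides do not say which normalization was used, so the z-score below is an assumption; the values are three probes from the rat fibroblast table.

```python
from statistics import mean, pstdev

# Three probes from the epinephrine-treated rat fibroblast table (1h, 5h, 10h, 18h, 24h).
expression = {
    "D21869_s_at": [25.7, 55.0, 170.7, 305.5, 807.9],         # rising
    "D25543_at": [2148.7, 1303.0, 915.5, 149.2, 96.3],        # falling
    "M81855_at": [1487.6, 1283.7, 1372.1, 1469.1, 1611.7],    # roughly flat
}


def normalize_gene(values):
    """Z-score one gene's profile: subtract its mean, divide by its standard deviation."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation (an assumed convention)
    return [(v - m) / s for v in values]


normalized = {probe: normalize_gene(values) for probe, values in expression.items()}

for probe, row in normalized.items():
    print(probe, [round(v, 2) for v in row])
```

After this step the rising and falling probes span a similar numeric range even though their raw intensities differ by an order of magnitude.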

Distance Issues


Euclidean distance



Pearson distance

[Figure: expression profiles of gene1-gene4 (g1-g4) at time0-time3; value axis 0 to 400]
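The choice between Euclidean and Pearson distance matters because they answer different questions: Euclidean distance compares absolute expression levels, while a correlation-based distance compares profile shapes. The sketch below assumes the usual definition of Pearson distance as 1 minus the Pearson correlation coefficient (the slides do not spell it out) and uses three probes from the rat fibroblast table.

```python
from math import sqrt


def euclidean_distance(x, y):
    """Square root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for identical shapes, 2 for opposite shapes."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


rising_low = [25.7, 55.0, 170.7, 305.5, 807.9]           # D21869_s_at
rising_high = [1813.7, 3520.6, 4404.3, 6853.1, 9039.4]   # AB017912_at
falling = [774.5, 439.8, 314.3, 256.1, 44.4]             # J03960_at

print("Euclidean:", euclidean_distance(rising_low, rising_high),
      euclidean_distance(rising_low, falling))
print("Pearson:  ", pearson_distance(rising_low, rising_high),
      pearson_distance(rising_low, falling))
```

By Euclidean distance the low-intensity rising probe is closer to the falling probe, because both have small values; by Pearson distance it is closer to the other rising probe, because both increase over time.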
Exercise

Use the Average Linkage algorithm and Manhattan distance.

Gene ID  Exp1  Exp2
1        45    55
2        55    78
3        148   1303
4        241   765
5        774   439
6        607   383
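One way to check a hand-worked answer to the exercise is a short script that merges clusters greedily. This is a sketch, not an official solution: it takes "average linkage" to mean the average Manhattan distance over all pairs of genes from the two clusters, and the function names are chosen here for illustration.

```python
from itertools import combinations, product

# Exercise data: gene ID -> (Exp1, Exp2).
genes = {1: (45, 55), 2: (55, 78), 3: (148, 1303),
         4: (241, 765), 5: (774, 439), 6: (607, 383)}


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def average_linkage(a, b):
    """Average Manhattan distance over all gene pairs, one from each cluster."""
    total = sum(manhattan(genes[i], genes[j]) for i, j in product(a, b))
    return total / (len(a) * len(b))


def agglomerate():
    """Merge the nearest pair of clusters until one remains; return the merges."""
    clusters = [(g,) for g in genes]  # initially each gene is its own cluster
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average_linkage(*pair))
        merges.append((a, b, average_linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


merges = agglomerate()
for a, b, height in merges:
    print(a, "+", b, "at average Manhattan distance", height)
# Merge heights: 33.0, 223.0, 631.0, 985.0, 1115.5
```

The merge heights are the values at which the branches of the dendrogram join: genes 1 and 2 first, then 5 and 6, then 3 and 4, then {1, 2} with {5, 6}, and finally everything together.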

Issues in Cluster Analysis


A lot of clustering algorithms


A lot of distance/similarity metrics


Which clustering algorithm runs faster and uses less memory?


How many clusters after all?


Are the clusters stable?


Are the clusters meaningful?

Which Clustering Method Should I Use?


What is the biological question?


Do I have a preconceived notion of how many clusters there should be?

How strict do I want to be? Split or join?


Can a gene be in multiple clusters?


Hard or soft boundaries between clusters

The End


Thank you for taking this course. Bioinformatics is a very diverse and fascinating subject. We hope you all decide to continue your pursuit of it.

We will be very glad to answer your emails or schedule appointments to talk about any bioinformatics-related questions you might have.

We wish you all a wonderful summer break!