Microarray Data Analysis
Data preprocessing and visualization
Supervised learning
Machine learning approaches
Unsupervised learning
Clustering and pattern detection
Gene regulatory region prediction based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases
…
Unsupervised learning
Supervised methods
Can only validate or reject hypotheses
Cannot lead to discovery of unexpected partitions
Unsupervised learning
No prior knowledge is used
Explore structure of data on the basis of correlations and similarities
DEFINITION OF THE CLUSTERING PROBLEM
Eytan Domany
CLUSTER ANALYSIS YIELDS DENDROGRAM
[Figure: dendrogram; vertical axis T (resolution)]
BUT WHAT ABOUT THE OKAPI?
Centroid methods: K-means
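Data points at $X_i$, $i = 1, \dots, N$
Centroids at $Y_\mu$, $\mu = 1, \dots, K$
Assign data point $i$ to centroid $\mu$: $S_i = \mu$
Cost E:
$E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} \sum_{\mu=1}^{K} \delta(S_i, \mu)\,(X_i - Y_\mu)^2$
where $\delta$ is the Kronecker delta.
Minimize $E$ over $S_i$, $Y_\mu$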
K-means
“Guess” K = 3
K-means
Iteration = 0
Start with random positions of centroids.
K-means
Iteration = 1
Start with random positions of centroids.
Assign each data point to closest centroid.
K-means
Iteration = 2
Start with random positions of centroids.
Assign each data point to closest centroid.
Move centroids to center of assigned points.
K-means
Iteration = 3
Start with random positions of centroids.
Assign each data point to closest centroid.
Move centroids to center of assigned points.
Iterate till minimal cost.
K-means: Summary
Fast algorithm: compute distances from data points to centroids
Result depends on initial centroids’ position
Must preset K
Fails for “non-spherical” distributions
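The K-means iteration and the caveat about initial centroid positions can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecturer's implementation: the function names, the four toy points, and the two hand-picked starting configurations are all assumptions made for the example. The cost is the sum of squared distances from each point to its assigned centroid, as in the cost E defined earlier.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Assign each data point to the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the center (mean) of its assigned points."""
    moved = []
    for k, old in enumerate(centroids):
        members = [p for p, s in zip(points, labels) if s == k]
        if members:
            moved.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            moved.append(old)  # assumption: an empty cluster keeps its centroid
    return moved

def cost(points, labels, centroids):
    """Cost E: sum of squared distances to the assigned centroids."""
    return sum(squared_distance(p, centroids[s]) for p, s in zip(points, labels))

def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids, cost(points, labels, centroids)

# Toy data (hypothetical): two tight pairs of points, far apart along x.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

good = kmeans(points, [(0.0, 0.0), (4.0, 0.0)])  # starts near the two pairs
bad = kmeans(points, [(2.0, 0.0), (2.0, 1.0)])   # starts between the pairs

print("good start:", good[0], "cost", good[2])  # cost 1.0
print("bad start: ", bad[0], "cost", bad[2])    # cost 16.0
```

Both runs converge, but the second one is stuck in a local minimum that splits the data by y instead of by x. That is why K-means is usually restarted from several initial positions, keeping the run with the lowest cost.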
Agglomerative Hierarchical Clustering
Initially, each point is its own cluster.
At each step, merge the pair of nearest clusters.
[Figure: five labeled data points and their dendrogram (leaf order 3, 1, 4, 2, 5); vertical axis: distance between joined clusters]
The dendrogram induces a linear ordering of the data points.
Need to define the distance between the new cluster and the other clusters.
Single Linkage: distance between closest pair.
Complete Linkage: distance between farthest pair.
Average Linkage: average distance between all pairs, or distance between cluster centers.
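The linkage definitions above differ only in how the pairwise distances between two clusters are summarized. The sketch below computes each of them for two small clusters. The clusters, the function names, and the use of Euclidean distance are assumptions for illustration; any other point-to-point distance could be substituted.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+

def single_linkage(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(dist(p, q) for p, q in product(a, b))

def average_linkage(a, b):
    """Average distance over all pairs of points, one from each cluster."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def center_distance(a, b):
    """Distance between the cluster centers (means)."""
    center_a = [sum(c) / len(a) for c in zip(*a)]
    center_b = [sum(c) / len(b) for c in zip(*b)]
    return dist(center_a, center_b)

# Hypothetical clusters chosen so the pairwise distances are 8 and 10.
cluster_a = [(0.0, 0.0), (0.0, 6.0)]
cluster_b = [(8.0, 0.0)]

print("single:  ", single_linkage(cluster_a, cluster_b))    # 8.0
print("complete:", complete_linkage(cluster_a, cluster_b))  # 10.0
print("average: ", average_linkage(cluster_a, cluster_b))   # 9.0
print("centers: ", center_distance(cluster_a, cluster_b))
```

Note that the average over all pairs and the distance between cluster centers are generally not the same number, so the two variants of average linkage can produce different dendrograms.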
Hierarchical Clustering: Summary
Results depend on distance update method
Greedy iterative process
NOT robust against noise
No inherent measure to identify stable clusters
Average Linkage: the most widely used clustering method in gene expression analysis
[Figure: example from Nature, 2002, breast cancer]
Heat map
Cluster both genes and samples
Samples should cluster together based on experimental design
Often a way to catch labelling errors or heterogeneity in samples
Epinephrine Treated Rat Fibroblast Cell
ID  Probe         1h      5h      10h     18h     24h
1   D21869_s_at   25.7    55.0    170.7   305.5   807.9
2   D25233_at     705.2   578.2   629.2   641.7   795.3
3   D25543_at     2148.7  1303.0  915.5   149.2   96.3
4   L03294_g_at   241.8   421.5   577.2   866.1   2107.3
5   J03960_at     774.5   439.8   314.3   256.1   44.4
6   M81855_at     1487.6  1283.7  1372.1  1469.1  1611.7
7   L14936_at     1212.6  1848.5  2436.2  3260.5  4650.9
8   L19998_at     767.9   290.8   300.2   129.4   51.5
9   AB017912_at   1813.7  3520.6  4404.3  6853.1  9039.4
10  M32855_at     234.1   23.1    789.4   312.7   67.8
Heat map
Correlation coefficient
Normalized across each gene
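The slide does not say which normalization is meant. One common reading of "normalized across each gene" is to standardize every gene's profile to mean 0 and standard deviation 1 before drawing the heat map, so that genes with very different expression levels become comparable. The sketch below applies that assumed z-score normalization to two probes from the table above; the function name `standardize` is hypothetical.

```python
from statistics import mean, pstdev

def standardize(profile):
    """Z-score one gene's profile: subtract its mean, divide by its std dev."""
    m = mean(profile)
    s = pstdev(profile)
    if s == 0:
        return [0.0 for _ in profile]  # a flat profile carries no pattern
    return [(x - m) / s for x in profile]

# Two rising probes from the table (1h, 5h, 10h, 18h, 24h).
profiles = {
    "D21869_s_at": [25.7, 55.0, 170.7, 305.5, 807.9],
    "AB017912_at": [1813.7, 3520.6, 4404.3, 6853.1, 9039.4],
}

normalized = {probe: standardize(values) for probe, values in profiles.items()}
for probe, z in normalized.items():
    print(probe, [round(v, 2) for v in z])
```

After this step both probes show the same rising pattern on a common scale, even though their raw values differ by more than an order of magnitude.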
Distance Issues
Euclidean distance
Pearson distance
[Figure: expression profiles of gene1 to gene4 (g1 to g4) across time0 to time3; vertical axis from 0 to 400]
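The difference between the two distances can be seen on three probes from the epinephrine table. Euclidean distance compares absolute expression levels, while Pearson distance compares the shape of the profile. The sketch below assumes the common convention that Pearson distance is 1 minus the Pearson correlation coefficient; the variable and function names are illustrative.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation (assumed convention for 'Pearson distance')."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / sqrt(var_a * var_b)

# Profiles from the table (1h, 5h, 10h, 18h, 24h).
falling_low = [774.5, 439.8, 314.3, 256.1, 44.4]      # J03960_at
falling_high = [2148.7, 1303.0, 915.5, 149.2, 96.3]   # D25543_at
roughly_flat = [705.2, 578.2, 629.2, 641.7, 795.3]    # D25233_at

print("Euclidean, falling_low vs roughly_flat:", euclidean(falling_low, roughly_flat))
print("Euclidean, falling_low vs falling_high:", euclidean(falling_low, falling_high))
print("Pearson,   falling_low vs roughly_flat:", pearson_distance(falling_low, roughly_flat))
print("Pearson,   falling_low vs falling_high:", pearson_distance(falling_low, falling_high))
```

Under Euclidean distance, J03960_at is closer to the roughly flat probe because their magnitudes are similar. Under Pearson distance it is closer to D25543_at, which falls over time in the same way at a higher level. The choice of distance therefore changes which genes cluster together.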
Exercise
Use the Average Linkage algorithm and Manhattan distance.
Gene ID  Exp1  Exp2
1        45    55
2        55    78
3        148   1303
4        241   765
5        774   439
6        607   383
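One way to check a hand-worked answer to this exercise is a short script. The sketch below is an assumed, brute-force implementation of agglomerative clustering: it uses Manhattan distance between genes and average linkage (the mean over all cross-cluster pairs) between clusters, and it records each merge. The function names are hypothetical, and the approach recomputes all distances at every step, which is fine for six genes but not for real data sets.

```python
from itertools import combinations, product

# Gene ID -> (Exp1, Exp2), copied from the exercise table.
genes = {
    1: (45, 55),
    2: (55, 78),
    3: (148, 1303),
    4: (241, 765),
    5: (774, 439),
    6: (607, 383),
}

def manhattan(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def average_linkage(c1, c2):
    """Mean Manhattan distance over all pairs, one gene from each cluster."""
    pairs = list(product(c1, c2))
    return sum(manhattan(genes[i], genes[j]) for i, j in pairs) / len(pairs)

def agglomerate(ids):
    """Start with one cluster per gene; repeatedly merge the nearest pair."""
    clusters = [(i,) for i in ids]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average_linkage(*pair))
        d = average_linkage(a, b)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

merges = agglomerate(sorted(genes))
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

With these definitions the merges occur at distances 33, 223, 631, 985 and 1115.5: genes 1 and 2 join first, then 5 and 6, then 3 and 4, then the {1, 2} and {5, 6} clusters, and finally everything. These merge heights are the vertical positions of the joins in the dendrogram.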
Issues in Cluster Analysis
A lot of clustering algorithms
A lot of distance/similarity metrics
Which clustering algorithm runs faster and uses
less memory?
How many clusters after all?
Are the clusters stable?
Are the clusters meaningful?
Which Clustering Method Should I Use?
What is the biological question?
Do I have a preconceived notion of how many clusters there should be?
How strict do I want to be? Split or join?
Can a gene be in multiple clusters?
Hard or soft boundaries between clusters?
The End
Thank you for taking this course. Bioinformatics is a very diverse and fascinating subject. We hope you all decide to continue your pursuit of it.
We will be very glad to answer your emails or schedule appointments to talk about any bioinformatics-related questions you might have.
We wish you all a wonderful summer break!