Course 341: Introduction to Bioinformatics

2004/2005, 2005/2006, 2006/2007

Moustafa Ghanem


Imperial College London

Course 341: Introduction to Bioinformatics

Answers to Microarray Bioinformatics Tutorial 2

(Review questions on clustering)


1. Describe what is meant by data clustering and how it can be used for the analysis of gene expression matrices.

Lecture 16 slide # 3-5

Clustering of data is a method by which a large set of data is grouped into clusters (groups) of smaller sets of similar data. In the context of gene expression matrices, where rows represent genes and columns represent measurements of gene expression values for samples under different conditions, clustering algorithms can be applied to find groups of similar genes, groups of similar samples, or both:



e.g. groups of genes with “similar expression profiles” (co-expressed genes), i.e. similar rows in the gene expression matrix;

or groups of samples (disease cell lines/tissues/toxicants) with “similar effects” on gene expression, i.e. similar columns in the gene expression matrix.
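As a minimal illustration (not part of the original answer), the sketch below clusters both the rows (genes) and the columns (samples) of a toy expression matrix with SciPy; the matrix expr, its values and the cluster counts are invented for the example.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy expression matrix: rows = genes, columns = samples (values invented).
    expr = np.array([
        [2.1, 2.0, 0.1, 0.2],   # gene 1
        [1.9, 2.2, 0.0, 0.3],   # gene 2, co-expressed with gene 1
        [0.1, 0.3, 2.4, 2.2],   # gene 3
        [0.2, 0.1, 2.3, 2.5],   # gene 4, co-expressed with gene 3
    ])

    # Group similar rows (co-expressed genes).
    genes = fcluster(linkage(expr, method='average'), t=2, criterion='maxclust')
    # Group similar columns (samples with similar effects) via the transpose.
    samples = fcluster(linkage(expr.T, method='average'), t=2, criterion='maxclust')
    print(genes, samples)   # e.g. [1 1 2 2] [1 1 2 2]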



2. Describe what is meant by a cluster centroid and what is meant by similarity metrics.

Lecture 16 slide # 6-14, 26


The centroid is taken to be a “virtual” representative object for a cluster. Mathematically, it can be calculated as a point in an M-dimensional space whose parameter values are the means of the parameter values of all the points in the cluster (where M is the number of features, parameters or dimensions used for describing each object).

It is a virtual object, since there does not need to be a real object in the cluster with the calculated values.


A similarity metric is a method used for quantifying the similarity between two objects. We typically represent objects as points in an M-dimensional space. Generally, the distance between two points is taken as a common metric to assess the similarity between them. The most commonly used distance metric is the Euclidean metric, which defines the distance between two points p = (p1, p2, ..., pM) and q = (q1, q2, ..., qM) as:

d(p, q) = \sqrt{ \sum_{i=1}^{M} (p_i - q_i)^2 }

Other metrics include the Manhattan distance, which is calculated as follows:

d(p, q) = \sum_{i=1}^{M} |p_i - q_i|





Make sure to describe the properties of a good similarity metric.

1. The distance between two profiles must be greater than or equal to zero; distances cannot be negative.

2. The distance between a profile and itself must be zero.

3. Conversely, if the distance between two profiles is zero, then the profiles must be identical.

4. The distance between profile A and profile B must be the same as the distance between profile B and profile A (symmetry).

5. The distance between profile A and profile C must be less than or equal to the sum of the distances between profiles A and B and profiles B and C (the triangle inequality).




Make sure to provide formulae for two different similarity metrics that can be used in data clustering.

Provided above: the Euclidean and Manhattan distances.
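As a small sketch (assuming plain Python tuples as profiles), the two formulae can be implemented and their metric properties spot-checked as follows:

    import math

    def euclidean(p, q):
        # d(p, q) = sqrt(sum_i (p_i - q_i)^2)
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def manhattan(p, q):
        # d(p, q) = sum_i |p_i - q_i|
        return sum(abs(a - b) for a, b in zip(p, q))

    p, q = (1, 1), (8, 6)
    assert euclidean(p, p) == 0                 # property 2: d(x, x) = 0
    assert euclidean(p, q) == euclidean(q, p)   # property 4: symmetry
    print(round(euclidean(p, q), 1))            # 8.6
    print(manhattan(p, q))                      # 12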



3. Describe the operation of the hierarchical clustering algorithms.

Lecture 16 slide # 16-24


Hierarchical clustering is a method that successively links objects with similar profiles to form a tree structure. The standard hierarchical clustering algorithm works as follows.

Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of hierarchical clustering is this:

1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item.

2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one less cluster.

3. Compute the distances (similarities) between the new cluster and each of the old clusters.

4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
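A minimal from-scratch sketch of these four steps (not the course's code; it uses single linkage with Euclidean distance, and for clarity recomputes pairwise distances on each pass rather than maintaining the matrix):

    import math

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def hierarchical(points):
        # Step 1: each item starts in its own cluster.
        clusters = [[i] for i in range(len(points))]
        merges = []
        while len(clusters) > 1:
            # Step 2: find the closest pair of clusters (single linkage).
            d, i, j = min(
                (min(dist(points[a], points[b])
                     for a in clusters[i] for b in clusters[j]), i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters)))
            # Steps 3-4: merge the pair and repeat until one cluster remains.
            merges.append((clusters[i], clusters[j], round(d, 1)))
            clusters[i] += clusters[j]
            del clusters[j]
        return merges

    print(hierarchical([(1, 1), (8, 6), (20, 3), (21, 2)]))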




Make sure you explain what is meant by a similarity matrix.

At each step of the algorithm, we need to compute a similarity matrix (or alternatively a distance matrix) which represents the similarity (alternatively, distance) between the N objects being clustered. At each step you use the matrix to find the two elements with maximum similarity (alternatively, minimum distance). The two elements are merged into one element and the matrix is recalculated. The matrix is thus updated during the operation of the algorithm by reducing it to a smaller matrix at each step. You start with an NxN matrix, then an (N-1)x(N-1) matrix, and so on.





Make sure you explain what is meant by single linkage, average linkage and complete linkage.

Linkage methods refer to how the distance between clusters (groups of objects) is calculated. Whereas it is straightforward to calculate the distance between two objects, we have various options when calculating the distance between clusters. These include the single linkage, average linkage and complete linkage methods.


In single linkage we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster.

In complete linkage we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.

In average linkage we consider the distance between one cluster and another cluster to be equal to the average distance over all pairs of members, one from each cluster.
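As a short sketch (clusters represented as lists of points; the example clusters {A, G} and {B, F} are taken from question 11 below):

    import math

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def single_link(c1, c2):      # shortest pairwise distance
        return min(dist(p, q) for p in c1 for q in c2)

    def complete_link(c1, c2):    # longest pairwise distance
        return max(dist(p, q) for p in c1 for q in c2)

    def average_link(c1, c2):     # mean over all pairwise distances
        return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

    ag, bf = [(1, 1), (1, 2)], [(8, 6), (7, 7)]
    print(round(single_link(ag, bf), 1))    # 7.8
    print(round(complete_link(ag, bf), 1))  # 8.6
    print(round(average_link(ag, bf), 1))   # about 8.2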




Make sure you explain what is meant by a dendrogram.

Dendrograms are used to represent the outputs of hierarchical clustering algorithms. A dendrogram is a binary tree structure whose leaf elements represent the data elements, which are joined up the tree based on their similarity. Internal nodes represent sub-clusters of elements. The root of the tree represents the cluster containing the whole data collection.

The length of each tree branch represents the distance between the clusters it joins.


4. Describe the operation of the k-means clustering algorithm using pseudocode.

Lecture 16 slide # 26

Given a set of N items to be grouped into k clusters:

1. Select an initial partition of k clusters.

2. Assign each object to the cluster with the closest centroid.

3. Compute the new centroids of the clusters.

4. Repeat steps 2 and 3 until no object changes cluster.
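A minimal Python sketch of this pseudocode (math.dist needs Python 3.8 or later; the toy points are invented):

    import math

    def kmeans(points, centroids):
        # Step 1: the initial partition is given by the starting centroids.
        assignment = None
        while True:
            # Step 2: assign each object to the closest centroid.
            new = [min(range(len(centroids)),
                       key=lambda k: math.dist(p, centroids[k]))
                   for p in points]
            # Step 4: stop when no object changes cluster.
            if new == assignment:
                return centroids, assignment
            assignment = new
            # Step 3: recompute each centroid as the mean of its members.
            for k in range(len(centroids)):
                members = [p for p, a in zip(points, assignment) if a == k]
                if members:
                    centroids[k] = tuple(sum(c) / len(members)
                                         for c in zip(*members))

    print(kmeans([(1,), (2,), (3,), (11,), (12,), (13,)], [(1,), (2,), (3,)]))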


5. Compare and contrast the advantages of hierarchical clustering and k-means clustering.

Lecture 16 slide # 34

The table in the slides provides the required comparison from a computational perspective. In general, hierarchical clustering is more informative, since it provides a more detailed output showing the similarity between individual items in the data set. However, its space and time complexity are higher than those of k-means clustering, since you need to start with an NxN matrix; in k-means you don't. Also, the output of k-means may change based on the seed clusters, so it can generate different results each time you execute it.


6. Explain briefly the operation of the SOM algorithm and how it relates to the k-means algorithm.

Lecture 16 slide # 35


Course 341: Introduction to Bioinformatics

2004/2005, 2005/2006, 2006/2007

Moustafa Ghane
m


Imperial College London

7. Explain briefly what is meant by dimensionality reduction and why it may be important in data analysis.

Lecture 16 slide # 36


8. Explain briefly how both MDS and PCA work and compare between them.

Lecture 16 slide # 37-30


9. What is the main difference between clustering and classification?

In classification you already know the groups that the data is divided into; this is provided by a label (e.g. diseased vs. healthy), and you are trying to find a model in terms of the dimensions (d1…dm) that can predict the class. This type of analysis is useful for predictive modelling.

In clustering you are trying to divide the data into groups based on the values of their dimensions. You choose these groups so as to maximise the similarity inside the groups and maximise the distance between them. This type of analysis is useful for exploratory analysis.
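As a hedged illustration of the contrast using scikit-learn (not part of the course answer; the data and labels are invented):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    # Invented 2-D measurements for six samples.
    X = np.array([[0.1, 0.2], [0.2, 0.1], [0.0, 0.3],
                  [2.1, 2.0], [2.2, 2.2], [1.9, 2.1]])

    # Clustering (unsupervised, exploratory): groups are discovered
    # from the dimensions alone, with no labels.
    print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))

    # Classification (supervised, predictive): known labels
    # (e.g. healthy = 0, diseased = 1) are used to fit a model.
    y = np.array([0, 0, 0, 1, 1, 1])
    model = LogisticRegression().fit(X, y)
    print(model.predict([[2.0, 1.9]]))   # predicted class for a new sample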

(Problems)


10. If you use k-means clustering on the data in the table below to group the following people by age into 3 groups, how many steps would it take the algorithm to converge if you start with centroids defined by Andy, Burt and Claire? How many steps would be needed if you start with Andy, Ed and Harry?


ID      Age
Andy     1
Burt     2
Claire   3
Dave    11
Ed      12
Fred    13
George  21
Harry   22
Ian     23


a)

Cluster 1: Initial centroid 1. Assigned items (Andy). New centroid 1.
Cluster 2: Initial centroid 2. Assigned items (Burt). New centroid 2.
Cluster 3: Initial centroid 3. Assigned items (Claire, Dave, Ed, Fred, George, Harry, Ian). New centroid: 15.

Cluster 1: Initial centroid 1. Assigned items (Andy). New centroid 1.
Cluster 2: Initial centroid 2. Assigned items (Burt, Claire). New centroid 2.5.
Cluster 3: Initial centroid 15. Assigned items (Dave, Ed, Fred, George, Harry, Ian). New centroid: 17.

Cluster 1: Initial centroid 1. Assigned items (Andy). New centroid 1.
Cluster 2: Initial centroid 2.5. Assigned items (Burt, Claire). New centroid 2.5.
Cluster 3: Initial centroid 17. Assigned items (Dave, Ed, Fred, George, Harry, Ian). New centroid: 17.

STOP, 3 steps.


b)

Cluster 1: Initial centroid 1, items (Andy, Burt, Claire), new centroid 2.
Cluster 2: Initial centroid 12, items (Dave, Ed, Fred), new centroid 12.
Cluster 3: Initial centroid 22, items (George, Harry, Ian), new centroid 22.

Cluster 1: Initial centroid 2, items (Andy, Burt, Claire), final centroid 2.
Cluster 2: Initial centroid 12, items (Dave, Ed, Fred), final centroid 12.
Cluster 3: Initial centroid 22, items (George, Harry, Ian), final centroid 22.

STOP, 2 steps, but clearly better results.
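A quick sketch to check these step counts (counting one assignment pass per step, as in the answer above; it assumes no cluster empties, which holds for this data):

    def kmeans_1d(ages, centroids, k=3):
        steps, assignment = 0, None
        while True:
            steps += 1
            new = [min(range(k), key=lambda c: abs(a - centroids[c]))
                   for a in ages]
            if new == assignment:            # converged: nothing moved
                return steps, centroids
            assignment = new
            centroids = [sum(a for a, g in zip(ages, assignment) if g == c)
                         / assignment.count(c) for c in range(k)]

    ages = [1, 2, 3, 11, 12, 13, 21, 22, 23]
    print(kmeans_1d(ages, [1, 2, 3]))     # (3, [1.0, 2.5, 17.0])
    print(kmeans_1d(ages, [1, 12, 22]))   # (2, [2.0, 12.0, 22.0])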



11. Use the k-means clustering algorithm to find 3 clusters in the following data set. Assume initial cluster centroids defined by A, B and C. Provide a graphical representation of the clusters.


ID   Dimension 1   Dimension 2
A         1             1
B         8             6
C        20             3
D        21             2
E        11             7
F         7             7
G         1             2
H         6             8



Cluster 1: Initial centroid (1,1), items (A, G), new centroid (1,1.5).
Cluster 2: Initial centroid (8,6), items (B, E, F, H), new centroid (8,7).
Cluster 3: Initial centroid (20,3), items (C, D), new centroid (20.5,2.5).

Cluster 1: Initial centroid (1,1.5), items (A, G), final centroid (1,1.5).
Cluster 2: Initial centroid (8,7), items (B, E, F, H), final centroid (8,7).
Cluster 3: Initial centroid (20.5,2.5), items (C, D), final centroid (20.5,2.5).

STOP





















[Scatter plot of the eight points A-H, showing the three clusters {A, G}, {B, E, F, H} and {C, D}.]
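The same result can be checked with scikit-learn by passing the three starting centroids explicitly (a verification sketch, not part of the original answer):

    import numpy as np
    from sklearn.cluster import KMeans

    points = np.array([[1, 1], [8, 6], [20, 3], [21, 2],
                       [11, 7], [7, 7], [1, 2], [6, 8]])    # A..H
    init = np.array([[1, 1], [8, 6], [20, 3]])              # start at A, B, C

    km = KMeans(n_clusters=3, init=init, n_init=1).fit(points)
    for name, cluster in zip("ABCDEFGH", km.labels_):
        print(name, "-> cluster", cluster)
    print(km.cluster_centers_)   # expect (1, 1.5), (8, 7), (20.5, 2.5)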


12. Use hierarchical clustering on the data of question 11 using a Euclidean metric [1] in the following cases:

a. Using single linkage
b. Using complete linkage

Make sure to show the values of your distance matrix at each step.

[1] This is a rather big problem to solve by hand, but it is given to show how you can do it.

I build a matrix based on distances (not similarities), so at each step I scan for the minimum value. If I used a similarity matrix, I would have to choose the maximum value instead.


a. Using single linkage

Note that I only have to calculate the distances once; I will operate only on this matrix from now on.

      A     B     C     D     E     F     G     H
A     X   8.6  19.1    20  11.7   8.5     1   8.6
B     X     X  12.3  13.6   3.2   1.4   8.1   2.8
C     X     X     X   1.4   9.8  13.6    19  14.9
D     X     X     X     X  11.2  14.9    20  16.2
E     X     X     X     X     X     4  11.2   5.1
F     X     X     X     X     X     X   7.8   1.4
G     X     X     X     X     X     X     X   7.8
H     X     X     X     X     X     X     X     X

A and G are the most similar items, so I merge them to get the first link between two elements. I draw the connection and label the length on the scale bar.

[Dendrogram fragment: A and G joined at height 1.]

I need to update the matrix: I delete the row and column for A and the row and column for G, and insert a new row and column called A-G. The entries for A-G need to be calculated. Since I use single linkage, I keep the minimum value, i.e. dist(A-G, B) = min(dist(A,B), dist(G,B)) = min(8.6, 8.1) = 8.1, the distance from G to B. All other entries that do not involve A-G remain the same. The updated values are shown below.

     A-G     B     C     D     E     F     H
A-G    X   8.1    19    20  11.2   7.8   7.8
B      X     X  12.3  13.6   3.2   1.4   2.8
C      X     X     X   1.4   9.8  13.6  14.9
D      X     X     X     X  11.2  14.9  16.2
E      X     X     X     X     X     4   5.1
F      X     X     X     X     X     X   1.4
H      X     X     X     X     X     X     X

I repeat the process. This time I have a choice, since the distance between F and B is 1.4, the distance between F and H is also 1.4, and so is the distance between C and D. I arbitrarily choose to link F and B together.

[Dendrogram fragments: A and G joined at height 1; B and F joined at height 1.4.]









After merging B and F, the updated matrix is:

     A-G   B-F     C     D     E     H
A-G    X   7.8    19    20  11.2   7.8
B-F    X     X  12.3  13.6   3.2   1.4
C      X     X     X   1.4   9.8  14.9
D      X     X     X     X  11.2  16.2
E      X     X     X     X     X   5.1
H      X     X     X     X     X     X

I repeat, and now I link B-F and H (distance 1.4).

       A-G  B-F-H     C     D     E
A-G      X    7.8    19    20  11.2
B-F-H    X      X  12.3  13.6   3.2
C        X      X     X   1.4   9.8
D        X      X     X     X  11.2
E        X      X     X     X     X

Now I link C and D (distance 1.4).

       A-G  B-F-H   C-D     E
A-G      X    7.8    19  11.2
B-F-H    X      X  12.3   3.2
C-D      X      X     X   9.8
E        X      X     X     X

I now link B-F-H and E (distance 3.2).




























         A-G  B-F-H-E   C-D
A-G        X      7.8    19
B-F-H-E    X        X   9.8
C-D        X        X     X

I now link A-G and B-F-H-E (distance 7.8).

             A-G-B-F-H-E   C-D
A-G-B-F-H-E            X   9.8
C-D                    X     X

Finally I link A-G-B-F-H-E and C-D (distance 9.8), giving me the final dendrogram shown below. Compare this to the scatter plot shown in the previous problem and see if it makes sense.

[Final single-linkage dendrogram: F and B join at 1.4, then H at 1.4 and E at 3.2; A and G join at 1; the two groups join at 7.8; C and D (joined at 1.4) join the rest at 9.8.]

Here is the dendrogram generated by the KDE data mining tools.

[KDE dendrogram omitted.]
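To check the hand calculation, SciPy can produce both linkages directly from the eight points of question 11 (a verification sketch, not part of the original answer):

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import pdist

    points = np.array([[1, 1], [8, 6], [20, 3], [21, 2],
                       [11, 7], [7, 7], [1, 2], [6, 8]])   # A..H

    d = pdist(points)   # condensed Euclidean distance matrix
    for method in ('single', 'complete'):
        # Each linkage row: the two merged clusters, merge distance, new size.
        print(method)
        print(np.round(linkage(d, method=method), 1))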



b)

For complete linkage, we do the same thing, but when updating the matrix we choose the maximum distance between clusters rather than the minimum distance.

I still start by choosing A and G, with the same initial matrix, since they still have the minimum distance.

      A     B     C     D     E     F     G     H
A     X   8.6  19.1    20  11.7   8.5     1   8.6
B     X     X  12.3  13.6   3.2   1.4   8.1   2.8
C     X     X     X   1.4   9.8  13.6    19  14.9
D     X     X     X     X  11.2  14.9    20  16.2
E     X     X     X     X     X     4  11.2   5.1
F     X     X     X     X     X     X   7.8   1.4
G     X     X     X     X     X     X     X   7.8
H     X     X     X     X     X     X     X     X

Now when updating the matrix, I set the distance between A-G and B to be the maximum of dist(A,B) and dist(G,B), i.e. 8.6 rather than 8.1 as in the previous case.

     A-G     B     C     D     E     F     H
A-G    X   8.6  19.1    20  11.7   8.5   8.6
B      X     X  12.3  13.6   3.2   1.4   2.8
C      X     X     X   1.4   9.8  13.6  14.9
D      X     X     X     X  11.2  14.9  16.2
E      X     X     X     X     X     4   5.1
F      X     X     X     X     X     X   1.4
H      X     X     X     X     X     X     X

I choose to merge B and F, since they have the minimum distance.

     A-G   B-F     C     D     E     H
A-G    X   8.6  19.1    20  11.7   8.6
B-F    X     X  13.6  14.9     4   2.8
C      X     X     X   1.4   9.8  14.9
D      X     X     X     X  11.2  16.2
E      X     X     X     X     X   5.1
H      X     X     X     X     X     X

I choose to merge B-F and H, etc.

Here is the dendrogram generated by the KDE data mining tool. First compare it to the one above; then generate your own dendrogram and compare it to the one below.

[KDE complete-linkage dendrogram omitted.]


13. The following table shows the gene expression values for 8 genes under five types of cancer. You are interested in discovering the similarity relationship between the eight genes.

ID   C1   C2   C3   C4   C5
A     1    1    1    1    2
B     1    2    1    1    1
C    14   15   15   15   15
D    15   15   15   15   15
E    16   16   16   16   16
F     6    6    5    6    6
G     4    4    4    4    4
H     5    5    5    5    5


a. Using Manhattan distance and single linkage, show the resulting dendrogram.

Work out the calculation yourself by hand. When you do it, you will end up with a dendrogram looking like the one below.

[Dendrogram omitted.]

Note that even though there are more dimensions than in the previous problem (five features as opposed to only 2), you will mainly be dealing with the same size of distance matrix (8x8), since this is defined by the number of elements being clustered. In general it will be as tedious to solve as the previous one, but get your hand working at it to figure out the pattern of doing it. Clearly, as the computation progresses, the matrix size gets smaller.
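If you want to check your hand calculation, SciPy accepts the Manhattan ('cityblock') metric directly (a sketch, not part of the original answer):

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    genes = np.array([[1, 1, 1, 1, 2],        # A
                      [1, 2, 1, 1, 1],        # B
                      [14, 15, 15, 15, 15],   # C
                      [15, 15, 15, 15, 15],   # D
                      [16, 16, 16, 16, 16],   # E
                      [6, 6, 5, 6, 6],        # F
                      [4, 4, 4, 4, 4],        # G
                      [5, 5, 5, 5, 5]])       # H

    # Manhattan distance with single linkage, as in part (a).
    Z = linkage(genes, method='single', metric='cityblock')
    print(Z)   # each row: merged clusters, merge distance, new cluster size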


b. How would memory storage requirements change if you use complete linkage? If you use average linkage?

In complete linkage the requirements are the same: you just pick values from the initial distance matrix, but update them differently.

In average linkage you need to calculate the distance between every pair of elements in both clusters. You would need to keep the initial distance matrix to look up this information, in addition to the one you are updating.


14. Based on the table in question 13, use hierarchical clustering (Manhattan distance and single linkage) to study the similarity between the five cancer types (C1..C5). How can this form of analysis be useful?

The analysis is useful when you want to study similarity between diseases (see question 4 in tutorial 1). Here is the distance matrix for this problem; it is easier to calculate because of the Manhattan distance.




      C1   C2   C3   C4   C5
C1     X    X    X    X    X
C2     2    X    X    X    X
C3     2    2    X    X    X
C4     1    1    1    X    X
C5     2    2    2    1    X



There are many different ways to proceed, since there are several distances equal to 1; the dendrogram can take different shapes depending on which diseases you link up, since the distance that separates them is always 1. One possible sequence of merges is shown below.









After merging C1 and C4 (distance 1):

        C1-4   C2   C3   C5
C1-4      X    X    X    X
C2        1    X    X    X
C3        1    2    X    X
C5        1    2    2    X

After merging C1-4 and C2 (distance 1):

        C14-2   C3   C5
C14-2      X    X    X
C3         1    X    X
C5         1    2    X

After merging C14-2 and C3 (distance 1):

        C142-3   C5
C142-3      X    X
C5          1    X

Finally, C142-3 and C5 merge at distance 1.
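The initial distance matrix above can be reproduced by transposing the gene table so that each cancer type becomes a point (a sketch, not part of the original answer):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    genes = np.array([[1, 1, 1, 1, 2], [1, 2, 1, 1, 1],
                      [14, 15, 15, 15, 15], [15, 15, 15, 15, 15],
                      [16, 16, 16, 16, 16], [6, 6, 5, 6, 6],
                      [4, 4, 4, 4, 4], [5, 5, 5, 5, 5]])   # genes A..H

    # Columns C1..C5 become rows after transposing; pairwise Manhattan
    # distances between them give the matrix shown above.
    print(squareform(pdist(genes.T, metric='cityblock')).astype(int))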