Bioinformatics Coursework:

Oct 1, 2013

Course 341: Introduction to Bioinformatics

Microarray Bioinformatics Tutorial 2

(Review questions on clustering)


1. Describe what is meant by data clustering and how it can be used for the analysis of gene expression matrices.


2. Describe what is meant by a cluster centroid and what is meant by similarity metrics.
   - Make sure to describe the properties of a good similarity metric.
   - Make sure to provide formulae for two different similarity metrics that can be used in data clustering.
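
As a hedged illustration of the kind of formula the second point asks for, two measures that appear later in this sheet are the Euclidean and Manhattan distances. The notation is an assumption, not taken from the course notes: x and y are two expression profiles of length n. Both are strictly dissimilarity measures, so a smaller value means the profiles are more alike.

```latex
d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
\qquad
d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert
```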


3. Describe the operation of the hierarchical clustering algorithms.
   - Make sure you explain what is meant by a similarity matrix.
   - Make sure you explain what is meant by single linkage, average linkage and complete linkage.
   - Make sure you explain what is meant by a dendrogram.


4. Describe the operation of the k-means clustering algorithm using pseudocode.


5. Compare and contrast the advantages of hierarchical clustering and k-means clustering.


6. Explain briefly the operation of the SOM algorithm and how it relates to the k-means algorithm.


7. Explain briefly what is meant by dimensionality reduction and why it may be important in data analysis.


8. Explain briefly how both MDS and PCA work, and compare them.


9. What is the main difference between clustering and classification?


(Problems)


10. Suppose you use k-means clustering on the data in the table below to group the following people by age into 3 groups. How many steps would it take the algorithm to converge if you start with centroids defined by Andy, Burt and Claire? How many steps would be needed if you start with Andy, Ed and Harry?


ID        Age
Andy      1
Burt      2
Claire    3
Dave      11
Ed        12
Fred      13
George    21
Harry     22
Ian       23
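
A minimal sketch of how the two starting points could be compared, assuming a standard Lloyd-style k-means on scalar values. The function name `kmeans_1d` and the convention of counting a "step" as one assignment pass (including the final pass that confirms nothing changed) are assumptions for illustration; a different counting convention would shift the numbers by one.

```python
# Ages from the table above, in row order (Andy .. Ian).
AGES = [1, 2, 3, 11, 12, 13, 21, 22, 23]


def kmeans_1d(values, centroids, max_passes=100):
    """Lloyd-style k-means on scalars; returns (centroids, labels, passes)."""
    centroids = list(centroids)
    labels = None
    for passes in range(1, max_passes + 1):
        # Assignment step: index of the nearest centroid for every value.
        new_labels = [
            min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
            for v in values
        ]
        if new_labels == labels:
            # No assignment changed in this pass, so the algorithm has converged.
            return centroids, labels, passes
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = sum(members) / len(members)
    return centroids, labels, max_passes


# Start from Andy, Burt, Claire (ages 1, 2, 3).
centroids_abc, labels_abc, passes_abc = kmeans_1d(AGES, [1, 2, 3])
# Start from Andy, Ed, Harry (ages 1, 12, 22).
centroids_aeh, labels_aeh, passes_aeh = kmeans_1d(AGES, [1, 12, 22])

print(centroids_abc, passes_abc)
print(centroids_aeh, passes_aeh)
```

Under these assumptions the two runs end in different partitions, which illustrates how sensitive k-means is to the choice of initial centroids.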


11. Use the k-means clustering algorithm to find 3 clusters in the following data set. Assume the initial cluster centroids are defined by A, B and C. Provide a graphical representation of the clusters.


ID    Dimension 1    Dimension 2
A     1              1
B     8              6
C     20             3
D     21             2
E     11             7
F     7              7
G     1              2
H     6              8



12. Use hierarchical clustering on the data of question 11 with a Euclidean metric in the following cases:

    a. Using single linkage

    b. Using complete linkage

Make sure to show the values of your distance matrix at each step.
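
A hand-worked answer can be checked against a small sketch like the one below, which assumes a naive agglomerative procedure on the question 11 points. The function name `agglomerate` and the merge-record layout are illustrative assumptions. For brevity the sketch recomputes cluster distances from the raw points at each step rather than updating a stored distance matrix, and ties between equally close pairs are broken by list order, so the order of tied merges may differ from a manual solution.

```python
import math
from itertools import combinations

# Points from the table in question 11.
POINTS = {
    "A": (1, 1), "B": (8, 6), "C": (20, 3), "D": (21, 2),
    "E": (11, 7), "F": (7, 7), "G": (1, 2), "H": (6, 8),
}


def agglomerate(points, linkage="single"):
    """Naive agglomerative clustering with a Euclidean metric.

    Returns a list of (members_a, members_b, distance) tuples, one per merge.
    """
    # Single linkage uses the closest pair of members, complete the farthest.
    pick = {"single": min, "complete": max}[linkage]
    clusters = [frozenset([name]) for name in points]
    merges = []

    def cluster_distance(pair):
        a, b = pair
        return pick(math.dist(points[p], points[q]) for p in a for q in b)

    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest linkage distance.
        a, b = min(combinations(clusters, 2), key=cluster_distance)
        merges.append((sorted(a), sorted(b), cluster_distance((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


single = agglomerate(POINTS, "single")
complete = agglomerate(POINTS, "complete")

for left, right, dist in single:
    print(f"single:   {left} + {right} at {dist:.3f}")
for left, right, dist in complete:
    print(f"complete: {left} + {right} at {dist:.3f}")
```

The merge distances printed here correspond to the heights at which branches join in the dendrogram.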


13. The following table shows the gene expression values for 8 genes under five types of cancer. You are interested in discovering the similarity relationship between the eight genes.


ID    C1    C2    C3    C4    C5
A     1     1     1     1     2
B     1     2     1     1     1
C     14    15    15    15    15
D     15    15    15    15    15
E     16    16    16    16    16
F     6     6     5     6     6
G     4     4     4     4     4
H     5     5     5     5     5


    a. Using Manhattan distance and single linkage, show the resulting dendrogram.

    b. How would memory storage requirements change if you use complete linkage? If you use average linkage?
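
Part (a) starts from the pairwise Manhattan distances between the gene profiles. The sketch below shows one way to tabulate them; the names `GENES`, `manhattan` and `distance` are assumptions introduced for illustration only.

```python
# Expression profiles (C1..C5) for each gene, from the table above.
GENES = {
    "A": (1, 1, 1, 1, 2),
    "B": (1, 2, 1, 1, 1),
    "C": (14, 15, 15, 15, 15),
    "D": (15, 15, 15, 15, 15),
    "E": (16, 16, 16, 16, 16),
    "F": (6, 6, 5, 6, 6),
    "G": (4, 4, 4, 4, 4),
    "H": (5, 5, 5, 5, 5),
}


def manhattan(u, v):
    """Sum of absolute coordinate differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(u, v))


names = list(GENES)
# Full symmetric matrix stored as a dict keyed by (row gene, column gene).
distance = {(p, q): manhattan(GENES[p], GENES[q]) for p in names for q in names}

# Print the matrix with one row per gene.
print("    " + "".join(f"{q:>4}" for q in names))
for p in names:
    print(f"{p:>4}" + "".join(f"{distance[p, q]:>4}" for q in names))
```

Because the matrix is symmetric with a zero diagonal, only the entries above the diagonal need to be written out when working by hand.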


14. Based on the table in question 13, use hierarchical clustering (Manhattan distance and single linkage) to study the similarity between the five cancer types (C1..C5). How can this form of analysis be useful?