# Bioinformatics Coursework:

Oct 1, 2013 (4 years and 7 months ago)

Course 341: Introduction to Bioinformatics

Microarray Bioinformatics Tutorial 3

(Review questions on data normalisation)

1. Explain what is meant by data normalisation and discuss why it is an important step before the
analysis of the output of microarray experiments.

2. Describe the main sources that would lead to microarray data variability. Provide examples of
how you may use normalisation methods to address them.

3. You have collected data from a cDNA microarray. The green channel is used to measure gene
expression in normal tissue, and the red channel is used to measure gene expression in
diseased tissue. You believe that the output may be biased to the green channel. Explain what
type of plot you can use to test your assumption, and how you could correct the measured
results.
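One plot often used for this kind of check is an MA plot, where each spot is described by M = log2(R/G) and A = ½·log2(R·G). The sketch below is only an illustration: the intensities are made up, the names `ma_values` and `median_centre` are hypothetical, and the correction shown is simple global median centring rather than an intensity-dependent method such as lowess.

```python
import math
from statistics import median

# Hypothetical red (diseased) and green (normal) intensities for five spots.
red = [520.0, 1310.0, 240.0, 4100.0, 860.0]
green = [640.0, 1500.0, 330.0, 4700.0, 1010.0]

def ma_values(red, green):
    """Return (M, A) per spot: M = log2(R/G), A = 0.5 * log2(R*G)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def median_centre(m):
    """Global normalisation: shift every M value so the median M is zero."""
    shift = median(m)
    return [value - shift for value in m]

m, a = ma_values(red, green)
m_norm = median_centre(m)

# A median M below zero suggests an overall bias towards the green channel.
print("median M before:", round(median(m), 3))
print("median M after: ", round(median(m_norm), 3))
```

Plotting M against A (for example with a scatter plot) would show whether the bias is constant or depends on intensity; in the latter case a single global shift like this would not be enough.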

4. Two labs are running experiments on the APO1 gene. Suggest one method that would allow
them to compare their results.

5. In the context of microarray experiments, explain what is meant by biological and technical
variability of the data.

(Review questions on data classification)

6. Explain how microarrays can be used as a basis for both diagnostic and prognostic tools.

7. Describe what is meant by a decision tree and how it can be used to classify data.

8. Describe the generic decision tree construction algorithm and describe how information gain is
used to choose the tests at the internal nodes of the tree.

9. Describe the basic idea behind the naïve Bayes classifier.

(Problems)

10. Consider the following sets of class values. For each set, calculate the I(p,n) metric.

Data Set 1: {p, p, p, p, p, p, p, p}.

Data Set 2: {p, p, p, p, p, p, p, n}.

Data Set 3: {p, p, p, p, p, p, n, n}.

Data Set 4: {p, p, p, p, p, n, n, n}.

Data Set 5: {p, p, p, p, n, n, n, n}.

Data Set 6: {p, p, p, n, n, n, n, n}.

Data Set 7: {p, p, n, n, n, n, n, n}.

Data Set 8: {p, n, n, n, n, n, n, n}.

Data Set 9: {n, n, n, n, n, n, n, n}.

Can you explain your findings? How does this relate to the Information gain concept
introduced in the lectures?
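The values for Question 10 can be checked with a few lines of code. This sketch assumes that I(p,n) is the usual two-class information measure from decision tree learning, I(p,n) = −(p/(p+n))·log2(p/(p+n)) − (n/(p+n))·log2(n/(p+n)), with 0·log2(0) taken as 0; the function name `info` is a hypothetical choice.

```python
import math

def info(p, n):
    """I(p, n) in bits for p positive and n negative examples (assumed definition)."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:  # treat 0 * log2(0) as 0
            frac = count / total
            result -= frac * math.log2(frac)
    return result

# Data Sets 1 to 9 each hold eight values, with p falling from 8 to 0.
for p in range(8, -1, -1):
    n = 8 - p
    print(f"Data Set {9 - p}: p={p}, n={n}, I(p,n)={info(p, n):.4f}")
```

Under this definition the measure is 0 for a pure set, rises to 1 bit for an even split, and is symmetric in p and n.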

11. Given the following training data set about exotic dishes

| Instance | Temperature | Taste | Size | Appealing |
| --- | --- | --- | --- | --- |
| 1 | Hot | Salty | Small | No |
| 2 | Cold | Sweet | Large | No |
| 3 | Cold | Sweet | Large | No |
| 4 | Cold | Sour | Small | Yes |
| 5 | Hot | Sour | Small | Yes |
| 6 | Hot | Salty | Large | No |
| 7 | Hot | Sour | Large | Yes |
| 8 | Cold | Sweet | Small | Yes |
| 9 | Cold | Sweet | Small | Yes |
| 10 | Hot | Salty | Large | No |

a. What is the information gain associated with choosing the attribute “Taste” as the root
of the decision tree?

b. Draw the full decision tree whose root is given by “Taste”.

c. Use the tree to predict the class value for the records given by

| Instance | Temperature | Taste | Size | Appealing |
| --- | --- | --- | --- | --- |
| 11 | Hot¹ | Salty | Small | ? |
| 12 | Cold | Sweet | Large | ? |

¹ Note there was an error in the printed sheet. The correct entry here is “Hot” not “Sour”.

12. Given the following training data set

| Instance | A1 | A2 | Class Value |
| --- | --- | --- | --- |
| 1 | M | N | 1 |
| 2 | T | N | 1 |
| 3 | F | N | 1 |
| 4 | F | N | 1 |
| 5 | T | O | 1 |
| 6 | M | N | 0 |
| 7 | F | O | 0 |
| 8 | T | O | 0 |
| 9 | T | N | 0 |
| 10 | F | O | 0 |

a. Based on the data set, calculate the amount of information needed to decide if an arbitrary
record belongs to either class 1 or class 0.

b. Construct a decision tree from this training data set based on using the concept of
information gain as a metric for choosing the nodes of the tree.

c. Predict the class value for the following two records.

| Instance | A1 | A2 | Class Value |
| --- | --- | --- | --- |
| 11 | F | N | ? |
| 12 | T | N | ? |

d. What is your prediction confidence in each case?

e. Generate two decision rules from the tree.
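The information-gain calculations asked for in Questions 11 and 12 can be cross-checked with a short script. This is a sketch under stated assumptions: it uses the standard entropy-based definition of information gain, the helper names `entropy` and `information_gain` are hypothetical, and the rows are transcribed from the exotic dishes training set (Question 11) and the A1/A2 training set (Question 12).

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Information (in bits) needed to identify the class of one record."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder

# Question 11: exotic dishes (instances 1 to 10).
dish_fields = ("Temperature", "Taste", "Size", "Appealing")
dishes = [dict(zip(dish_fields, r)) for r in [
    ("Hot", "Salty", "Small", "No"), ("Cold", "Sweet", "Large", "No"),
    ("Cold", "Sweet", "Large", "No"), ("Cold", "Sour", "Small", "Yes"),
    ("Hot", "Sour", "Small", "Yes"), ("Hot", "Salty", "Large", "No"),
    ("Hot", "Sour", "Large", "Yes"), ("Cold", "Sweet", "Small", "Yes"),
    ("Cold", "Sweet", "Small", "Yes"), ("Hot", "Salty", "Large", "No"),
]]

# Question 12: attributes A1 and A2 (instances 1 to 10).
record_fields = ("A1", "A2", "Class")
records = [dict(zip(record_fields, r)) for r in [
    ("M", "N", 1), ("T", "N", 1), ("F", "N", 1), ("F", "N", 1), ("T", "O", 1),
    ("M", "N", 0), ("F", "O", 0), ("T", "O", 0), ("T", "N", 0), ("F", "O", 0),
]]

for name in ("Temperature", "Taste", "Size"):
    print(f"Gain({name}) = {information_gain(dishes, name, 'Appealing'):.3f}")
print(f"I(class) for Question 12 = {entropy([r['Class'] for r in records]):.3f}")
for name in ("A1", "A2"):
    print(f"Gain({name}) = {information_gain(records, name, 'Class'):.3f}")
```

The script only reports the gains; choosing the split with the largest gain at each node and recursing on the resulting subsets is what builds the tree itself.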

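13. Given the following training data set

| Sample | Gene 1 | Gene 2 | Gene 3 | Gene 4 | Diseased |
| --- | --- | --- | --- | --- | --- |
| A | High | Medium | High | Medium | Yes |
| B | Low | Medium | Low | High | Yes |
| C | Medium | High | High | Medium | No |
| D | Low | Low | Low | Low | Yes |
| E | Medium | Medium | High | Medium | No |

a. Show how a naïve Bayesian classifier would classify the following sample

| Sample | Gene 1 | Gene 2 | Gene 3 | Gene 4 | Diseased |
| --- | --- | --- | --- | --- | --- |
| X | Low | Medium | High | Low | ?? |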
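A naïve Bayes score for sample X in Question 13 can be sketched as the class prior multiplied by the per-gene conditional frequencies. The code below assumes plain frequency estimates with no smoothing (so a value never seen with a class gives that class a score of zero); the function name `naive_bayes_scores` is hypothetical, and exact fractions are used so the arithmetic can be compared with a hand calculation.

```python
from fractions import Fraction

# Training samples A to E: (Gene 1, Gene 2, Gene 3, Gene 4, Diseased).
training = [
    ("High", "Medium", "High", "Medium", "Yes"),
    ("Low", "Medium", "Low", "High", "Yes"),
    ("Medium", "High", "High", "Medium", "No"),
    ("Low", "Low", "Low", "Low", "Yes"),
    ("Medium", "Medium", "High", "Medium", "No"),
]
sample_x = ("Low", "Medium", "High", "Low")

def naive_bayes_scores(rows, query):
    """Return P(class) * product of P(value_i | class) for each class (unsmoothed)."""
    scores = {}
    for label in sorted({row[-1] for row in rows}):
        members = [row for row in rows if row[-1] == label]
        score = Fraction(len(members), len(rows))  # class prior
        for i, value in enumerate(query):
            matches = sum(1 for row in members if row[i] == value)
            score *= Fraction(matches, len(members))  # conditional frequency
        scores[label] = score
    return scores

scores = naive_bayes_scores(training, sample_x)
prediction = max(scores, key=scores.get)
for label, score in scores.items():
    print(f"score({label}) = {score}")
print("predicted class:", prediction)
```

With so few training samples the zero-frequency effect dominates the result; applying Laplace smoothing would be one way to soften it, but that is a separate modelling choice not made here.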

14. Given the following training data set collected during a drug efficacy study for CMV-buster.
The data shows gene expression measurements for three genes A, B, C as measured in
blood samples collected from people suffering from the Cytomegalovirus infection before
CMV-buster, and indicates whether each gene was under-expressed or over-expressed
compared to a control sample from healthy individuals. The last column indicates whether
the treatment was effective or not.

| Sample | A | B | C | Effective |
| --- | --- | --- | --- | --- |
| 1 | Under-expressed | Under-expressed | Over-expressed | No |
| 2 | Under-expressed | Over-expressed | Under-expressed | No |
| 3 | Over-expressed | Over-expressed | Under-expressed | No |
| 4 | Under-expressed | Under-expressed | Over-expressed | Yes |
| 5 | Over-expressed | Over-expressed | Over-expressed | Yes |
| 6 | Over-expressed | Under-expressed | Under-expressed | Yes |
| 7 | Over-expressed | Over-expressed | Under-expressed | Yes |

a. How would naïve Bayes predict “Effective” given the input:
A=Under-expressed, B=Under-expressed, C=Over-expressed.
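One way the calculation for Question 14 might be set out is shown below. It is a sketch that assumes plain frequency estimates without smoothing and takes gene B in sample 6 to be Under-expressed; "Under" and "Over" abbreviate Under-expressed and Over-expressed.

$$
\begin{aligned}
P(\text{Yes})\,P(A{=}\text{Under}\mid\text{Yes})\,P(B{=}\text{Under}\mid\text{Yes})\,P(C{=}\text{Over}\mid\text{Yes})
  &= \tfrac{4}{7}\cdot\tfrac{1}{4}\cdot\tfrac{2}{4}\cdot\tfrac{2}{4} = \tfrac{1}{28} \approx 0.0357 \\
P(\text{No})\,P(A{=}\text{Under}\mid\text{No})\,P(B{=}\text{Under}\mid\text{No})\,P(C{=}\text{Over}\mid\text{No})
  &= \tfrac{3}{7}\cdot\tfrac{2}{3}\cdot\tfrac{1}{3}\cdot\tfrac{1}{3} = \tfrac{2}{63} \approx 0.0317
\end{aligned}
$$

Under these assumptions the score for Yes is slightly larger, so the prediction would be Effective = Yes; after normalising, the two scores correspond to roughly 9/17 against 8/17, which is a narrow margin.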