Bioinformatics Coursework:

austrianceilBiotechnology

Oct 1, 2013

Answer Hints to Bioinformatics Tutorial:

Differential Gene Expression Analysis


1.

Book Question, examples provided in lecture notes!


2.

     N       D       FC = (log(D) - log(N))/log(2)
A    128     256      1
B    256     128     -1
C    1024    512     -1
D    1000    1000     0
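The fold-change column above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `log2_fold_change` and the `rows` dictionary are illustrative names, not part of the tutorial sheet; the N and D values are the ones listed in the table.

```python
import math


def log2_fold_change(normal, diseased):
    """Return FC = (log(D) - log(N)) / log(2), i.e. log2(D / N)."""
    return (math.log(diseased) - math.log(normal)) / math.log(2)


# (N, D) pairs copied from the table in question 2
rows = {"A": (128, 256), "B": (256, 128), "C": (1024, 512), "D": (1000, 1000)}

for name, (n, d) in rows.items():
    # Rounding only hides floating-point noise in the last digits
    print(name, round(log2_fold_change(n, d), 6))
```

A doubling gives +1, a halving gives -1, and equal values give 0, whatever the absolute expression level.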




3.

T-test

a.

When would you use a t-test as opposed to a z-test?

Use the t-test for small data sets, where the population standard deviation has to be estimated from the sample, and the z-test for larger data sets.

The z-test assumes that the test statistic follows a normal distribution.

The t-test assumes that the data set (sample) was drawn from a normal distribution, but because we only have a small sample, the test statistic calculated from it follows a t-distribution rather than a normal distribution.


b.

What is meant by paired and unpaired experiments? How do they affect the calculation of a t-test?

Paired experiments involve measurements from the same individuals (or very similar individuals, e.g. twins) under different conditions. In such a case you can compare the two measurements of each individual directly, i.e. work with the per-individual differences.

In unpaired experiments this assumption does not hold (the individuals are different and you cannot relate individual measurements), so you have to compare the averages across the two data sets.

The formulae for paired/unpaired are given in the lecture notes.
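The lecture-note formulae are not reproduced here, so the following is only a sketch that assumes the standard textbook forms: a paired t-value computed from the per-individual differences, and an unpaired t-value using a pooled variance. The function names and the measurements are made up for illustration.

```python
import math
from statistics import mean, stdev, variance


def paired_t(before, after):
    """Paired t-value and degrees of freedom, from per-individual differences."""
    diffs = [a - b for b, a in zip(before, after)]
    signal = mean(diffs)                           # average difference
    noise = stdev(diffs) / math.sqrt(len(diffs))   # standard error of that difference
    return signal / noise, len(diffs) - 1


def unpaired_t(group1, group2):
    """Unpaired (pooled-variance) t-value and degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1)
                  + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    signal = mean(group2) - mean(group1)           # difference between group averages
    noise = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return signal / noise, n1 + n2 - 2


# Made-up measurements for four individuals, for illustration only
normal = [10.0, 12.0, 14.0, 16.0]
diseased = [11.0, 14.0, 15.0, 18.0]

print(paired_t(normal, diseased))     # uses the four per-individual differences
print(unpaired_t(normal, diseased))   # uses only the two group averages and spreads
```

On these numbers the paired t-value is much larger than the unpaired one, because pairing removes the variation between individuals from the denominator.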


c.

What is meant by a two tail t-test? Right tail / Left tail t-tests?

In two tail tests you are only interested in whether the two populations are different (you don’t care if the change was positive or negative, so long as the measurements in one group are far enough in either direction from the mean value of the other group).

Right tail and Left tail tests are more specific: they test whether the measurements in one group are higher (right tail) or lower (left tail) than those in the other group.



d.

What is meant when we say that the t-value is a Signal-to-Noise ratio?

The Signal and Noise represent the two components of the t-value (Signal represents the numerator, Noise represents the denominator).

Signal is the average difference between both groups (High signal means the difference is high), and Noise is the fluctuation in that difference (Low noise means small fluctuations).

A large SNR means the differences are high and the fluctuation (noise) is low.


e.

What is the number of degrees of freedom for a paired t-test when each of the samples has 10 data elements?

10 - 1 = 9


f.

What is the number of degrees of freedom for two unpaired data sets, the first having 4 elements and the second having 6 elements?

4 + 6 - 2 = 8


4.

P-value

a.

What does a p-value of 1 mean?

Hard Luck! Your data are completely consistent with the null hypothesis, so you have no evidence at all against it (strictly speaking you fail to reject it, rather than prove it).

In case you were trying to check whether a particular value does not belong to a given population, you just discovered that this value coincides exactly with the mean for the population. In case of testing for differences between means of two samples, you have just found that there is no difference between their means. The total area under the curve in the two tails between this value and +/- infinity is 1.



b.

What does a p-value of 0.05 mean? Explain your answer graphically using a normal distribution.

Congratulations, you can reject the null hypothesis at the 5% significance level. A p-value of 0.05 means that, if the null hypothesis were true, the probability of getting a result at least this extreme would be only 5%.

In case you were trying to check whether a particular value does not belong to a given population, you just discovered that this value is very far from the mean of the population; in fact, the probability of seeing a value this far from the mean (or further) if it did belong to the population is only 5%.

In case of testing for differences between means of two samples, you have just found strong evidence that their means are different.
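The two p-value answers above can be checked numerically against a standard normal curve. This is a small sketch using only the Python standard library; the helper name `two_tailed_p` is illustrative, and the input is assumed to be a z-score (distance from the mean in standard deviations).

```python
from statistics import NormalDist


def two_tailed_p(z):
    """Area in both tails of the standard normal curve beyond |z|."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


print(two_tailed_p(0.0))    # value sits exactly on the mean: both tails together cover the whole curve
print(two_tailed_p(1.96))   # value about 1.96 standard deviations from the mean: tails cover roughly 5%
```

Graphically, a p-value of 0.05 corresponds to the two shaded tail regions beyond about +/-1.96 standard deviations, each holding about 2.5% of the area.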









c.

What is the difference between a normal distribution and a t-distribution?

The t-distribution is lower at the mean, and flatter, i.e. it takes longer to reach zero on both sides. For any value on the x-axis away from the mean, the tail area beyond that value (to the right of a positive value, or to the left of a negative value) is bigger for the t-distribution than it is for the normal distribution. Note that the t-distribution approximates to a normal distribution when the number of degrees of freedom is high (>30).










d.

What is meant by a critical t-value for a p=0.05, how does this value depend on the number of samples in an experiment?

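This is the value on the x-axis where the area under the curve to its right is 0.05 (for a one-tailed test; for a two-tailed test the area in each tail is 0.025).

For your experiment to have a p-value of 0.05 or less, the t-value you calculate must be at least as large as the critical t-value.

Both the t-value you calculate and the critical t-value change as the number of degrees of freedom changes. The critical t-value gets smaller as the number of degrees of freedom increases.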


e.

Using the p-value table at the bottom of the next page, find the critical t-value for a paired t-test (2 samples each having 4 elements) that provides 95% confidence that the two samples are different.

v = 4 - 1 = 3, and it is a two-tailed test. The critical t-value is the value for which the area under the curve in each tail is 0.025 (since the curve is symmetric and it is a two-tailed test).

In the new table below this is 3.182 (Please note this value was missing in the original tutorial sheet).

T        v    p
10.95    6    0.000034
10.95    3    0.001631
12.05    6    0.000020
12.05    3    0.001230
8.4      6    0.000155
8.4      3    0.003539
2.353    6    0.056825
3.182    3    0.025
2.353    3    0.100033
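For readers without a printed table, a critical t-value can be approximated numerically. The sketch below uses only the Python standard library: it integrates the t-distribution density with Simpson's rule and searches for the critical value by bisection. The function names, step count and search bounds are illustrative choices, not something taken from the tutorial, and the approach is only meant for moderate tail areas such as 0.025 or 0.05.

```python
import math


def t_pdf(x, v):
    """Density of Student's t-distribution with v degrees of freedom."""
    coeff = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return coeff * (1 + x * x / v) ** (-(v + 1) / 2)


def right_tail_area(t, v, steps=2000):
    """Area under the t-curve to the right of t >= 0 (Simpson's rule on [0, t])."""
    if t == 0:
        return 0.5
    h = t / steps
    total = t_pdf(0, v) + t_pdf(t, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    # Half of the area lies right of 0; subtract the part between 0 and t
    return 0.5 - total * h / 3


def critical_t(tail_area, v, lo=0.0, hi=100.0):
    """Bisection search for the t whose right-tail area equals tail_area."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if right_tail_area(mid, v) > tail_area:
            lo = mid   # tail still too large: move right
        else:
            hi = mid   # tail too small: move left
    return (lo + hi) / 2


print(round(critical_t(0.025, 3), 3))  # two-tailed p = 0.05, v = 3; close to 3.182
print(round(critical_t(0.05, 3), 3))   # one-tailed p = 0.05, v = 3; close to 2.353
```

The same `right_tail_area` helper can be doubled to get a two-tailed p-value for a given t-value and v.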

5.

Volcano Plot

a.

Explain the volcano plot method for assessing the effect and significance over a large number of genes. Why is it useful?

You are trying to compare a very large number of fold changes to quickly assess which genes have an effect that is both high and significant. You use a scatter plot in which each point represents a gene. The co-ordinates represent the magnitude of the effect for that gene and its significance.


b.

How are effect and significance calculated in the volcano plot?

Effect is calculated as the difference between the two population means. Significance is calculated from the p-value of a t-test.


c.

What is the numerical interpretation of the X-axis in a volcano plot?

This represents the average fold change on a log2 scale. A value of 0 means no change, a value of +1 means the effect in the gene is doubled (-1 means the effect is halved), a value of +2 means the effect is 4 times, etc.



d.

What is the numerical interpretation of the y-axis in the volcano plot?

This represents the number of decimal places in the p-value calculated (usually plotted as -log10 of the p-value). The higher you are on the y-axis, the lower the p-value (and hence the higher your confidence that the effect is true and not just by chance).
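The two axes described above can be computed per gene as in the sketch below. It assumes the common convention of log2 of the ratio of group means on the x-axis and -log10(p) on the y-axis; the lecture notes may define the effect slightly differently (for example as a difference of means of log-transformed values). The function name, expression values and p-value are hypothetical.

```python
import math
from statistics import mean


def volcano_point(normal, diseased, p_value):
    """Return (x, y) = (log2 fold change of the group means, -log10 of the p-value)."""
    effect = math.log2(mean(diseased) / mean(normal))
    significance = -math.log10(p_value)
    return effect, significance


# Hypothetical expression values and p-value for a single gene
x, y = volcano_point([100, 110, 90, 100], [200, 210, 190, 200], 0.001)
print(round(x, 3), round(y, 3))  # expression doubled -> x near 1; p = 0.001 -> y near 3
```

Genes of interest end up in the upper left and upper right corners of the plot: large effect in either direction and a small p-value.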


6.

The table below shows gene expression values for a number of genes. Each gene is measured for the same type of tissue cell, in normal state in four samples (N1..N4) and in diseased state in another set of four samples (D1..D4).


























a.

Consider Gene V. Without going through lengthy calculations, is there any change between both states? What do you expect the p-value to be?

For Gene V, all measurements are 100 in both groups. So there is no change between the expression values for the normal and diseased states. The p-value is going to be 1.


b.

Calculate t-value for Genes A, E, N, Q, S, V

I used Excel.

7.

Using the table of p-values below:

a.

Calculate the effect and significance for genes A, E, N, Q, S, V and plot them on a scatter plot
(Volcano plot)

The calculations from the above table are plotted in the figure below. Each gene is represented by a labelled square on the plot.


b.

Compare the effect and significance between genes A and S

Directly from the plot: S has a higher effect but lower significance than A.



[Figure: volcano plot; point labels E,V, N, Q, A, S]