
Oct 2, 2013 (3 years and 8 months ago)


Introduction to Bioinformatics


6. Statistical Analysis of Gene Expression Matrices II

Course 341

Department of Computing

Imperial College, London


Moustafa Ghanem

Lecture Overview


Motivation


Get a feel for t-values and how they change


Volcano plots


Visual method for differential gene expression analysis


Meaning of x and y axes


Interpretation of results





Interpretation of t-test


The higher the absolute t-value, the lower the p-value, and the more confident you are that the observed difference is not due to chance

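Calculating t-test (t statistic)


First calculate the t statistic value and then calculate the p-value.


For the paired t-test, t is calculated using the following formula:

$$ t = \frac{\bar{d}}{\sigma(d) / \sqrt{n}} $$

Where $\bar{d}$ is the mean of the differences $d_i = x_i - y_i$ between the paired values, calculated by

$$ \bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i $$

$\sigma(d)$ is the standard deviation of the differences, and n is the number of pairs being tested.


For an unpaired (independent group) t-test, the following formula is used:

$$ t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{\sigma(x)^2}{n(x)} + \dfrac{\sigma(y)^2}{n(y)}}} $$

Where $\bar{x}$ is the mean of x, $\sigma(x)$ is the standard deviation of x and $n(x)$ is the number of elements in x.

Remember these formulae!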
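The two t statistics can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function names `paired_t` and `unpaired_t` and the expression values are made up for the example, and it assumes the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes.

```python
import math
from statistics import mean, stdev


def paired_t(x, y):
    """Paired t statistic: mean of differences over its standard error."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    # stdev() is the sample standard deviation (assumption: n - 1 divisor).
    return mean(d) / (stdev(d) / math.sqrt(n))


def unpaired_t(x, y):
    """Unpaired t statistic using a separate variance for each group."""
    se = math.sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    return (mean(x) - mean(y)) / se


# Hypothetical expression values for one gene (not from the lecture data).
wt = [110.0, 120.0, 130.0, 140.0]
ko = [100.0, 108.0, 122.0, 130.0]

t_paired = paired_t(wt, ko)
t_unpaired = unpaired_t(wt, ko)
print(f"paired t = {t_paired:.3f}, unpaired t = {t_unpaired:.3f}")
```

With these made-up values the paired statistic is much larger than the unpaired one, because pairing removes the sample-to-sample variation that both groups share.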

Calculating and Interpreting t-values

Consider the following examples, and assume a paired experiment:


High t-value


Take Gene A, assuming paired test:


For either type of test:


Average difference = 100, S.D. = 0


t-value is effectively infinite (the denominator is zero)


p is extremely low



Consider Gene M for a paired experiment





















Average difference = 0


t-value is zero. What does this mean?




Consider Gene T for a paired experiment

















t-value = Signal/Noise ratio



Graphical Interpretation of t-test (Paired)

$$ t = \frac{\text{Mean of differences}}{\text{S.D. of differences} / \sqrt{n}} $$

[Figure: three panels, each plotting Value against Sample ID and the differences d = Diff (d1 to d4) against Sample ID, with the mean difference d_avg marked.]

Case 1: Low variation around mean of differences

Case 2: Moderate variation around mean of differences

Case 3: Large variation around mean of differences
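The three cases can be illustrated numerically. The sketch below uses made-up differences that all share the same mean, so the signal stays fixed and only the noise changes; the helper name `signal_noise_t` is hypothetical, and the sample standard deviation is assumed.

```python
import math
from statistics import mean, stdev


def signal_noise_t(d):
    """Return (signal, noise, t) for a list of paired differences."""
    signal = mean(d)                        # mean of differences
    noise = stdev(d) / math.sqrt(len(d))    # S.D. of differences / sqrt(n)
    return signal, noise, signal / noise


# Hypothetical differences, all with the same mean of 10.
cases = {
    "low variation": [9.0, 10.0, 11.0, 10.0],
    "moderate variation": [6.0, 10.0, 14.0, 10.0],
    "large variation": [-10.0, 10.0, 30.0, 10.0],
}

t_values = {}
for name, d in cases.items():
    signal, noise, t = signal_noise_t(d)
    t_values[name] = t
    print(f"{name}: signal={signal:.1f}, noise={noise:.2f}, t={t:.2f}")
```

The mean difference is identical in all three cases, so the drop in t comes entirely from the growing spread of the differences around their mean.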


Back to our problem

5000 rows represent genes

Columns represent samples

4 wild-type (WT) samples (blue)

4 knock-out (KO) samples (red)

Hypothesis Testing


Uses hypothesis testing methodology.



For each gene (>5,000):


Pose the null hypothesis (Ho) that the gene is not affected


Pose the alternative hypothesis (Ha) that the gene is affected


Use statistical techniques to calculate the p-value: the probability of observing a difference at least this large if Ho were true


If p-value < some critical value, reject Ho and accept Ha
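The decision step can be sketched as follows. This is only an illustration under stated assumptions: the two-sided p-value is obtained by numerically integrating the Student's t density with Simpson's rule (a stand-in for a statistics library), the function names are hypothetical, and the critical value of 0.05 is an assumed default rather than one given in the lecture.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3  # probability mass between -|t| and |t|
    return max(0.0, 1.0 - central_area)


def reject_null(t, df, alpha=0.05):
    """Reject Ho when the p-value falls below the critical value alpha."""
    return two_sided_p(t, df) < alpha


# 4 paired observations give df = n - 1 = 3; the t-values are hypothetical.
for t in (12.25, 1.07):
    p = two_sided_p(t, df=3)
    print(f"t={t}: p={p:.4f}, reject Ho: {reject_null(t, df=3)}")
```

Subtracting the central area from 1 loses precision for extremely small p-values, so this approach is only suitable for getting a feel for the numbers.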



The issues:


Large number of genes (or experiments)


Need a quick way to pick out significant genes that have high fold change


Also need to sort genes by fold change and significance





Volcano Plots

For each gene calculate the significance of the change (t-test, p-value)

For each gene compare the value of the effect between population WT vs. KO (fold change)

Identify genes with high effect and high significance




Volcano Plot

Volcano plots are a graphical means for visualising the results of large numbers of t-tests, allowing us to plot both the effect and the significance of each test in an easy-to-interpret way

Volcano plots


In a volcano plot:


X-axis represents effect measured as fold change:

$$ \text{Effect} = \log_2(\text{WT}) - \log_2(\text{KO}) = \log_2(\text{WT} / \text{KO}) $$

If WT = KO, Effect (Fold Change) = 0. If WT = 2 KO, Effect (Fold Change) = 1

...
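The x-axis formula can be checked with a short sketch. The function name `log2_fold_change` is hypothetical, and the inputs are assumed to be positive expression levels for one gene (for example the mean over the WT samples and the mean over the KO samples).

```python
import math


def log2_fold_change(wt, ko):
    """Effect on the x-axis: log2(WT) - log2(KO) = log2(WT / KO)."""
    if wt <= 0 or ko <= 0:
        raise ValueError("expression values must be positive to take logs")
    return math.log2(wt) - math.log2(ko)


equal = log2_fold_change(100.0, 100.0)    # WT = KO
doubled = log2_fold_change(200.0, 100.0)  # WT = 2 KO
halved = log2_fold_change(50.0, 100.0)    # WT = KO / 2
print(equal, doubled, halved)
```

On the log2 scale a doubling and a halving sit the same distance from zero, on opposite sides, which is what gives the volcano plot its symmetric shape.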

Numerical Interpretation (Effect)



Using log2 for the X axis:

Effect has doubled: $2^{1}$ (2 raised to the power of 1), a two-fold change

Effect has halved: $2^{-1}$ (2 raised to the power of -1)

Volcano plots



Calculate Significance as $-\log_{10}(\text{p-value})$

If p = 0.1, $-\log_{10}(0.1) = 1$ (1 decimal place)

If p = 0.01, $-\log_{10}(0.01) = 2$ (2 decimal places)

...


In a volcano plot:


y-axis represents $-\log_{10}$ of the p-value, roughly the number of decimal places in the p-value


(remember: with a p-value of 0.0001, you are more confident than with a p-value of 0.01)


This is just a trick so that higher values on the graph are more important
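The y-axis transformation is a one-liner; the sketch below wraps it in a hypothetical helper named `significance` so that the input range can be checked.

```python
import math


def significance(p_value):
    """Significance on the y-axis: -log10(p)."""
    if not 0 < p_value <= 1:
        raise ValueError("p-value must be in (0, 1]")
    return -math.log10(p_value)


for p in (0.1, 0.01, 0.0001):
    print(f"p={p}: -log10(p)={significance(p):.1f}")
```

Smaller p-values map to larger values, so the most significant genes end up at the top of the plot.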

Numerical Interpretation (Significance)




Using log10 for the Y axis:

p < 0.1 (1 decimal place)

p < 0.01 (2 decimal places)

Visualise the Result: Volcano Plot


Effect vs. Significance


Selections of items that have both a large effect and are highly significant can be identified easily.
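Selecting the interesting region of the plot amounts to applying a cutoff on each axis. The sketch below is illustrative only: the gene names, expression means, p-values and both cutoffs are made up, and no plotting is done, just the computation of each gene's coordinates and the selection.

```python
import math

# Hypothetical per-gene results: (gene, mean WT, mean KO, p-value).
results = [
    ("gene_a", 400.0, 100.0, 0.0001),
    ("gene_b", 100.0, 105.0, 0.6),
    ("gene_c", 50.0, 200.0, 0.002),
    ("gene_d", 300.0, 100.0, 0.2),
]

EFFECT_CUTOFF = 1.0        # assumed: at least a two-fold change either way
SIGNIFICANCE_CUTOFF = 2.0  # assumed: p < 0.01


def volcano_point(wt, ko, p):
    """Return (x, y) = (log2 fold change, -log10 p) for one gene."""
    return math.log2(wt / ko), -math.log10(p)


selected = []
for gene, wt, ko, p in results:
    x, y = volcano_point(wt, ko, p)
    # Keep genes with both a large effect (+ve or -ve) and high significance.
    if abs(x) >= EFFECT_CUTOFF and y > SIGNIFICANCE_CUTOFF:
        selected.append((gene, x, y))

# Sort by significance first, then by size of effect.
selected.sort(key=lambda item: (item[2], abs(item[1])), reverse=True)
selected_names = [gene for gene, _, _ in selected]
print(selected_names)
```

Taking the absolute value of the effect keeps both strongly up-regulated and strongly down-regulated genes, matching the two upper corners of the plot.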


Choosing log scales is a matter of convenience

Effect can be either +ve or -ve

[Figure: volcano plot. X-axis: effect, from -ve effect to +ve effect. Y-axis: significance, from low to high. Regions labelled "High Effect & Significance" and "Boring stuff".]

Summary


t-test is good for small samples (in our case 4 paired observations)


t distribution approximates to normal distribution when degrees of freedom > 30


Remember formulae for paired/unpaired



Volcano plot is a simple method for visualising large sets of such observations


Remember formula for x-axis


Remember formula for y-axis