# Introduction to Bioinformatics 1. Course Overview - Department of ...


Oct 2, 2013 (4 years and 7 months ago)


Introduction to Bioinformatics

6. Statistical Analysis of Gene Expression Matrices II

Course 341

Department of Computing

Imperial College, London

Moustafa Ghanem

Lecture Overview

Motivation

Get a feel for t-values and how they change

Volcano plots

Visual method for differential gene expression analysis

Meaning of x and y axes

Interpretation of results

Interpretation of t-test

The higher the t-value (in absolute terms), the lower the p-value, and the more confident you are

Calculating t-test (t statistic)

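First calculate the t statistic value and then calculate the p-value.

For the paired t-test, t is calculated using the following formula:

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

Here $\bar{d}$ is the mean of the paired differences, $s_d$ is the standard deviation of the differences, and n is the number of pairs being tested.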
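As a minimal sketch of the paired calculation (mean of the differences divided by their standard deviation over sqrt(n)), the function below uses only the Python standard library. The function name `paired_t`, the sample values, and the use of the sample standard deviation (n - 1 in the denominator) are assumptions made for illustration; they are not taken from the lecture.

```python
import math
import statistics

def paired_t(first, second):
    """Paired t statistic: mean difference / (S.D. of differences / sqrt(n))."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)                      # number of pairs being tested
    mean_d = statistics.mean(diffs)     # mean of the paired differences
    sd_d = statistics.stdev(diffs)      # sample S.D. (assumption: n - 1 denominator)
    return mean_d / (sd_d / math.sqrt(n))

# Hypothetical expression values for one gene measured in 4 pairs of samples
wild_type = [10.0, 12.0, 11.0, 13.0]
knock_out = [12.0, 15.0, 13.0, 16.0]

t_paired = paired_t(wild_type, knock_out)
print(f"paired t = {t_paired:.3f}")
```

Note that the function fails with a division by zero when every difference is identical (S.D. = 0), which is the limiting case where t grows without bound.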

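For an unpaired (independent group) t-test, the following formula is used:

$$t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{\sigma(x)^2}{n(x)} + \dfrac{\sigma(y)^2}{n(y)}}}$$

Where $\bar{x}$ is the mean of x, $\sigma(x)$ is the standard deviation of x and $n(x)$ is the number of elements in x.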
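The unpaired statistic can be sketched in the same way. The version below assumes the common unequal-variance form, in which the difference of the group means is divided by sqrt(σ(x)²/n(x) + σ(y)²/n(y)); the function name `unpaired_t` and the data are hypothetical.

```python
import math
import statistics

def unpaired_t(x, y):
    """Unpaired t statistic, assuming the unequal-variance form of the formula."""
    mean_diff = statistics.mean(x) - statistics.mean(y)
    # statistics.variance is the sample variance, i.e. sigma(x) squared
    std_error = math.sqrt(statistics.variance(x) / len(x)
                          + statistics.variance(y) / len(y))
    return mean_diff / std_error

# Hypothetical expression values for one gene in two independent groups
group_x = [5.0, 7.0, 9.0, 11.0]
group_y = [1.0, 3.0, 5.0, 7.0]

t_unpaired = unpaired_t(group_x, group_y)
print(f"unpaired t = {t_unpaired:.3f}")
```

Unlike the paired version, the two groups do not need the same number of elements, because each group contributes its own variance and size.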

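For the paired test, the mean difference $\bar{d}$ is calculated by

$$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$$

where $d_i$ is the difference between the two values in pair i.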

Remember these formulae!

Calculating and Interpreting t-values

Consider the following examples, and assume a paired experiment:

High t-value

Take Gene A, assuming a paired test:

For either type of test

Average difference = 100, S.D. = 0

The t-value tends to infinity

p is extremely low

Consider Gene M for a paired experiment

Average difference = 0

The t-value is zero. What does this mean?

Consider Gene T for a paired experiment

t-value = Signal/Noise ratio

Graphical Interpretation of t-test (Paired)

$$t = \frac{\text{Mean of differences}}{\text{S.D. of differences} / \sqrt{n}}$$

[Figure: for each case, a plot of Value against Sample ID and a plot of the differences d = Diff (d1 to d4) against Sample ID, with the average difference d_avg marked]

Case 1: Low variation around the mean of differences

Case 2: Moderate variation around the mean of differences

Case 3: Large variation around the mean of differences
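The three cases can also be illustrated numerically. In the sketch below, all three hypothetical sets of differences have the same mean (10), so only the variation around that mean changes; the data and the helper name `t_from_differences` are invented for illustration.

```python
import math
import statistics

def t_from_differences(diffs):
    """t = mean of differences / (S.D. of differences / sqrt(n))."""
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical differences d1..d4, each set with a mean of 10
low_variation = [9.9, 10.1, 9.9, 10.1]
moderate_variation = [8.0, 12.0, 8.0, 12.0]
large_variation = [0.0, 20.0, 0.0, 20.0]

t_low = t_from_differences(low_variation)
t_moderate = t_from_differences(moderate_variation)
t_large = t_from_differences(large_variation)

for label, t in [("low", t_low), ("moderate", t_moderate), ("large", t_large)]:
    print(f"{label:>8} variation: t = {t:.2f}")
```

The signal (the mean difference) is identical in each case, so the t-value falls purely because the noise (the S.D. of the differences) rises.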

Back to our problem

5000 rows represent genes

Columns represent samples

4 Wild Type (WT) samples (Blue)

4 KO samples (Red)

Hypothesis Testing

Uses hypothesis testing methodology.

For each Gene (>5,000)

Pose Null Hypothesis (Ho) that gene is not affected

Pose Alternative Hypothesis (Ha) that gene is affected

Use statistical techniques to calculate the p-value: the probability of seeing a change at least this large if Ho were true

If p-value < some critical value, reject Ho and accept Ha
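To make the last two steps concrete, the sketch below turns a t-value into a two-sided p-value by numerically integrating the Student's t density with Simpson's rule, then compares it with a critical value. The function names, the example t-value, the choice of a two-sided test, and the critical value of 0.05 are all assumptions; in practice a statistics library would normally be used instead of hand-rolled integration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: probability of |T| >= |t| when Ho is true."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps                   # Simpson's rule on [0, |t|]; steps is even
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3        # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central_half)

critical_value = 0.05                   # hypothetical threshold
t_value = 8.66                          # hypothetical paired t-value
degrees_of_freedom = 3                  # 4 pairs - 1

p_value = two_sided_p(t_value, degrees_of_freedom)
reject_null = p_value < critical_value
print(f"p = {p_value:.4f}, reject Ho: {reject_null}")
```

With thousands of genes the same calculation is simply repeated once per row of the expression matrix.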

The issues:

Large number of genes (or experiments)

Need quick way to filter out significant genes that have high fold change

Need also to sort genes by fold change and significance

Volcano Plots

For each gene calculate the significance of the change (t-test, p-value)

For each gene compare the value of the effect between population WT vs. KO (fold change)

Identify genes with high effect and high significance

Volcano Plot

Volcano plots are a graphical means of visualising the results of large numbers of t-tests, allowing us to plot both the effect and the significance of each test in an easy-to-interpret way

Volcano plots

In a volcano plot:

The x-axis represents the effect, measured as fold change:

$$\text{Effect} = \log_2(\text{WT}) - \log_2(\text{KO}) = \log_2(\text{WT} / \text{KO})$$

If WT = KO, Effect (fold change) = 0. If WT = 2 KO, Effect (fold change) = 1

...
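The x-axis value can be computed directly from the two expression levels. This is a minimal sketch that assumes positive, untransformed expression values; the function name `log2_fold_change` is illustrative.

```python
import math

def log2_fold_change(wt, ko):
    """Effect = log2(WT) - log2(KO) = log2(WT / KO)."""
    return math.log2(wt) - math.log2(ko)

print(log2_fold_change(8.0, 8.0))   # WT = KO
print(log2_fold_change(8.0, 4.0))   # WT = 2 KO
print(log2_fold_change(4.0, 8.0))   # WT = KO / 2
```

Working on the log scale makes doubling and halving symmetric around zero, which is why the plot can show positive and negative effects on the same axis.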

Numerical Interpretation (Effect)

Using $\log_2$ for the x-axis:

Effect has doubled: $2^{1}$ (2 raised to the power of 1), a two-fold change

Effect has halved: $2^{-1} = 0.5$ (2 raised to the power of -1)

Volcano plots

Calculate significance as

$$\text{Significance} = -\log_{10}(p)$$

If p = 0.1, $-\log_{10}(0.1) = 1$ (1 decimal place)

If p = 0.01, $-\log_{10}(0.01) = 2$ (2 decimal places)

...

In a volcano plot:

The y-axis represents the number of zeroes in the p-value (remember, with a p-value of 0.0001 you are more confident than with a p-value of 0.01)

This is just a trick so that higher values on the graph are more important
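The y-axis transformation is a one-liner. The sketch below assumes p-values strictly greater than zero; the function name `significance` is illustrative.

```python
import math

def significance(p_value):
    """y-axis value of a volcano plot: -log10 of the p-value."""
    return -math.log10(p_value)

for p in (0.1, 0.01, 0.0001):
    print(f"p = {p}: significance = {significance(p):.1f}")
```

Smaller p-values map to larger y-values, so the most significant genes end up at the top of the plot.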

Numerical Interpretation (Significance)

Using $\log_{10}$ for the y-axis:

p < 0.1 (1 decimal place)

p < 0.01 (2 decimal places)

Visualise the Result: Volcano Plot

Effect vs. Significance

Selections of items that have both a large effect and are highly significant can be identified easily.

Choosing log scales is a matter of convenience

Effect can be either +ve or -ve

[Figure: volcano plot of significance (low to high) against effect (-ve to +ve). Regions labelled: High Effect & Significance; Boring stuff.]
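Selecting the interesting region of a volcano plot amounts to thresholding both axes. The sketch below computes the plot coordinates for a few hypothetical genes and keeps those with a large effect in either direction and a high significance. The gene names, the expression values, the p-values, and both thresholds (at least a two-fold change and p <= 0.01) are assumptions chosen for illustration.

```python
import math

def volcano_points(genes):
    """Map gene -> (effect, significance) from gene -> (mean WT, mean KO, p-value)."""
    return {
        name: (math.log2(wt / ko), -math.log10(p))
        for name, (wt, ko, p) in genes.items()
    }

def select_interesting(points, min_effect=1.0, min_significance=2.0):
    """Genes with a large +ve or -ve effect AND a high significance."""
    return sorted(
        name
        for name, (effect, signif) in points.items()
        if abs(effect) >= min_effect and signif >= min_significance
    )

# Hypothetical summary values: (mean WT, mean KO, p-value)
genes = {
    "gene_up": (400.0, 100.0, 0.001),
    "gene_down": (50.0, 200.0, 0.0005),
    "gene_noisy": (300.0, 100.0, 0.3),
    "gene_flat": (100.0, 105.0, 0.001),
}

points = volcano_points(genes)
selected = select_interesting(points)
print(selected)
```

Here `gene_noisy` is dropped because its change is not significant and `gene_flat` is dropped because its change is too small, which is exactly the filtering the plot makes visible at a glance.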

Summary

The t-test is good for small samples (in our case 4 paired observations)

The t distribution approximates the normal distribution when degrees of freedom > 30

Remember the formulae for paired/unpaired tests

The volcano plot is a simple method for visualising large sets of such observations

Remember the formula for the x-axis

Remember the formula for the y-axis