Introduction to Bioinformatics
5. Statistical Analysis of Gene Expression Matrices I
Course 341
Department of Computing
Imperial College, London
Moustafa Ghanem
Lecture Overview
Motivation
– Identifying differentially expressed genes
– Calculating effect: fold ratio
– Calculating significance: p-values
Statistical Analysis
– Paired and unpaired experiments
– Need for significance testing
– Hypothesis testing
– t-tests and p-values
t-tests
– Paired and unpaired t-tests
– Formulae for t-test
– Single-tail vs. two-tail t-tests
– Looking up p-values
Motivation
Large-scale Differential Gene Expression Analysis
Consider a microarray experiment
– that measures gene expression in two groups of rat tissue (>5000 genes in each experiment).
– The rat tissues come from two groups:
  WT: Wild-Type rat tissue,
  KO: Knock-Out Treatment rat tissue
– Gene expression for each group is measured under similar conditions
– Question: Which genes are affected by the treatment? How significant is the effect? How big is the effect?
Calculating Expression Ratios
In Differential Gene Expression Analysis, we are interested in identifying genes with different expression across two states, e.g.:
– Tumour cell lines vs. normal cell lines
– Treated tissue vs. diseased tissue
– Different tissues, same organism
– Same tissue, different organisms
– Same tissue, same organism
– Time course experiments
We can quantify the difference (effect) by taking a ratio, i.e. for gene k, the ratio of its expression in state a to its expression in state b:
$R_k = E_k(a) / E_k(b)$
– This provides a relative value of change (e.g. expression has doubled)
– If the expression level has not changed, the ratio is 1
Fold change (Fold ratio)
Ratios are troublesome since
– Up-regulated and down-regulated genes are treated differently
  Genes up-regulated by a factor of 2 have a ratio of 2
  Genes down-regulated by the same factor (2) have a ratio of 0.5
– As a result
  down-regulated genes are compressed between 1 and 0
  up-regulated genes expand between 1 and infinity
Taking the logarithm of the ratio to base 2 rectifies the problem (a ratio of 2 becomes +1 and a ratio of 0.5 becomes -1); this is typically known as the fold change
A gene is up-regulated in state 2 compared to state 1 if it has a higher value in state 2
A gene is down-regulated in state 2 compared to state 1 if it has a lower value in state 2
Examples of fold change
Gene ID   Expression in state 1   Expression in state 2   Ratio   Fold Change
A         100                     50                      2       1
B         10                      5                       2       1
C         5                       10                      0.5     -1
D         200                     1                       200     7.64
E         10                      10                      1       0
You can calculate fold change between pairs of expression values:
e.g. between state 1 and state 2 for gene A
or between the mean values of all measurements for a gene in the WT/KO experiments
• mean(WT1..WT4) vs. mean(KO1..KO4)
In this table the ratio is state 1 / state 2, so a positive fold change means lower expression in state 2:
A, B and D are down-regulated
C is up-regulated
E has no change
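The table values can be reproduced with a few lines of Python. This is a minimal sketch, assuming the table's convention of dividing state 1 by state 2; the function name fold_change is chosen here for illustration and is not part of the course material.

```python
import math

# Expression values (state 1, state 2) copied from the example table.
expression = {
    "A": (100, 50),
    "B": (10, 5),
    "C": (5, 10),
    "D": (200, 1),
    "E": (10, 10),
}

def fold_change(state1, state2):
    """Return (ratio, log2 of ratio), with state 1 as the numerator."""
    ratio = state1 / state2
    return ratio, math.log2(ratio)

for gene, (s1, s2) in expression.items():
    ratio, log_ratio = fold_change(s1, s2)
    print(f"{gene}: ratio={ratio:g}, fold change={log_ratio:.2f}")
```

With this convention, doubling and halving give fold changes of the same size and opposite sign, which is the symmetry the raw ratio lacks.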
Statistics: Back to our problem
5000 rows represent genes
Columns represent samples
4 Wild Type samples (Blue)
4 KO samples (Red)
Statistics: Significance of Fold Change
For our problem we can calculate an average fold ratio for each gene (each row)
This will give us an average effect value for each gene
– 2, 1.7, 10, 100, etc.
Question: which of these values are significant?
– Can use a threshold, but what threshold value should we set?
– Use statistical techniques based on the number of members in each group, type of measurements, etc. -> significance testing.
Statistics: 5000 separate statistical problems
How do we think about this problem?
Effectively:
– 5000 separate experiments where each experiment measures the expression of one gene in two groups of 4 individuals
– For each experiment (gene), we want to establish if there is a statistically significant difference between the reported values in each group
– We then want to identify those genes (across the 5000 genes) that have a significant change
Each row in our table is similar to a traditional statistical analysis problem
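The "one small problem per row" view can be sketched as a loop over genes. This is an illustrative sketch only: the first row uses the unpaired values quoted for gene 96608_at in these slides, the second row is hypothetical and exists only to show the loop, and mean_fold_change is a name invented for this example.

```python
import math
from statistics import mean

# Rows are genes, columns are samples. The first row is gene 96608_at from the
# slides; the second row is hypothetical, added only to show the loop over rows.
wt = {
    "96608_at": [100, 100, 200, 300],
    "hypothetical_gene": [50, 60, 55, 45],
}
ko = {
    "96608_at": [150, 300, 100, 300],
    "hypothetical_gene": [200, 220, 210, 190],
}

def mean_fold_change(wt_values, ko_values):
    """log2 of mean(WT) / mean(KO), i.e. state 1 over state 2."""
    return math.log2(mean(wt_values) / mean(ko_values))

# One independent calculation per gene (per row of the expression matrix).
fold_changes = {gene: mean_fold_change(wt[gene], ko[gene]) for gene in wt}
for gene, fc in fold_changes.items():
    print(f"{gene}: fold change = {fc:.2f}")
```

The same loop structure applies when the per-row calculation is a significance test rather than a fold change.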
Statistics: Unpaired statistical experiments
Overall setting: 2 groups of 4 individuals each
– Group 1: Imperial students
– Group 2: UCL students
Experiment 1:
– We measure the height of all students
– We want to establish if members of one group are consistently (or on average) taller than members of the other, and if the measured difference is significant
Experiment 2:
– We measure the weight of all students
– We want to establish if members of one group are consistently (or on average) heavier than the other, and if the measured difference is significant
Experiment 3:
– ………
[Figure: Group 1 members under one condition, Group 2 members under the other]
Statistics: Unpaired statistical experiments
In unpaired experiments, you typically have two groups of people that are not related to one another, and you measure some property for each member of each group
e.g. to test whether a new drug is effective, you divide similar patients into two groups:
– One group takes the drug
– Another group takes a placebo
– You measure (quantify) the effect in both groups some time later
You want to establish whether there is a significant difference between the two groups at that later point
The WT/KO example is an unpaired experiment if the rats in the experiments are different!
Statistics: Unpaired statistical experiments
The WT/KO example is an unpaired experiment if the rats in the experiments are different!
Experiment for WT rats for gene 96608_at
Rat #   WT gene expression
WT1     100
WT2     100
WT3     200
WT4     300
Experiment for KO rats for gene 96608_at
Rat #   KO gene expression
KO1     150
KO2     300
KO3     100
KO4     300
Statistics: Unpaired statistical experiments
How do we address the problem?
Compare the two sets of results (alternatively, calculate the mean for each group and compare the means)
Graphically:
– Scatter plots
– Box plots, etc.
Statistically:
– Use unpaired t-test
Are these two series significantly different?
Statistics: Paired statistical experiments
Overall setting: 1 group of 4 individuals
– Group 1: Imperial students
– We make measurements for each student in two situations
Experiment 1:
– We measure the height of all students before the Bioinformatics course and after the Bioinformatics course
– We want to establish if the Bioinformatics course consistently (or on average) affects students’ heights
Experiment 2:
– We measure the weight of all students before the Bioinformatics course and after the Bioinformatics course
– We want to establish if the Bioinformatics course consistently (or on average) affects students’ weights
Experiment 3:
– ………
[Figure: the same group members measured under Condition 1 and Condition 2]
Statistics: Paired statistical experiments
In paired experiments, you typically have one group of people and measure some property for each member before and after a particular event (so measurements come in pairs of before and after)
e.g. you want to test the effectiveness of a new cream for tanning
– You measure the tan in each individual before the cream is applied
– You measure the tan in each individual after the cream is applied
You want to establish whether there is a significant difference between measurements before and after applying the cream for the group as a whole
Statistics: Paired statistical experiments
The WT/KO example is a paired experiment if the rats in the experiments are the same!
Experiments for gene 96608_at
Rat #   WT gene expression   KO gene expression
Rat1    100                  200
Rat2    100                  300
Rat3    200                  400
Rat4    300                  500
Statistics: Paired statistical experiments
How do we address the problem?
Calculate the difference for each pair
Compare the differences to zero (alternatively, compare the average difference to zero)
Graphically:
– Scatter plot of differences
– Box plots, etc.
Statistically:
– Use paired t-test
Are differences close to zero?
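For the paired rat data quoted in these slides, the per-pair differences can be computed directly. A minimal sketch, assuming the difference is taken as KO minus WT (the slides do not fix the direction):

```python
from statistics import mean

# Paired measurements for gene 96608_at (Rat1..Rat4), copied from the slides.
wt = [100, 100, 200, 300]
ko = [200, 300, 400, 500]

# One difference per rat; the direction (KO - WT) is an assumption made here.
differences = [k - w for w, k in zip(wt, ko)]
mean_difference = mean(differences)

print(differences)       # per-rat differences
print(mean_difference)   # average difference, to be compared with zero
```

Every difference has the same sign here, which is the pattern a paired test is designed to pick up even when the two groups overlap in absolute values.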
Statistics: Significance testing
In both cases (paired and unpaired) you want to establish whether the difference is significant
Significance testing is a statistical term and refers to estimating (numerically) the probability of a measurement occurring by chance.
To do this, you need to review some basic statistics
– Normal distributions: mean, standard deviations, etc.
– Hypothesis testing
– t-distributions
– t-tests and p-values
Mean and standard deviation
Mean and standard deviation tell you the basic features of a distribution
mean = average value of all members of the group
$\mu = (x_1 + x_2 + x_3 + \dots + x_N) / N$
standard deviation = a measure of how much the values of individual members vary in relation to the mean
The normal distribution is symmetrical about the mean
68% of the normal distribution lies within 1 s.d. of the mean
[Figure: normal curve over x, with 68% of the distribution within 1 s.d. either side of the mean]
Note on s.d. calculation
Throughout the following slides and in the tutorials, I use the following formula for calculating standard deviation
$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$
Some people use the unbiased form below (for good reasons)
$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu)^2}$
Please use the simple form if you want the answers to add up at the end
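The two conventions can be compared numerically. This sketch assumes the simple form divides by N and the unbiased form divides by N - 1, which correspond to pstdev and stdev in Python's standard statistics module; the data are the WT values for gene 96608_at from the unpaired example.

```python
from statistics import mean, pstdev, stdev

wt = [100, 100, 200, 300]  # WT values for gene 96608_at from the slides

mu = mean(wt)
simple_sd = pstdev(wt)    # divides the sum of squared deviations by N
unbiased_sd = stdev(wt)   # divides the sum of squared deviations by N - 1

print(f"mean = {mu}")
print(f"simple s.d. = {simple_sd:.2f}")
print(f"unbiased s.d. = {unbiased_sd:.2f}")
```

With only four values the two forms differ noticeably, so mixing them within one calculation will give answers that do not add up.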
The Normal Distribution
Many continuous variables follow a normal distribution, and it plays a special role in the statistical tests we are interested in:
• The x-axis represents the values of a particular variable
• The y-axis represents the proportion of members of the population that have each value of the variable
• The area under the curve represents probability
  – i.e. the area under the curve between two values on the x-axis represents the probability of an individual having a value in that range
Normal Distribution and Confidence Intervals
Any normal distribution can be transformed to a standard distribution (mean 0, s.d. = 1) using a simple transform: $z = (x - \mu) / \sigma$
[Figure: standard normal curve with central area $1 - \alpha = 0.95$ between -1.96 and 1.96, and $\alpha/2 = 0.025$ in each tail]
0.025 = p-value: the probability that a measurement belonging to this distribution falls beyond 1.96 (in one tail)
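The 0.95 and 0.025 areas can be checked with the standard library. A minimal sketch, assuming the usual standardising transform z = (x - mean) / s.d.; z_score is a helper name introduced here for illustration.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardise x from a normal distribution with mean mu and s.d. sigma."""
    return (x - mu) / sigma

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Area between -1.96 and 1.96, and area in the upper tail beyond 1.96.
central_area = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)
upper_tail = 1.0 - standard_normal.cdf(1.96)

print(f"central area = {central_area:.4f}")
print(f"upper tail   = {upper_tail:.4f}")
```

Because the curve is symmetric, the lower tail below -1.96 has the same area as the upper tail.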
Hypothesis Testing: (Unpaired)
Are two data sets different?
[Figure: two panels, $H_0$ and $H_a$, each showing the distributions of Population 1 and Population 2]
If the standard deviation is known use the z-test, else use the t-test
We use the z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known (and are the same)
We pose a null hypothesis that the means are equal
We try to refute the hypothesis using the curves to calculate the probability of seeing a difference this large if the null hypothesis is true (both means are equal)
– If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis (the means are different)
– If the probability is high (high p), retain the null hypothesis (no evidence that the means differ)
In unpaired experiments, we compare the difference between the means.
Comparing Two Samples
Graphical interpretation
To compare two groups you can compare their means graphically.
The graphical comparison lets you see the distributions of the two groups.
If the p-value is low, chances are there will be little overlap between the two distributions. If the p-value is not low, there will be a fair amount of overlap between the two groups.
We can set a critical value on the x-axis based on the p-value threshold
Hypothesis Testing: (Paired)
Are two data sets different?
[Figure: two panels, $H_0$ and $H_a$, each showing the distributions of Population 1 and Population 2]
If the standard deviation is known use the z-test, else use the t-test
We use the z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known
We pose a null hypothesis that the mean difference is zero
We try to refute the hypothesis using the curves to calculate the probability of seeing a mean difference this large if the null hypothesis is true (mean of difference is 0)
– If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis (mean of difference $\neq 0$)
– If the probability is high (high p), retain the null hypothesis (no evidence that the mean of difference is not 0)
In paired experiments, we compare the mean difference.
The t-test
In most cases we use what is known as a t-test rather than the z-test when comparing samples.
In particular when we have
– small data sets (fewer than 30 each) and
– we don’t know the s.d. and have to calculate it from the small samples
The same concepts as before apply, but we base the test on what is known as the t-distribution, which takes the place of the normal distribution for small samples (it has heavier tails and approaches the normal distribution as the sample size grows)
We have to calculate what is known as a t-value!
Typically known as Student's t-test
The t-distribution
In fact we have many t-distributions; each one is calculated in reference to the number of degrees of freedom (d.f.), also denoted by the variable v
[Figure: normal distribution compared with a t-distribution]
We will see how we calculate the degrees of freedom in a short while
t-test terminology
t-test: Used to compare the mean of a sample to a known number (often 0).
Assumptions: Subjects are randomly drawn from a population and the distribution of the mean being tested is normal.
Test: The hypotheses for a single-sample t-test are:
– $H_0$: $\mu = \mu_0$
– $H_a$: $\mu \neq \mu_0$
(where $\mu_0$ denotes the hypothesized value to which you are comparing a population mean)
p-value: probability of error in rejecting the hypothesis of no difference between the two groups.
t-test terminology
Single-tail vs. two-tail
Right tail:
  $H_0$: $\mu_1 \le \mu_2$, $H_1$: $\mu_1 > \mu_2$
  OR $H_0$: $\mu_1 - \mu_2 \le 0$, $H_1$: $\mu_1 - \mu_2 > 0$
Left tail:
  $H_0$: $\mu_1 \ge \mu_2$, $H_1$: $\mu_1 < \mu_2$
  OR $H_0$: $\mu_1 - \mu_2 \ge 0$, $H_1$: $\mu_1 - \mu_2 < 0$
Two tail:
  $H_0$: $\mu_1 = \mu_2$, $H_1$: $\mu_1 \neq \mu_2$
  OR $H_0$: $\mu_1 - \mu_2 = 0$, $H_1$: $\mu_1 - \mu_2 \neq 0$
What am I testing for:
– Right tail: (group 1 > group 2)
– Left tail: (group 1 < group 2)
– Two tail: Both groups are different but I don’t care how.
t-test terminology
Unpaired vs. paired t-test
Same as before!! Depends on your experiment
Unpaired t-test: The hypotheses for the comparison of two independent groups are:
– $H_0$: $\mu_1 = \mu_2$ (means of the two groups are equal)
– $H_a$: $\mu_1 \neq \mu_2$ (means of the two groups are not equal)
Paired t-test: The hypotheses for paired measurements in the same individuals are:
– $H_0$: $D = 0$ (the difference between the two observations is 0)
– $H_a$: $D \neq 0$ (the difference is not 0)
Calculating t-test (t statistic)
First calculate the t statistic value and then calculate the p-value
For the paired t-test, t is calculated using the following formula:
$t = \bar{d} / (\sigma(d) / \sqrt{n})$
where $\bar{d}$ is the mean of the pairwise differences $d_i = x_i - y_i$, and n is the number of pairs being tested.
For an unpaired (independent group) t-test, the following formula is used:
$t = (\bar{x} - \bar{y}) / \sqrt{\sigma(x)^2 / n(x) + \sigma(y)^2 / n(y)}$
where $\sigma(x)$ is the standard deviation of x and $n(x)$ is the number of elements in x.
Remember these formulae!!
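As a worked illustration, the t statistics for the rat data in these slides can be computed as below. This is a sketch under stated assumptions, not necessarily the exact formulae used in the tutorials: it takes the paired statistic to be the mean difference divided by s.d.(d) / sqrt(n), the unpaired statistic to be the difference of means divided by sqrt(s.d.(x)^2 / n(x) + s.d.(y)^2 / n(y)), and it uses the simple (divide by N) standard deviation. The function names are invented for this example.

```python
import math
from statistics import mean, pstdev

def paired_t(x, y):
    """t for paired data: mean difference over s.d.(d) / sqrt(n), with d = x - y."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (pstdev(d) / math.sqrt(len(d)))

def unpaired_t(x, y):
    """t for independent groups: difference of means over the combined standard error."""
    standard_error = math.sqrt(pstdev(x) ** 2 / len(x) + pstdev(y) ** 2 / len(y))
    return (mean(x) - mean(y)) / standard_error

# Paired data for gene 96608_at (same rats measured as WT and KO).
wt_paired = [100, 100, 200, 300]
ko_paired = [200, 300, 400, 500]
t_paired = paired_t(ko_paired, wt_paired)

# Unpaired data for gene 96608_at (different rats in each group).
wt_unpaired = [100, 100, 200, 300]
ko_unpaired = [150, 300, 100, 300]
t_unpaired = unpaired_t(wt_unpaired, ko_unpaired)

print(f"paired t   = {t_paired:.3f}")
print(f"unpaired t = {t_unpaired:.3f}")
```

The paired data give a large t because every rat changes in the same direction, while the unpaired data give a t close to zero because the spread within each group is large compared with the difference between the means.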
Calculating p-value for t-test
When carrying out a test, a p-value can be calculated based on the t-value and the ‘degrees of freedom’.
There are three methods for calculating p:
– One-tailed >:
– One-tailed <:
– Two-tailed:
where p(t,v) is looked up from the t-distribution table
The number of degrees of freedom (v) is calculated as:
– Unpaired: $n(x) + n(y) - 2$
– Paired: $n - 1$ (where n is the number of pairs)
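When no table is to hand, the tail areas of the t-distribution can be approximated numerically. The sketch below is illustrative: it integrates the standard Student t density with Simpson's rule, and the names upper_tail_p and two_tailed_p are introduced here and are not necessarily the p(t,v) function the slides refer to. The check value 1.812 with v = 10 is the one-tailed 0.05 entry of standard t tables; the two t values at the end assume the simple-s.d. statistics for the rat data in these slides.

```python
import math

def t_pdf(x, v):
    """Density of Student's t-distribution with v degrees of freedom."""
    coeff = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return coeff * (1 + x * x / v) ** (-(v + 1) / 2)

def upper_tail_p(t, v, steps=2000):
    """Approximate P(T > t) using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, v) + t_pdf(a, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    area = total * h / 3          # integral of the density from 0 to |t|
    tail = 0.5 - area             # area beyond |t| in one tail
    return tail if t >= 0 else 1.0 - tail

def two_tailed_p(t, v):
    """Approximate P(|T| > |t|)."""
    return 2 * upper_tail_p(abs(t), v)

def df_unpaired(n_x, n_y):
    return n_x + n_y - 2

def df_paired(n_pairs):
    return n_pairs - 1

print(f"P(T > 1.812), v=10: {upper_tail_p(1.812, 10):.4f}")
print(f"paired,   t=8.083,  v={df_paired(4)}: p = {two_tailed_p(8.083, df_paired(4)):.4f}")
print(f"unpaired, t=-0.616, v={df_unpaired(4, 4)}: p = {two_tailed_p(-0.616, df_unpaired(4, 4)):.4f}")
```

In practice a statistics package or spreadsheet function would replace the hand-written integration; the point here is only that p follows from t and v alone.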
p-values
Results of the t-test: If the p-value associated with the t-test is small (usually set at p < 0.05), there is evidence to reject the null hypothesis in favour of the alternative.
In other words, there is evidence that the mean is significantly different from the hypothesized value. If the p-value associated with the t-test is not small (p > 0.05), there is not enough evidence to reject the null hypothesis, and you cannot conclude that the mean differs from the hypothesized value.
Calculating t and p values
You will usually use a piece of software to calculate t and p
– (Excel provides that!)
In problems
– You can assume access to a function p(t,v) which calculates p for a given t value and v (number of degrees of freedom)
– or alternatively have a table indexed by critical t values and v
t-value and p-value
Given a t-value and degrees of freedom, you can look up a p-value
Alternatively, if you know what p-value you need (e.g. 0.05) and the degrees of freedom, you can set the threshold for critical t
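t-test Interpretation
[Figure: t-distribution centred on 0 with rejection regions (Reject $H_0$) of area .025 in each tail, beyond -2.0154 and 2.0154]
The t value must exceed the critical t from the table at the chosen p level
Note: as t increases, p decreases
Finding a critical t
[Table: critical values $t_c$ with columns $t_{.100}$, $t_{.05}$, $t_{.025}$, $t_{.01}$, $t_{.005}$; for $\alpha = .05$ the entry shown is $t_c = 1.812$, with $-t_c = -1.812$ for the left tail]
The table provides the t values ($t_c$) for which $P(t_x > t_c) = \alpha$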
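The decision rule based on a critical t can be written down in a few lines. This is an illustrative sketch: reject_null is a hypothetical helper, and the critical values 2.0154 and 1.812 are simply the ones shown on these slides, used without their degrees of freedom.

```python
def reject_null(t_value, t_critical, tail="two"):
    """Compare a calculated t with a positive critical t taken from a table.

    tail is "two", "right" or "left", matching the alternative hypothesis.
    """
    if tail == "two":
        return abs(t_value) > t_critical
    if tail == "right":
        return t_value > t_critical
    if tail == "left":
        return t_value < -t_critical
    raise ValueError("tail must be 'two', 'right' or 'left'")

# Two-tailed test with .025 in each tail (critical t = 2.0154 on the slide).
print(reject_null(2.5, 2.0154, tail="two"))
# Left-tailed test at .05 (critical t = 1.812 on the slide).
print(reject_null(-1.0, 1.812, tail="left"))
```

The critical value must come from the table row for the correct degrees of freedom and the column for the chosen tail area.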
Summary
Differential analysis
– Uses fold ratio (fold change) for measuring effect
– Need some measure of significance of such effect.
Statistical analysis
– Paired vs. unpaired experiments
t-tests
– Calculating t for paired/unpaired experiments
– Deciding single-tail vs. two-tail
– Calculating degrees of freedom
– Looking up p-value