Introduction to Bioinformatics 1. Course Overview - Department of ...

Oct 2, 2013

Introduction to Bioinformatics


5. Statistical Analysis of Gene Expression Matrices I

Course 341

Department of Computing

Imperial College, London


Moustafa Ghanem

Lecture Overview

Motivation
  Identifying differentially expressed genes
  Calculating effect: fold ratio
  Calculating significance: p-values

Statistical Analysis
  Paired and unpaired experiments
  Need for significance testing
  Hypothesis testing
  t-tests and p-values

t-tests
  Paired and unpaired t-tests
  Formulae for t-test
  Single-tail vs. two-tail t-tests
  Looking up p-values

Motivation

Large-scale Differential Gene Expression Analysis

Consider a microarray experiment that measures gene expression in two groups of rat tissue (>5000 genes in each experiment).

The rat tissues come from two groups:
  WT: Wild-Type rat tissue
  KO: Knock-Out treatment rat tissue

Gene expression for each group is measured under similar conditions.

Question: Which genes are affected by the treatment? How significant is the effect? How big is the effect?

Calculating Expression Ratios


In Differential Gene Expression Analysis, we are interested in identifying genes with different expression across two states, e.g.:
  Tumour cell lines vs. normal cell lines
  Treated tissue vs. diseased tissue
  Different tissues, same organism
  Same tissue, different organisms
  Same tissue, same organism
  Time course experiments

We can quantify the difference (effect) by taking a ratio:

  $R_k = E_k^{(a)} / E_k^{(b)}$

i.e. for gene k, this is the ratio between expression in state a and expression in state b.
  This provides a relative value of change (e.g. expression has doubled)
  If the expression level has not changed, the ratio is 1


Fold change (Fold ratio)

Ratios are troublesome since up-regulated and down-regulated genes are treated differently:
  Genes up-regulated by a factor of 2 have a ratio of 2
  Genes down-regulated by the same factor (2) have a ratio of 0.5

As a result:
  down-regulated genes are compressed between 1 and 0
  up-regulated genes expand between 1 and infinity

Applying a logarithmic transform to the base 2 rectifies the problem; the result is typically known as the fold change.

A gene is up-regulated in state 2 compared to state 1 if it has a higher value in state 2.

A gene is down-regulated in state 2 compared to state 1 if it has a lower value in state 2.

Examples of fold change



Gene ID   Expression in state 1   Expression in state 2   Ratio   Fold Change
A         100                     50                      2       1
B         10                      5                       2       1
C         5                       10                      0.5     -1
D         200                     1                       200     7.64
E         10                      10                      1       0

You can calculate fold change between pairs of expression values,
e.g. between state 1 and state 2 for gene A,
or between the mean values of all measurements for a gene in the WT/KO experiments:
  mean(WT1..WT4) vs. mean(KO1..KO4)

In state 2 compared to state 1:
  A, B and D are down-regulated
  C is up-regulated
  E has no change
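The ratio and fold-change arithmetic for genes A to E can be checked with a few lines of Python. This is a minimal sketch: the function name `fold_change` is made up for illustration, and the ratio is taken as state 1 over state 2, which is the convention the tabulated values imply.

```python
import math

def fold_change(state1, state2):
    """Return (ratio, log2 ratio), with ratio = state 1 / state 2 (assumed convention)."""
    ratio = state1 / state2
    return ratio, math.log2(ratio)

# Expression values (state 1, state 2) for genes A-E from the examples.
genes = {"A": (100, 50), "B": (10, 5), "C": (5, 10), "D": (200, 1), "E": (10, 10)}

for gene, (s1, s2) in genes.items():
    ratio, fc = fold_change(s1, s2)
    print(f"{gene}: ratio = {ratio:g}, fold change = {fc:.2f}")
```

With this convention a positive fold change means the value is lower in state 2 (down-regulated in state 2), and a negative one means it is higher.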

Statistics

Back to our problem

5000 rows represent genes
Columns represent samples
  4 Wild-Type samples (blue)
  4 KO samples (red)

Statistics

Significance of Fold Change

For our problem we can calculate an average fold ratio for each gene (each row).
  This will give us an average effect value for each gene
  e.g. 2, 1.7, 10, 100, etc.

Question: which of these values are significant?
  We can use a threshold, but what threshold value should we set?
  Use statistical techniques based on the number of members in each group, the type of measurements, etc. -> significance testing.

Statistics:

5000 separate statistical problems


How do we think about this problem?



Effectively:


5000 separate experiments where each experiment measures the
expression of one gene in two groups of 4 individuals


For each experiment (gene), want to establish if there is a statistical
difference between the reported values in each group


We then want to identify those genes (across the 5000 genes) that
have a significant change



Each row in our table is similar to a traditional statistical analysis problem



Statistics

Unpaired statistical experiments


Overall setting:

2 groups of 4 individuals each


Group1: Imperial students


Group2: UCL students


Experiment 1:


We measure the height of all students


We want to establish if members of one group are consistently (or on
average) taller than members of the other, and if the measured
difference is significant


Experiment 2:


We measure the weight of all students


We want to establish if members of one group are consistently (or on
average) heavier than the other, and if the measured difference is
significant


Experiment 3:


………


[Figure: two separate tables of measurements under a condition, one for Group 1 members and one for Group 2 members]

Statistics

Unpaired statistical experiments


In unpaired experiments, you typically have two groups of people
that are not related to one another, and measure some property
for each member of each group



e.g. you want to test whether a new drug is effective or not, so you divide similar patients into two groups:
  One group takes the drug
  Another group takes a placebo
  You measure (quantify) the effect in both groups some time later

You want to establish whether there is a significant difference between the two groups at that later point.

The WT/KO example is an unpaired experiment if the rats in the experiments are different!


Statistics

Unpaired statistical experiments


The WT/KO example is an unpaired experiment if the rats in the
experiments are different!


Experiment for WT rats for gene 96608_at

Rat #   WT gene expression
WT1     100
WT2     100
WT3     200
WT4     300

Experiment for KO rats for gene 96608_at

Rat #   KO gene expression
KO1     150
KO2     300
KO3     100
KO4     300
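As a first look at the unpaired data, the group means and the mean-based fold change for gene 96608_at can be computed as below. This is only a sketch: the helper name `mean_fold_change` is hypothetical, and putting WT in the numerator is an assumption rather than something the slides fix.

```python
import math
from statistics import mean

# Expression of gene 96608_at in each rat (values from the WT and KO tables).
wt = [100, 100, 200, 300]
ko = [150, 300, 100, 300]

def mean_fold_change(group_a, group_b):
    """log2 of mean(group_a) / mean(group_b); the choice of numerator is an assumption."""
    return math.log2(mean(group_a) / mean(group_b))

wt_mean = mean(wt)   # 175
ko_mean = mean(ko)   # 212.5
print(wt_mean, ko_mean, round(mean_fold_change(wt, ko), 2))
```

The means differ, but the values within each group are spread widely, which is why a significance test is needed before reading anything into the difference.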

Statistics

Unpaired statistical experiments


How do we address the problem?
  Compare the two sets of results (alternatively, calculate the mean for each group and compare the means)

Graphically:
  Scatter plots
  Box plots, etc.

Statistically:
  Use the unpaired t-test

Are these two series significantly different?

Statistics

Paired statistical experiments


Overall setting:

1 group of 4 individuals


Group1: Imperial students


We make measurements for each student in two situations


Experiment 1:


We measure the height of all students before Bioinformatics course
and after Bioinformatics course


We want to establish if Bioinformatics course consistently (or on
average) affects students’ heights


Experiment 2:


We measure the weight of all students before Bioinformatics course
and after Bioinformatics


We want to establish if Bioinformatics course consistently (or on
average) affects students’ weights


Experiment 3:


………


[Figure: one table of measurements for the group members, with one column for Condition 1 and one for Condition 2]

Statistics

Paired statistical experiments


In paired experiments, you typically have one group of people, and you measure some property for each member before and after a particular event (so measurements come in pairs of before and after).

e.g. you want to test the effectiveness of a new cream for tanning:
  You measure the tan in each individual before the cream is applied
  You measure the tan in each individual after the cream is applied

You want to establish whether there is a significant difference between measurements before and after applying the cream for the group as a whole.



Statistics

Paired statistical experiments


The WT/KO example is a paired experiment if the rats in the
experiments are the same!


Experiments for gene 96608_at

Rat #   WT gene expression   KO gene expression
Rat1    100                  200
Rat2    100                  300
Rat3    200                  400
Rat4    300                  500
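For paired data the quantity of interest is the per-rat difference. A small sketch of that step, assuming the difference is taken as KO minus WT (the direction is a choice; it only flips the sign):

```python
from statistics import mean

wt = [100, 100, 200, 300]   # Rat1..Rat4, WT measurement
ko = [200, 300, 400, 500]   # the same rats, KO measurement

# One difference per rat; KO - WT is an assumed direction.
differences = [k - w for w, k in zip(wt, ko)]

print(differences)          # [100, 200, 200, 200]
print(mean(differences))    # 175
```

Every difference has the same sign here, which is the kind of consistent shift a paired analysis is designed to pick up.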

Statistics

Paired statistical experiments


How do we address the problem?
  Calculate the difference for each pair
  Compare the differences to zero
  Alternatively, compare the average difference to zero

Graphically:
  Scatter plot of differences
  Box plots, etc.

Statistically:
  Use the paired t-test

Are the differences close to zero?

Statistics

Significance testing


In both cases (paired and unpaired) you want to establish whether the difference is significant.

Significance testing is a statistical term and refers to estimating (numerically) the probability that a measured difference occurred by chance.

To do this, you need to review some basic statistics:
  Normal distributions: mean, standard deviation, etc.
  Hypothesis testing
  t-distributions
  t-tests and p-values

Mean and standard deviation

Mean and standard deviation tell you the basic features of a distribution.

mean = average value of all members of the group

  $\mu = (x_1 + x_2 + x_3 + \dots + x_N) / N$

standard deviation = a measure of how much the values of individual members vary in relation to the mean

The normal distribution is symmetrical about the mean. 68% of the normal distribution lies within 1 s.d. of the mean.

[Figure: normal curve over x, with 68% of the distribution lying within 1 s.d. on either side of the mean]

Note on s.d. calculation


Throughout the following slides and in the tutorials, I use the following formula for calculating standard deviation:

  $\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$

Some people use the unbiased form below (for good reasons):

  $s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu)^2}$

Please use the simple form if you want the answers to add up at the end.

The Normal Distribution

Many continuous variables follow a normal distribution, and it plays a special role in the statistical tests we are interested in.

The x-axis represents the values of a particular variable.

The y-axis represents the proportion of members of the population that have each value of the variable.

The area under the curve represents probability, i.e. the area under the curve between two values on the x-axis represents the probability of an individual having a value in that range.


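Normal Distribution and Confidence Intervals

[Figure: standard normal curve with central area $1 - \alpha = 0.95$ between -1.96 and 1.96, and area $\alpha/2 = 0.025$ in each tail]

Any normal distribution can be transformed to a standard distribution (mean 0, s.d. = 1) using a simple transform:

  $z = (x - \mu) / \sigma$

0.025 = p-value: the probability that a measurement belonging to this distribution falls in the tail beyond 1.96 (and, by symmetry, the same for the tail below -1.96)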
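The numbers on this slide (0.95, 0.025, 1.96) and the 68% rule can be reproduced with `statistics.NormalDist` from the Python standard library (available from Python 3.8). This is an illustrative sketch; `z_score` is a hypothetical helper name for the standardising transform.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def z_score(x, mu, sigma):
    """Transform a value from a normal(mu, sigma) distribution to the standard normal."""
    return (x - mu) / sigma

central = std_normal.cdf(1.96) - std_normal.cdf(-1.96)       # area between -1.96 and 1.96
upper_tail = 1.0 - std_normal.cdf(1.96)                       # area beyond 1.96
within_one_sd = std_normal.cdf(1.0) - std_normal.cdf(-1.0)    # area within 1 s.d.
critical_z = std_normal.inv_cdf(0.975)                        # z that leaves 0.025 above it

print(round(central, 3), round(upper_tail, 3), round(within_one_sd, 3), round(critical_z, 2))
```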

Hypothesis Testing: (Unpaired)

Are two data sets different?

[Figure: distributions of Population 1 and Population 2 under H0 and under Ha]

If standard deviation known use z-test, else use t-test

We use the z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known (and are the same).

We pose a null hypothesis that the means are equal.

We try to refute the hypothesis using the curves to calculate the probability (p) of seeing a difference at least as large as the one observed if the null hypothesis were true (both means are equal):
  If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis (both means are different)
  If the probability is high (high p), do not reject the null hypothesis (the data are consistent with both means being equal)

In unpaired experiments, we compare the difference between the means.

Comparing Two Samples

Graphical interpretation


To compare two groups you can compare their means graphically.

The graphical comparison allows you to see the distributions of the two groups.

If the p-value is low, chances are there will be little overlap between the two distributions. If the p-value is not low, there will be a fair amount of overlap between the two groups.

We can set a critical value on the x-axis based on the p-value threshold.

Hypothesis Testing: (Paired)

Are two data sets different?

[Figure: distributions of Population 1 and Population 2 under H0 and under Ha]

If standard deviation known use z-test, else use t-test

We use the z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known.

We pose a null hypothesis that the mean difference is zero.

We try to refute the hypothesis using the curves to calculate the probability (p) of seeing a mean difference at least as large as the one observed if the null hypothesis were true (mean of difference is 0):
  If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis (mean of difference is not 0)
  If the probability is high (high p), do not reject the null hypothesis (the data are consistent with a mean difference of 0)

In paired experiments, we compare the mean difference.

The t-test

In most cases we use what is known as a t-test rather than the z-test when comparing samples.

In particular when we have
  small data sets (fewer than 30 each) and
  we don’t know the s.d. and have to calculate it from the small samples

The same concepts as before apply, but we base the test on what is known as the t-distribution, which takes the place of the normal distribution for small samples.

We have to calculate what is known as a t-value!

Typically known as Student's t-test.

The t-distribution

In fact we have many t-distributions; each one is calculated in reference to the number of degrees of freedom (d.f.), also written as v.

[Figure: normal distribution compared with a t-distribution]

We will see how we calculate the degrees of freedom in a short while.

t-test terminology

t-test: Used to compare the mean of a sample to a known number (often 0).

Assumptions: Subjects are randomly drawn from a population and the distribution of the mean being tested is normal.

Test: The hypotheses for a single-sample t-test are:
  $H_0: \mu = \mu_0$
  $H_a: \mu \neq \mu_0$
(where $\mu_0$ denotes the hypothesized value to which you are comparing a population mean)

p-value: probability of error in rejecting the hypothesis of no difference between the two groups.

t-tests terminology

Single-tail vs. two-tail

Right tail:
  $H_0: \mu_1 \le \mu_2$, $H_1: \mu_1 > \mu_2$
  OR
  $H_0: \mu_1 - \mu_2 \le 0$, $H_1: \mu_1 - \mu_2 > 0$

Left tail:
  $H_0: \mu_1 \ge \mu_2$, $H_1: \mu_1 < \mu_2$
  OR
  $H_0: \mu_1 - \mu_2 \ge 0$, $H_1: \mu_1 - \mu_2 < 0$

Two tail:
  $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \neq \mu_2$
  OR
  $H_0: \mu_1 - \mu_2 = 0$, $H_1: \mu_1 - \mu_2 \neq 0$

What am I testing for:
  Right tail: (group1 > group2)
  Left tail: (group1 < group2)
  Two tail: Both groups are different but I don’t care how.

t-test terminology

Unpaired vs. paired t-test

Same as before!! Depends on your experiment.

Unpaired t-test: The hypotheses for the comparison of two independent groups are:
  $H_0: \mu_1 = \mu_2$ (means of the two groups are equal)
  $H_a: \mu_1 \neq \mu_2$ (means of the two groups are not equal)

Paired t-test: The hypotheses for paired measurements in the same individuals are:
  $H_0: D = 0$ (the difference between the two observations is 0)
  $H_a: D \neq 0$ (the difference is not 0)


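Calculating t-test (t statistic)

First calculate the t statistic value and then calculate the p value.

For the paired t-test, t is calculated using the following formula:

  $t = \frac{\bar{d}}{\sigma(d) / \sqrt{n}}$

where $\bar{d}$ is the mean and $\sigma(d)$ the standard deviation of the differences d, and n is the number of pairs being tested. d is calculated by

  $d_i = x_i - y_i$

For an unpaired (independent group) t-test, the following formula is used:

  $t = \frac{\bar{x} - \bar{y}}{\sqrt{\sigma(x)^2 / n(x) + \sigma(y)^2 / n(y)}}$

where $\sigma(x)$ is the standard deviation of x and n(x) is the number of elements in x.

Remember these formulae!!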
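To make the two calculations concrete, here is a sketch that applies them to the gene 96608_at examples from earlier. It assumes the usual forms of the statistics: for paired data, the mean difference divided by sd(d) / sqrt(n); for unpaired data, the difference of the means divided by sqrt(sd(x)^2 / n(x) + sd(y)^2 / n(y)). The function names are hypothetical, and the standard deviation defaults to the simple (divide by N) form, with the unbiased form available through the `sd` argument.

```python
import math
from statistics import mean, pstdev, stdev

def paired_t(x, y, sd=pstdev):
    """t for paired samples: mean of d = x - y over sd(d) / sqrt(n)."""
    d = [xi - yi for xi, yi in zip(x, y)]
    return mean(d) / (sd(d) / math.sqrt(len(d)))

def unpaired_t(x, y, sd=pstdev):
    """t for two independent groups, using each group's own standard deviation."""
    standard_error = math.sqrt(sd(x) ** 2 / len(x) + sd(y) ** 2 / len(y))
    return (mean(x) - mean(y)) / standard_error

# Paired example: the same four rats measured as WT and as KO.
wt_paired = [100, 100, 200, 300]
ko_paired = [200, 300, 400, 500]

# Unpaired example: four WT rats and four different KO rats.
wt = [100, 100, 200, 300]
ko = [150, 300, 100, 300]

print(round(paired_t(ko_paired, wt_paired), 3))            # simple s.d.
print(round(paired_t(ko_paired, wt_paired, sd=stdev), 3))  # unbiased s.d.
print(round(unpaired_t(wt, ko), 3))
```

The paired data give a large t because every rat shifts in the same direction, while the unpaired data give a t close to zero because the spread within each group is large compared with the difference between the means.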

Calculating p-value for t-test

When carrying out a test, a P-value can be calculated based on the t-value and the ‘degrees of freedom’.

There are three methods for calculating P:
  One tailed >:
  One tailed <:
  Two tailed:

Where p(t,v) is looked up from the t-distribution table.

The number of degrees of freedom (v) is calculated as:
  Unpaired: n(x) + n(y) - 2
  Paired: n - 1 (where n is the number of pairs)
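When no table is at hand, a tail probability of the t-distribution can be approximated numerically. The sketch below integrates the t density with Simpson's rule using only the standard library. All names are hypothetical, and the three tail conventions are the usual textbook definitions (right tail P(T > t), left tail P(T < t), two tails P(|T| > |t|)); they are an assumption and may not match the course's own p(t,v) exactly.

```python
import math

def t_pdf(x, v):
    """Density of Student's t-distribution with v degrees of freedom."""
    log_norm = (math.lgamma((v + 1) / 2) - math.lgamma(v / 2)
                - 0.5 * math.log(v * math.pi))
    return math.exp(log_norm - (v + 1) / 2 * math.log1p(x * x / v))

def upper_tail(t, v, steps=2000):
    """P(T > t), using Simpson's rule on the density between 0 and |t|."""
    a = abs(t)
    h = a / steps                         # steps must be even for Simpson's rule
    total = t_pdf(0.0, v) + t_pdf(a, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    tail = 0.5 - total * h / 3            # P(T > |t|), by symmetry about 0
    return tail if t >= 0 else 1.0 - tail

def p_value(t, v, tail="two"):
    """p-value for a right-, left- or two-tailed test (assumed conventions)."""
    if tail == "right":
        return upper_tail(t, v)
    if tail == "left":
        return 1.0 - upper_tail(t, v)
    if tail == "two":
        return 2.0 * upper_tail(abs(t), v)
    raise ValueError("tail must be 'right', 'left' or 'two'")

def degrees_of_freedom(n_x, n_y=None):
    """Paired: n - 1 (pass only the number of pairs). Unpaired: n(x) + n(y) - 2."""
    return n_x - 1 if n_y is None else n_x + n_y - 2

print(round(p_value(1.812, 10, tail="right"), 3))
print(degrees_of_freedom(4), degrees_of_freedom(4, 4))
```

In practice a statistics package would be used instead; the point here is only that p depends on both t and v.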



p-values

Results of the t-test: If the p-value associated with the t-test is small (usually set at p < 0.05), there is evidence to reject the null hypothesis in favour of the alternative.

In other words, there is evidence that the mean is significantly different from the hypothesized value. If the p-value associated with the t-test is not small (p > 0.05), there is not enough evidence to reject the null hypothesis, and you cannot conclude that the mean is different from the hypothesized value.
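The decision rule can be written down directly. A minimal sketch, with a hypothetical function name and the customary threshold of 0.05 as the default:

```python
def interpret(p, alpha=0.05):
    """Apply the decision rule: a small p-value is evidence against the null hypothesis."""
    if p < alpha:
        return "reject H0"
    return "not enough evidence to reject H0"

print(interpret(0.01))   # reject H0
print(interpret(0.20))   # not enough evidence to reject H0
```

Note that the second outcome is deliberately not worded as "accept H0": a large p-value does not show that the means are equal.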



Calculating t and p values

You will usually use a piece of software to calculate t and P (Excel provides that!).

In problems:
  You can assume access to a function p(t,v) which calculates p for a given t value and v (number of degrees of freedom),
  or alternatively have a table indexed by critical t values and v

t-value and p-value

Given a t-value and degrees of freedom, you can look up a p-value.

Alternatively, if you know what p-value you need (e.g. 0.05) and the degrees of freedom, you can set the threshold for critical t.

t-test Interpretation

[Figure: t-distribution centred on 0 with critical values -2.0154 and 2.0154; each tail has area .025 and is marked "Reject H0"]

The t value must exceed the critical t from the table at the chosen P level.

Note: as t increases, p decreases.

Finding a critical t

[Figure: t-table with columns t.100, t.05, t.025, t.01, t.005; for A = .05 the critical values shown are tc = 1.812 and -tc = -1.812]

The table provides the t values (tc) for which P(tx > tc) = A.
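A critical t can also be found without a table by searching for the value tc at which the upper-tail area equals A. The sketch below does this by bisection over a numerically integrated t density. The names are hypothetical, the search range of 0 to 100 is an assumption that covers common cases (it is too narrow for very small A combined with very few degrees of freedom), and A must be below 0.5.

```python
import math

def t_pdf(x, v):
    """Density of Student's t-distribution with v degrees of freedom."""
    log_norm = (math.lgamma((v + 1) / 2) - math.lgamma(v / 2)
                - 0.5 * math.log(v * math.pi))
    return math.exp(log_norm - (v + 1) / 2 * math.log1p(x * x / v))

def upper_tail(t, v, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on the density between 0 and t."""
    h = t / steps
    total = t_pdf(0.0, v) + t_pdf(t, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    return 0.5 - total * h / 3

def critical_t(area, v, lo=0.0, hi=100.0):
    """Find tc such that P(T > tc) = area, by bisection on [lo, hi]."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail(mid, v) > area:
            lo = mid      # tail still too large: tc lies further out
        else:
            hi = mid
    return (lo + hi) / 2

# A = .05 with 10 degrees of freedom gives a value close to 1.812.
print(round(critical_t(0.05, 10), 3))
```

For a two-tailed test at level 0.05, the same function is called with an area of 0.025 so that each tail carries half of the total.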

Summary

Differential analysis
  Uses fold ratio (fold change) for measuring effect
  Need some measure of significance of such effect

Statistical analysis
  Paired vs. unpaired experiments

t-tests
  Calculating t for paired/unpaired experiments
  Deciding single-tail vs. two-tail
  Calculating degrees of freedom
  Looking up the p-value