# Introduction to Bioinformatics 1. Course Overview - Department of ...

Biotechnology

2 Oct 2013

Introduction to Bioinformatics

5. Statistical Analysis of Gene Expression Matrices I

Course 341

Department of Computing

Imperial College, London

Moustafa Ghanem

Lecture Overview

Motivation

Identifying differentially expressed genes

Calculating effect: fold ratio

Calculating significance: p-values

Statistical Analysis

Paired and unpaired experiments

Need for significance testing

Hypothesis testing

t-tests and p-values

t-tests

Paired and unpaired t-tests

Formulae for t-test

Single-tail vs. two-tail t-tests

Looking up p-values

Motivation

Large-scale Differential Gene Expression Analysis

Consider a microarray experiment that measures gene expression in two groups of rat tissue (>5000 genes in each experiment).

The rat tissues come from two groups:

WT: Wild-Type rat tissue

KO: Knock Out Treatment rat tissue

Gene expression for each group measured under similar conditions

Question: Which genes are affected by the treatment? How
significant is the effect? How big is the effect?

Calculating Expression Ratios

In Differential Gene Expression Analysis, we are interested in identifying
genes with different expression across two states, e.g.:

Tumour cell lines vs. Normal cell lines

Treated tissue vs. diseased tissue

Different tissues, same organism

Same tissue, different organisms

Same tissue, same organism

Time course experiments

We can quantify the difference (effect) by taking a ratio, i.e. for gene k, this is the ratio between its expression in state a and its expression in state b

This provides a relative value of change (e.g. expression has doubled)

If the expression level has not changed, the ratio is 1

Fold change (Fold ratio)

Ratios are troublesome since up-regulated and down-regulated genes are treated differently

Genes up-regulated by a factor of 2 have a ratio of 2

Genes down-regulated by the same factor (2) have a ratio of 0.5

As a result

down-regulated genes are compressed between 1 and 0

up-regulated genes expand between 1 and infinity

Taking the logarithm of the ratio to base 2 rectifies the problem; the result is typically known as the fold change

A gene is up-regulated in state 2 compared to state 1 if it has a higher value in state 2

A gene is down-regulated in state 2 compared to state 1 if it has a lower value in state 2

Examples of fold change

| Gene ID | Expression in state 1 | Expression in state 2 | Ratio | Fold Change |
|---|---|---|---|---|
| A | 100 | 50 | 2 | 1 |
| B | 10 | 5 | 2 | 1 |
| C | 5 | 10 | 0.5 | -1 |
| D | 200 | 1 | 200 | 7.64 |
| E | 10 | 10 | 1 | 0 |

You can calculate fold change between pairs of expression values:

e.g. between state 1 and state 2 for gene A

Or between the mean values of all measurements for a gene in the WT/KO experiments

mean(WT1..WT4) vs mean(KO1..KO4)

A, B and D are down-regulated in state 2

C is up-regulated

E has no change
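The ratio and fold change columns of the example table can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `fold_change` is hypothetical, and it assumes, as the table does, that the ratio is state 1 divided by state 2.

```python
import math

def fold_change(state1, state2):
    """Return (ratio, log2 ratio) of expression in state 1 relative to state 2."""
    ratio = state1 / state2
    return ratio, math.log2(ratio)

# Expression values (state 1, state 2) for genes A..E from the example table
genes = {"A": (100, 50), "B": (10, 5), "C": (5, 10), "D": (200, 1), "E": (10, 10)}

for gene, (s1, s2) in genes.items():
    ratio, fc = fold_change(s1, s2)
    print(f"{gene}: ratio={ratio:g}, fold change={fc:.2f}")
```

Because of the logarithm, a doubling and a halving give values of equal size and opposite sign (1 and -1), which the raw ratios (2 and 0.5) do not.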

Statistics

Back to our problem

5000 rows represent genes

Columns represent samples

4 Wild Type samples (Blue)

4 KO samples (Red)

Statistics

Significance of Fold Change

For our problem we can calculate an average fold ratio for each gene (each row)

This will give us an average effect value for each gene

2, 1.7, 10, 100, etc.

Question: which of these values are significant?

We can use a threshold, but what threshold value should we set?

Use statistical techniques based on the number of members in each group, the type of measurements, etc. -> significance testing.

Statistics:

5000 separate statistical problems

Effectively:

5000 separate experiments where each experiment measures the
expression of one gene in two groups of 4 individuals

For each experiment (gene), want to establish if there is a statistical
difference between the reported values in each group

We then want to identify those genes (across the 5000 genes) that
have a significant change

Each row in our table is similar to a traditional statistical analysis problem

Statistics

Unpaired statistical experiments

Overall setting:

2 groups of 4 individuals each

Group1: Imperial students

Group2: UCL students

Experiment 1:

We measure the height of all students

We want to establish if members of one group are consistently (or on
average) taller than members of the other, and if the measured
difference is significant

Experiment 2:

We measure the weight of all students

We want to establish if members of one group are consistently (or on
average) heavier than the other, and if the measured difference is
significant

Experiment 3:

………

[Figure: one condition measured on Group 1 members and on Group 2 members]

Statistics

Unpaired statistical experiments

In unpaired experiments, you typically have two groups of people that are not related to one another, and measure some property for each member of each group

e.g. you want to test whether a new drug is effective or not, so you divide similar patients into two groups:

One group takes the drug

Another group takes a placebo

You measure (quantify) the effect in both groups some time later

You want to establish whether there is a significant difference between both groups at that later point

The WT/KO example is an unpaired experiment if the rats in the experiments are different!


Statistics

Unpaired statistical experiments

The WT/KO example is an unpaired experiment if the rats in the
experiments are different!

Experiment for WT rats for gene 96608_at

| Rat # | WT gene expression |
|---|---|
| WT1 | 100 |
| WT2 | 100 |
| WT3 | 200 |
| WT4 | 300 |

Experiment for KO rats for gene 96608_at

| Rat # | KO gene expression |
|---|---|
| KO1 | 150 |
| KO2 | 300 |
| KO3 | 100 |
| KO4 | 300 |
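For this gene, the average effect described earlier (fold change between the group means) can be sketched as follows. The helper name `mean_fold_change` is hypothetical, and WT divided by KO is an assumed direction for the ratio; swapping the arguments flips the sign of the log value.

```python
import math
from statistics import mean

# Expression of gene 96608_at in the unpaired WT and KO rats (values from the tables)
wt = [100, 100, 200, 300]
ko = [150, 300, 100, 300]

def mean_fold_change(a, b):
    """Ratio of the group means (a over b) and its log2 value."""
    ratio = mean(a) / mean(b)
    return ratio, math.log2(ratio)

ratio, fc = mean_fold_change(wt, ko)
print(f"mean WT={mean(wt)}, mean KO={mean(ko)}, ratio={ratio:.3f}, log2 ratio={fc:.3f}")
```

The result is an effect size only; whether it is significant is the question the rest of the lecture addresses.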

Statistics

Unpaired statistical experiments

How do we address the problem?

Compare the two sets of results (alternatively, calculate the mean for each group and compare the means)

Graphically:

Scatter plots

Box plots, etc.

Statistically:

Use unpaired t-test

Are these two series significantly different?

Statistics

Paired statistical experiments

Overall setting:

1 group of 4 individuals

Group 1: Imperial students

We make measurements for each student in two situations

Experiment 1:

We measure the height of all students before Bioinformatics course
and after Bioinformatics course

We want to establish if Bioinformatics course consistently (or on
average) affects students’ heights

Experiment 2:

We measure the weight of all students before Bioinformatics course
and after Bioinformatics

We want to establish if Bioinformatics course consistently (or on
average) affects students’ weights

Experiment 3:

………

[Figure: the same group members measured under Condition 1 and Condition 2]

Statistics

Paired statistical experiments

In paired experiments, you typically have one group of people, and you measure some property for each member before and after a particular event (so measurements come in pairs of before and after)

e.g. you want to test the effectiveness of a new cream for tanning

You measure the tan in each individual before the cream is applied

You measure the tan in each individual after the cream is applied

You want to establish whether there is a significant difference between measurements before and after applying the cream for the group as a whole


Statistics

Paired statistical experiments

The WT/KO example is a paired experiment if the rats in the
experiments are the same!

Experiments for gene 96608_at

| Rat # | WT gene expression | KO gene expression |
|---|---|---|
| Rat1 | 100 | 200 |
| Rat2 | 100 | 300 |
| Rat3 | 200 | 400 |
| Rat4 | 300 | 500 |

Statistics

Paired statistical experiments

How do we address the problem?

Calculate the difference for each pair

Compare the differences to zero (alternatively, compare the average difference to zero)

Graphically:

Scatter plot of differences

Box plots, etc.

Statistically:

Use paired t-test

Are the differences close to zero?
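The first step for the paired data is just a per-rat subtraction. A minimal sketch using the paired values for gene 96608_at; the helper name `pair_differences` is hypothetical, and taking KO minus WT as the direction of the difference is an assumption.

```python
from statistics import mean

# Paired measurements for gene 96608_at: the same four rats under both conditions
wt = [100, 100, 200, 300]
ko = [200, 300, 400, 500]

def pair_differences(first, second):
    """Difference (second - first) for each pair; both lists must align by individual."""
    if len(first) != len(second):
        raise ValueError("paired data need one value per individual in each condition")
    return [b - a for a, b in zip(first, second)]

diffs = pair_differences(wt, ko)
mean_diff = mean(diffs)
print(diffs, mean_diff)
```

Every difference here has the same sign, which is what makes the paired comparison more sensitive than comparing the two group means alone.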

Statistics

Significance testing

In both cases (paired and unpaired) you want to establish whether
the difference is significant

Significance testing is a statistical term and refers to estimating
(numerically) the probability of a measurement occurring by
chance.

To do this, you need to review some basic statistics

Normal distributions: mean, standard deviations, etc

Hypothesis Testing

t-distributions

t-tests and p-values

Mean and standard deviation

Mean and standard deviation tell you the basic features of a distribution

mean = average value of all members of the group

$u = (x_1 + x_2 + x_3 + \dots + x_N)/N$

standard deviation = a measure of how much the values of individual members vary in relation to the mean

The normal distribution is symmetrical about the mean

68% of the normal distribution lies within 1 s.d. of the mean

[Figure: normal curve over x, with 68% of the distribution within 1 s.d. either side of the mean]

Note on s.d. calculation

Throughout the following slides and in the tutorials, I use the following formula for calculating standard deviation

Some people use the unbiased form below (for good reasons)

at the end
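The two forms of the standard deviation differ only in the divisor. As a minimal sketch, assuming the first form divides the sum of squared deviations by N and the unbiased form divides by N - 1 (the formulas themselves are not reproduced in this text), Python's standard `statistics` module provides both as `pstdev` and `stdev`:

```python
from statistics import pstdev, stdev

# WT expression values for gene 96608_at, reused here purely as sample data
values = [100, 100, 200, 300]

# Assumption: "divide by N" form (population standard deviation)
sd_n = pstdev(values)

# Unbiased form: divide by N - 1 (sample standard deviation)
sd_n_minus_1 = stdev(values)

print(f"divide by N: {sd_n:.2f}, divide by N - 1: {sd_n_minus_1:.2f}")
```

With only four values the two results differ noticeably, so it matters which form is used when checking t values by hand against software output.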

The Normal Distribution

The normal distribution plays a special role in the statistical tests we are interested in:

The x-axis represents the values of a particular variable

The y-axis represents the proportion of members of the population that have each value of the variable

The area under the curve represents probability

i.e. the area under the curve between two values on the x-axis represents the probability of an individual having a value in that range

Normal Distribution and Confidence Intervals

[Figure: standard normal curve with central area 1 - α = 0.95 between -1.96 and 1.96, and α/2 = 0.025 in each tail]

Any normal distribution can be transformed to a standard distribution (mean 0, s.d. = 1) using a simple transform

0.025 = p-value: probability of a measurement value belonging to this distribution
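The simple transform is commonly written z = (x - mean) / s.d. The sketch below uses `statistics.NormalDist` from the Python standard library and a hypothetical helper `z_score` to check the figures quoted above: about 68% of the distribution within 1 s.d., 95% between -1.96 and 1.96, and 0.025 in each tail.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Map a value from a normal distribution onto the standard normal (mean 0, s.d. 1)."""
    return (x - mean) / sd

std = NormalDist()  # standard normal: mean 0, s.d. 1

within_1sd = std.cdf(1) - std.cdf(-1)        # area within 1 s.d. of the mean
central_95 = std.cdf(1.96) - std.cdf(-1.96)  # area between -1.96 and 1.96
upper_tail = 1 - std.cdf(1.96)               # area beyond 1.96

print(f"{within_1sd:.4f} {central_95:.4f} {upper_tail:.4f}")
```

For example, under the hypothetical parameters mean 100 and s.d. 15, `z_score(130, 100, 15)` places the value 130 two standard deviations above the mean.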

Hypothesis Testing: (Unpaired)

Are two data sets different?

[Figure: Population 1 and Population 2 distributions under H0 and under Ha]

If the standard deviation is known use a z-test, else use a t-test

We use the z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known (and are the same)

We pose a null hypothesis that the means are equal

We try to refute the hypothesis using the curves to calculate the probability of seeing a difference at least as large as the one observed if the null hypothesis is true (both means are equal)

If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis (both means are different)

If the probability is high (high p), do not reject the null hypothesis (the data are consistent with both means being equal)

In unpaired experiments, we compare the difference between the means.

Comparing Two Samples

Graphical interpretation

To compare two groups you can compare their means graphically.

The graphical comparison allows you to see the distribution of the two groups.

If the p-value is low, chances are there will be little overlap between the two distributions. If the p-value is not low, there will be a fair amount of overlap between the two groups.

We can set a critical value for the x-axis based on the p-value threshold

Hypothesis Testing: (Paired)

Are two data sets different?

[Figure: Population 1 and Population 2 distributions under H0 and under Ha]

If the standard deviation is known use a z-test, else use a t-test

We use the z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known

We pose a null hypothesis that the mean difference is zero

We try to refute the hypothesis using the curves to calculate the probability of seeing a mean difference at least as large as the one observed if the null hypothesis is true (mean of difference is 0)

If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis (mean of difference ≠ 0)

If the probability is high (high p), do not reject the null hypothesis (the data are consistent with a mean difference of 0)

In paired experiments, we compare the mean difference.

The t-test

In most cases we use what is known as a t-test rather than the z-test when comparing samples.

In particular when we have

small data sets (fewer than 30 each) and

we don’t know the s.d. and have to calculate it from the small samples

The same concepts as before apply, but we base the test on what is known as the t-distribution, which takes the place of the normal distribution for small samples

We have to calculate what is known as a t-value!

Typically known as Student's t-test

The t-distribution

In fact we have many t-distributions; each one is calculated in reference to the number of degrees of freedom (d.f.), also known as variables (v)

[Figure: normal distribution compared with a t-distribution]

We will see how we calculate the degrees of freedom in a short while

t-test terminology

t-test:

Used to compare the mean of a sample to a known number (often 0).

Assumptions: Subjects are randomly drawn from a population and the distribution of the mean being tested is normal.

Test: The hypotheses for a single sample t-test are:

$H_0: u = u_0$

$H_a: u \neq u_0$

(where $u_0$ denotes the hypothesized value to which you are comparing a population mean)

p-value: probability of error in rejecting the hypothesis of no difference between the two groups.

t-test terminology

Single-tail vs. two-tail

Right tail:

$H_0: u_1 \le u_2$, $H_1: u_1 > u_2$

OR

$H_0: u_1 - u_2 \le 0$, $H_1: u_1 - u_2 > 0$

Left tail:

$H_0: u_1 \ge u_2$, $H_1: u_1 < u_2$

OR

$H_0: u_1 - u_2 \ge 0$, $H_1: u_1 - u_2 < 0$

Two tail:

$H_0: u_1 = u_2$, $H_1: u_1 \neq u_2$

OR

$H_0: u_1 - u_2 = 0$, $H_1: u_1 - u_2 \neq 0$

What am I testing for:

Right tail: (group 1 > group 2)

Left tail: (group 1 < group 2)

Two tail: Both groups are different but I don’t care how.

t-test terminology

Unpaired vs. paired t-test

Same as before! It depends on your experiment

Unpaired t-test: The hypotheses for the comparison of two independent groups are:

$H_0: u_1 = u_2$ (means of the two groups are equal)

$H_a: u_1 \neq u_2$ (means of the two groups are not equal)

Paired t-test: The hypotheses for paired measurements in the same individuals are:

$H_0: D = 0$ (the difference between the two observations is 0)

$H_a: D \neq 0$ (the difference is not 0)

Calculating t-test (t statistic)

First calculate the t statistic value and then calculate the p value

For the paired t-test, t is calculated from d, the differences between the paired values, and n, the number of pairs being tested.

For an unpaired (independent group) t-test, t is calculated from the two groups x and y, where σ(x) is the standard deviation of x and n(x) is the number of elements in x.

Remember these formulae!!
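The formulae themselves do not appear in the text above, so what follows is only a sketch under stated assumptions. It takes the paired statistic to be the mean difference divided by its standard error, s.d.(d) / sqrt(n), and the unpaired statistic to be the difference of the group means divided by sqrt(σ(x)²/n(x) + σ(y)²/n(y)). It uses the N - 1 form of the variance (`statistics.variance`); substituting `statistics.pvariance` gives the divide-by-N form. The function names are hypothetical.

```python
import math
from statistics import mean, variance

def paired_t(x, y):
    """Assumed paired form: mean of the differences (y - x) over its standard error."""
    d = [b - a for a, b in zip(x, y)]
    n = len(d)                                  # number of pairs
    return mean(d) / math.sqrt(variance(d) / n)

def unpaired_t(x, y):
    """Assumed unpaired form: difference of means over sqrt(var(x)/n(x) + var(y)/n(y))."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

# Paired data for gene 96608_at (same rats measured as WT and as KO)
t_paired = paired_t([100, 100, 200, 300], [200, 300, 400, 500])

# Unpaired data for gene 96608_at (different rats in each group)
t_unpaired = unpaired_t([100, 100, 200, 300], [150, 300, 100, 300])

print(f"paired t = {t_paired:.3f}, unpaired t = {t_unpaired:.3f}")
```

The paired data give a much larger |t| than the unpaired data, since every rat changes in the same direction even though the groups overlap.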

Calculating p-value for t-test

When carrying out a test, a P-value can be calculated based on the t-value and the ‘Degrees of freedom’.

There are three methods for calculating P:

One Tailed >:

One Tailed <:

Two Tailed:

Where p(t,v) is looked up from the t-distribution table

The number of degrees of freedom (v) is calculated as:

Unpaired: n(x) + n(y) - 2

Paired: n - 1 (where n is the number of pairs.)
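A minimal sketch of the lookup, assuming p(t, v) denotes the upper-tail probability P(T > t) of a t-distribution with v degrees of freedom (the text does not spell out its definition). Instead of a table, the code integrates the t density numerically with Simpson's rule; the three ways of obtaining P then follow from the symmetry of the distribution. All function names other than `p` are hypothetical.

```python
import math

def t_density(x, v):
    """Density of the t-distribution with v degrees of freedom."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return c * (1 + x * x / v) ** (-(v + 1) / 2)

def p(t, v, steps=2000):
    """Assumed meaning of p(t, v): upper-tail probability P(T > t)."""
    a = abs(t)
    h = a / steps                       # steps must be even for Simpson's rule
    total = t_density(0.0, v) + t_density(a, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, v)
    upper = 0.5 - total * h / 3         # 0.5 minus the area between 0 and |t|
    return upper if t >= 0 else 1 - upper

def one_tailed_greater(t, v):
    """Right tail: evidence that group 1 > group 2."""
    return p(t, v)

def one_tailed_less(t, v):
    """Left tail: evidence that group 1 < group 2."""
    return p(-t, v)

def two_tailed(t, v):
    """Two tails: evidence of a difference in either direction."""
    return 2 * p(abs(t), v)

def df_unpaired(n_x, n_y):
    return n_x + n_y - 2

def df_paired(n_pairs):
    return n_pairs - 1

print(two_tailed(2.0154, 44), df_unpaired(4, 4), df_paired(4))
```

The numerical integration is adequate for the small degrees of freedom used in this lecture; a statistics package would normally be used instead.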

p-values

Results of the t-test: If the p-value associated with the t-test is small (usually set at p < 0.05), there is evidence to reject the null hypothesis in favour of the alternative.

In other words, there is evidence that the mean is significantly different from the hypothesized value. If the p-value associated with the t-test is not small (p > 0.05), there is not enough evidence to reject the null hypothesis, and you cannot conclude that the mean differs from the hypothesized value.

Calculating t and p values

You will usually use a piece of software to calculate t and P (Excel provides that!).

In problems, you can assume access to a function p(t,v) which calculates p for a given t value and v (number of degrees of freedom), or alternatively have a table indexed by critical t values and v

t-value and p-value

Given a t-value and degrees of freedom, you can look up a p-value

Alternatively, if you know what p-value you need (e.g. 0.05) and the degrees of freedom, you can set the threshold for critical t
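Going from a required p-value to a critical t is the inverse lookup. As a sketch, assuming the p-value refers to the area in one tail, the critical t can be found by bisection on a numerically integrated upper-tail probability. The function names are hypothetical, and the search assumes 0 < alpha < 0.5 and a critical value below `hi`.

```python
import math

def upper_tail(t, v, steps=2000):
    """P(T > t) for t >= 0 and v degrees of freedom, via Simpson's rule."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))

    def density(x):
        return c * (1 + x * x / v) ** (-(v + 1) / 2)

    h = t / steps                       # steps must be even for Simpson's rule
    total = density(0.0) + density(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return 0.5 - total * h / 3

def critical_t(alpha, v, hi=100.0, tol=1e-6):
    """Value t_c with P(T > t_c) = alpha, located by bisection on [0, hi]."""
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if upper_tail(mid, v) > alpha:
            lo = mid                    # tail still too large: move right
        else:
            hi = mid                    # tail small enough: move left
    return (lo + hi) / 2

print(round(critical_t(0.05, 10), 3), round(critical_t(0.025, 44), 4))
```

For a two-tailed test at p = 0.05, half of the probability goes in each tail, so the call would use alpha = 0.025.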

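t-test Interpretation

[Figure: t-distribution centred on 0 with critical values -2.0154 and 2.0154; the area of .025 beyond each critical value is the Reject H0 region]

The calculated t value must exceed the critical t value from the table at the chosen P level

Note: as t increases, p decreases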

Finding a critical t

[Figure: t-table with columns t.100, t.05, t.025, t.01 and t.005; for A = .05 the critical values are tc = 1.812 and -tc = -1.812]

The table provides the t values ($t_c$) for which $P(t_x > t_c) = A$

Summary

Differential analysis

Uses fold ratio (fold change) for measuring effect

Need some measure of significance of such effect.

Statistical analysis

Paired vs. unpaired experiments

t-tests

Calculating t for paired/unpaired experiments

Deciding single-tail vs. two-tail

Calculating degrees of freedom

Look up p-value