The NCCR/ISREC DNA Array and Bioinformatics Core ... - BCF-SIB


WORKSHOP

SPOTTED 2-CHANNEL ARRAYS

DATA PROCESSING AND
QUALITY CONTROL

Eugenia Migliavacca and Mauro Delorenzi,

ISREC, December 11, 2003

AIMS

Discussion

Information

Introduction to the use of the webpage for
automated normalization

interface between experimentalists and analysts

feedback

resource allocation

Acknowledgments


some slides originally provided by:

Terry Speed (Berkeley / WEHI)

Sandrine Dudoit (Berkeley)

Yee Hwa Yang (Berkeley)

Natalie Thorne (WEHI)



Otto Hagenbuechle


Eugenia Migliavacca

Darlene Goldstein

and others





RNA ISOLATION

(AMPLIFICATION)
AND LABELING
WITH FLUORO-DYES

Preparation

Hybridisation

Binding labelled samples (targets) to
complementary probes on a slide

Hybridise for 5-12 hours

Wash

Mix

Scanning

Adjust scanner parameters. The following can frequently be adjusted:

1. excitation (laser) intensity

2. "gain" (amplification) of the photon detection system

Human 10K

cDNA Array

How to extract data ?

How to recognize problems ?

Part of the image of one channel, false-coloured on a scale from white (very high) and red (high) through yellow and green (medium) to blue (low) and black.

Scanner's Spots

Steps of a Microarray Experiment

1. RNA preparation and labeling

2. Hybridisation

3. Slide scanning

4. Image analysis

5. Normalization

6. Data for further analysis

Why perform an experiment ?

What is the aim ?

Which conclusions do you want to reach ?

first: DESIGN !

mRNA abundance

[Figure: composition of cellular RNA: rRNA 80%, tRNA, mRNA 1%. mRNA abundance classes: 1-50, 50-500, 500+.]

approx. 300'000 mRNA molecules/cell

approx. 10'000-20'000 different genes

What do you want to measure ?


RNA mass differs between cells

Relative vs Absolute changes

Cell A: 200'000 mRNA molecules/cell, 200 for gene X (0.1%)

Cell B: 400'000 mRNA molecules/cell, 400 for gene X (0.1%)


Is gene X differentially expressed ?
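The question on this slide can be worked through numerically. The sketch below uses the two cells from the slide; the function name and dictionary keys are illustrative only. It shows that the absolute copy number of gene X doubles while its share of the mRNA pool stays the same, so the answer depends on which quantity is measured.

```python
def describe(total_mrna, gene_x_copies):
    """Return absolute copies and relative share of gene X in one cell."""
    return {"absolute": gene_x_copies, "relative": gene_x_copies / total_mrna}

cell_a = describe(200_000, 200)   # numbers taken from the slide
cell_b = describe(400_000, 400)

absolute_fold = cell_b["absolute"] / cell_a["absolute"]
relative_fold = cell_b["relative"] / cell_a["relative"]

print(f"absolute fold change: {absolute_fold:.1f}")
print(f"relative fold change: {relative_fold:.1f}")
```

A measurement based on equal amounts of RNA per sample would tend to follow the relative figure, not the absolute one.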

Steps of a Microarray Experiment

1. RNA preparation and labeling

2. Hybridisation

3. Slide scanning: 16-bit TIFF files

4. Image analysis: (Rfg, Rbg), (Gfg, Gbg), etc.

5. Normalization

6. Data for further analysis: R, G, M, A, etc.

What is needed for high quality data ?

Which are the critical steps ?

Steps of a Microarray Experiment

1. RNA preparation and labeling: spike in RNA in known concentrations and ratios

2. Hybridisation

3. Slide scanning: adjust / balance channels approximately; avoid saturation

4. Image analysis

5. Normalization: check normalized and unnormalized data of experimental RNA and of spiked RNA

6. Data for further analysis

Why avoid saturation ?

Why balance channels ?

Why perform "normalization" ?

What to check before and after normalization ?


Why calculate ratios ?

Why calculate log ratios ?
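One way to approach the saturation question: a 16-bit TIFF pixel cannot exceed 65535, so any signal above that is clipped. The sketch below uses hypothetical true intensities (not taken from the slides) and a simplified scanner model that does nothing but clip. It shows how clipping in one channel compresses the observed log ratio.

```python
import math

SATURATION = 2**16 - 1  # largest value a 16-bit pixel can hold (65535)

def scan(true_signal):
    """Simplified scanner model: the reading is clipped at saturation."""
    return min(true_signal, SATURATION)

def log_ratio(red, green):
    return math.log2(red / green)

# Hypothetical spot whose true red/green ratio is 8
true_red, true_green = 200_000, 25_000

true_m = log_ratio(true_red, true_green)
observed_m = log_ratio(scan(true_red), scan(true_green))

print(f"true M     = {true_m:.2f}")
print(f"observed M = {observed_m:.2f}  (red channel saturated)")
```

Lowering the laser intensity or the gain until the brightest spots fall below the ceiling avoids this compression, at the cost of weaker signal for the faint spots.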

Aim: Gene Expression Data

Gene expression data on p genes for n samples

M = log2(Red intensity / Green intensity)

Genes (rows) by slides (columns):

Gene   slide 1   slide 2   slide 3   slide 4   slide 5   ...
1         0.46      0.30      0.80      1.51      0.90   ...
2        -0.10      0.49      0.24      0.06      0.46   ...
3         0.15      0.74      0.04      0.10      0.20   ...
4        -0.45     -1.03     -0.79     -0.56     -0.32   ...
5        -0.06      1.06      1.35      1.09     -1.09   ...

Gene expression level of gene i in slide j, e.g. gene 5 in slide 4: 1.09

These values are conventionally displayed on a red (>0), yellow (0), green (<0) scale.

Objectives for high quality

Important aspects include:

Tentatively separating systematic sources of variation ("artefacts"), which bias the results, from random sources of variation ("noise"), which hide the truth.

Removing the former as well as possible and quantifying the latter.

Only if this is done can we hope to reach good quality and make valid statements about the confidence in the results.

Typical Statistical Approach

Measured value = real value + systematic errors + noise

Corrected value = real value + noise

Analysis of corrected value => (unbiased) CONCLUSIONS

Estimation of noise => quality of CONCLUSIONS, statistical significance (level of confidence) of the conclusions


Step 1: a) Background Correction, b) Calculation of (log) ratios

Image Analysis => Rfg; Rbg; Gfg; Gbg (fg = foreground, bg = background)

For each spot on the slide calculate:

Red intensity = R = Rfg - Rbg

Green intensity = G = Gfg - Gbg

M = log2(Red intensity / Green intensity)

Subtraction of background values (additive background model, with the background assumed to be locally constant ...)

Sources of background: probe sticking unspecifically to the slide, irregular / dirty slide surface, dust, and noise / errors in the scanner measurement

Not included: real cross-hybridisation and unspecific hybridisation to the probe
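The per-spot calculation of R, G and M can be sketched as follows. The function name and the example intensities are illustrative. Returning `None` when a background-corrected intensity is not positive is one possible convention, chosen here because the logarithm is undefined in that case.

```python
import math

def spot_log_ratio(rfg, rbg, gfg, gbg):
    """Background-correct both channels and return M = log2(R / G).

    Returns None when either corrected intensity is <= 0,
    because the log ratio is then undefined.
    """
    r = rfg - rbg   # Red intensity
    g = gfg - gbg   # Green intensity
    if r <= 0 or g <= 0:
        return None
    return math.log2(r / g)

# Hypothetical spot: R = 2000, G = 500, so R/G = 4
m = spot_log_ratio(rfg=2200, rbg=200, gfg=700, gbg=200)
print(m)

# A weak spot whose red foreground is below its local background
print(spot_log_ratio(rfg=150, rbg=200, gfg=700, gbg=200))
```

Weak spots of the second kind are lost after subtraction, which is one practical cost of background correction.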

Comment to Background Correction

Subtraction of background has frequently been shown not to improve performance: while it brings the average of many measurements closer to the true values (reduced bias or systematic error), it causes higher variability (lower reproducibility).

Which is better ?

A. High variance, unbiased estimator: when you take many measurements, the average will be closer to the true value more frequently.

B. Low variance, slightly biased estimator: when you take one or a few measurements, the average will be closer to the true value more frequently.

[Figure: distributions of single measurements and of averages for estimators A and B.]

DAF Microarrays 2002: we preferred no subtraction; this should be re-evaluated with the Agilent scanner (and GenePix IAS).
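The trade-off between estimators A and B can be illustrated with a small simulation. All numbers below (true value, bias, standard deviations, number of trials) are assumptions chosen for illustration and do not come from the slides. The sketch counts how often the average from A lands closer to the true value than the average from B.

```python
import random
import statistics

random.seed(0)          # fixed seed so the run is reproducible
TRUE_VALUE = 10.0       # hypothetical true value

def estimator_a():
    """Unbiased but noisy (assumed sd = 2.0)."""
    return random.gauss(TRUE_VALUE, 2.0)

def estimator_b():
    """Slightly biased (+0.5) but precise (assumed sd = 0.5)."""
    return random.gauss(TRUE_VALUE + 0.5, 0.5)

def fraction_a_closer(n_measurements, trials=2000):
    """Fraction of trials in which A's average beats B's average."""
    wins = 0
    for _ in range(trials):
        avg_a = statistics.fmean(estimator_a() for _ in range(n_measurements))
        avg_b = statistics.fmean(estimator_b() for _ in range(n_measurements))
        if abs(avg_a - TRUE_VALUE) < abs(avg_b - TRUE_VALUE):
            wins += 1
    return wins / trials

single = fraction_a_closer(1)     # one measurement per estimate
many = fraction_a_closer(100)     # average of 100 measurements
print(f"A closer with 1 measurement:    {single:.2f}")
print(f"A closer with 100 measurements: {many:.2f}")
```

With these assumed settings, B tends to win for single measurements and A for long averages. In a typical microarray experiment with few replicates, that argues for the low-variance option.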

A reminder on logarithms

A numerical example

M = log(R/G) = log R - log G

A = (log R + log G) / 2
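A small worked example of these two quantities, assuming base-2 logarithms (as in the definition of M elsewhere in these slides) and made-up intensities:

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot, using base-2 logarithms."""
    log_r = math.log2(red)
    log_g = math.log2(green)
    m = log_r - log_g          # log ratio
    a = (log_r + log_g) / 2    # average log intensity
    return m, a

# Hypothetical spot: log2(4096) = 12, log2(1024) = 10
m, a = ma_values(4096, 1024)
print(m, a)

# Swapping the channels flips the sign of M and leaves A unchanged
m_swapped, a_swapped = ma_values(1024, 4096)
print(m_swapped, a_swapped)
```

On the log scale a 4-fold increase (M = 2) and a 4-fold decrease (M = -2) are symmetric around zero, whereas the raw ratios 4 and 0.25 are not.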

Step 2: An M vs A (MVA) Plot

[Figure: M vs A plot with lowess curve; highlighted spots: positive controls (spotted in varying concentrations), negative controls, blanks.]

Why use an M vs A plot ?

1. Logs stretch out the region we are most interested in.

2. Features of the data, such as intensity-dependent variation and dye bias, can be seen more clearly.

3. Differentially expressed genes are more easily identified.

4. Intuitive interpretation.

MVA plot: looking at data

[Figure: MVA plot of control slide S1.n showing dye effect and spread, annotated with the lowess curve and spot identifiers.]

Step 3: Normalisation - global median centering

Assumption: changes roughly symmetric

First panel: smooth density of log2 G and log2 R.

Second panel: M vs A plot with median put to zero
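Global median centering amounts to subtracting one number, the median of all M values on the slide, from every spot. A minimal sketch with made-up log ratios:

```python
import statistics

def median_center(m_values):
    """Shift all log ratios so that their median becomes zero."""
    med = statistics.median(m_values)
    return [m - med for m in m_values]

# Hypothetical raw log ratios with an overall dye bias of about +0.5
raw = [0.9, 0.4, 0.5, 0.6, -0.3, 0.5, 1.8]
centered = median_center(raw)

print(statistics.median(raw))
print(statistics.median(centered))
```

The shift is the same at every intensity, so this step cannot remove a dye bias that changes with A.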

common median


Step 4: Normalisation - lowess - local median centering

Assumption: changes roughly symmetric at all intensities.

What is this normalization doing?

Local regression


Classical (global) regression: fits a single line to the entire set of points

Local regression: draws a curve through noisy data by smoothing

Lowess (LOcally WEighted Scatterplot Smoothing) is a type of local regression

Both print-tip and intensity-dependent bias can be corrected with lowess fits to the data within print-tip groups
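To make the idea concrete, here is a simplified, pure-Python local regression: a local linear fit with tricube weights and no robustness iterations, so it is a reduced variant of lowess and not the full algorithm. The function name, the `frac` parameter and the synthetic data are assumptions made for illustration. Normalization subtracts the fitted curve from M at each spot's A.

```python
def lowess_fit(a_values, m_values, frac=0.5):
    """Local linear fit with tricube weights; no robustness iterations."""
    n = len(a_values)
    k = min(n, max(3, int(round(frac * n))))  # points per neighbourhood
    fitted = []
    for a0 in a_values:
        dists = sorted(abs(a - a0) for a in a_values)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to k-th nearest point
        w = [(1 - min(abs(a - a0) / h, 1.0) ** 3) ** 3 for a in a_values]
        sw = sum(w)
        mean_a = sum(wi * a for wi, a in zip(w, a_values)) / sw
        mean_m = sum(wi * m for wi, m in zip(w, m_values)) / sw
        saa = sum(wi * (a - mean_a) ** 2 for wi, a in zip(w, a_values))
        sam = sum(wi * (a - mean_a) * (m - mean_m)
                  for wi, a, m in zip(w, a_values, m_values))
        slope = sam / saa if saa > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))  # local line at a0
    return fitted

# Synthetic slide: no real changes, only an intensity-dependent dye bias
A = [6 + 0.5 * i for i in range(19)]
M = [0.8 - 0.005 * (a - 6) ** 2 for a in A]

trend = lowess_fit(A, M)
M_norm = [m - t for m, t in zip(M, trend)]

print(max(abs(m) for m in M))       # bias before normalization
print(max(abs(m) for m in M_norm))  # residual after normalization
```

For print-tip-group normalization, the same fit would be run separately on the spots of each print-tip group.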

Local regression illustrated

Lowess line


Step 5: Normalisation - spatial corrections

After within-slide global lowess normalization: likely to be a spatial effect.

[Figure: M values by print-tip group.]

Normalization between groups (ctd)


After print-tip location and scale normalization.

[Figure: M values by print-tip group.]

Normalized values look nice, but ...

Identifying sub-array effects

Effects of Location Normalisation (example)

[Figure, before and after: boxplots of log ratios by pin group; lowess lines through points from each pin group.]



Step 6: Rescaling (Spread-Normalisation)

Taking varying scale into account

Assumption: all (print-tip) groups should have the same spread in M

True ratio is μ_ij, where i represents different (print-tip) groups and j represents different spots. Observed is M_ij, where M_ij = a_i * log(μ_ij)

Robust estimate of a_i is

a_i = MAD_i / (MAD_1 * MAD_2 * ... * MAD_I)^(1/I)

where MAD_i is the median absolute deviation of M within group i and I is the number of groups.

Corrected values are calculated as: M_ij / a_i
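A minimal sketch of such a rescaling, assuming that each group's scale factor is its MAD divided by the geometric mean of all groups' MADs (consistent with the rule of adjusting every group's MAD to the geometric mean) and that the M values have already been location-normalized. Function names and data are illustrative.

```python
import math
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def scale_normalize(groups):
    """Rescale each group's M values so that all groups share one MAD.

    Assumes every group has a MAD greater than zero.
    """
    mads = [mad(g) for g in groups]
    geo_mean = math.exp(statistics.fmean(math.log(x) for x in mads))
    factors = [x / geo_mean for x in mads]          # one a_i per group
    rescaled = [[m / a for m in g] for g, a in zip(groups, factors)]
    return rescaled, factors

# Two hypothetical print-tip groups, the second with four times the spread
groups = [
    [-0.2, -0.1, 0.0, 0.1, 0.2],
    [-0.8, -0.4, 0.0, 0.4, 0.8],
]
rescaled, factors = scale_normalize(groups)

print(factors)
print([mad(g) for g in rescaled])
```

Because the factors multiply to one, the overall scale of the slide is preserved while the groups are brought to a common spread.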

Illustration: print-tip-group normalisation

Assumption: for every print-tip group, changes roughly symmetric at all intensities.

[Figure: glass slide with an array of bound cDNA probes; 4x4 blocks = 16 pin groups.]
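Print-tip-group normalization first needs each spot assigned to its pin group. The sketch below assumes zero-based spot coordinates on the whole array and a hypothetical block size of 25 x 25 spots; the real layout depends on the arrayer and is not given in the slides.

```python
def pin_group(spot_row, spot_col, rows_per_block, cols_per_block, block_cols=4):
    """Map a spot's array coordinates to its pin (print-tip) group index.

    Groups are numbered row by row, starting at 0 in the top-left block.
    """
    block_row = spot_row // rows_per_block
    block_col = spot_col // cols_per_block
    return block_row * block_cols + block_col

# Hypothetical layout: 4x4 blocks of 25 x 25 spots each (100 x 100 spots)
ROWS_PER_BLOCK = COLS_PER_BLOCK = 25

print(pin_group(0, 0, ROWS_PER_BLOCK, COLS_PER_BLOCK))     # top-left block
print(pin_group(30, 5, ROWS_PER_BLOCK, COLS_PER_BLOCK))    # second block row
print(pin_group(99, 99, ROWS_PER_BLOCK, COLS_PER_BLOCK))   # bottom-right block

all_groups = {
    pin_group(r, c, ROWS_PER_BLOCK, COLS_PER_BLOCK)
    for r in range(100) for c in range(100)
}
print(len(all_groups))
```

Once spots carry a group index, the location and scale steps can be applied group by group.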

Which normalization to use?

Case 1: A few genes that are likely to change and / or a large random collection of genes (expect as many up as down):

Each slide per se:

Location: print-tip-group lowess normalization.

Scale: for all print-tip groups, adjust the MAD to equal the geometric mean of the MADs of all print-tip groups.

Case 2: Non-random gene collection and / or many genes change appreciably:

USE DYE-SWAP APPROACH

Self-normalization: take the difference of the two log-ratios.

Check using controls or known information.
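Dye-swap self-normalization can be sketched as follows. The model is an assumption made for illustration: the forward slide measures true log ratio plus a dye bias, the swapped slide measures minus the true log ratio plus the same bias. Halving the difference, so that the result stays on the scale of a single log ratio, is also an assumption; the slide only says to take the difference.

```python
def dye_swap_combine(m_forward, m_swapped):
    """Combine a dye-swap pair: half the difference of the two log ratios."""
    return [(mf - ms) / 2 for mf, ms in zip(m_forward, m_swapped)]

# Hypothetical true log ratios and a dye bias shared by both slides
true_m = [1.0, 0.0, -1.5]
dye_bias = [0.3, 0.3, 0.3]

forward = [t + b for t, b in zip(true_m, dye_bias)]    # M  =  true + bias
swapped = [-t + b for t, b in zip(true_m, dye_bias)]   # M' = -true + bias

combined = dye_swap_combine(forward, swapped)
print(combined)
```

The bias cancels only to the extent that it is the same on both slides, which is why the slide recommends checking against controls or known information.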


MVA plots: what to look at ?


How to use the spikes ?


Points:

signal intensity

background

saturation

homogeneity, normalizability

problem diagnosis




Webpage


How to use the plots ?



Use of the different options





Quality control before normalization (?)



Choice of normalization