Introduction to Bioinformatics 1. Course Overview - Department of ...


2 Oct 2013


Introduction to Bioinformatics


Microarrays 2: Microarray Data Normalisation

Course 341

Department of Computing

Imperial College, London


Moustafa Ghanem

Lecture Overview


Background and Motivation


Introduction


Microarray experiments and microarray data analysis


Sources of variability


Experimental design


Normalisation Examples


Probe intensity values


Two colour arrays


Positive controls


Spatial normalisation within array


Between array normalisation


Normalisation Methods


Total intensity normalisation


Scaling and centring


Linear regression


MA plots and Lowess




Background

Microarrays


A microarray is a device that detects the
presence and abundance of labelled
nucleic acids in a biological sample.



The Microarray consists of a solid
surface onto which known DNA
molecules have been chemically
bonded at special locations.


Each array location is typically known
as a probe and contains many
replicates of the same molecule.


The molecules in each array location
are carefully chosen so as to hybridise
only with mRNA molecules
corresponding to a single gene.


A microarray works by exploiting the ability of an mRNA
molecule to hybridise to its complementary DNA probe


The mRNA molecules in a target biological sample are
labelled using a fluorescent dye and applied to the array


The fluorescent label enables the detection of which
probes have hybridised (presence) via the light emitted
from the probe.

Background

Microarray Data Analysis

Biological question

Biological verification and interpretation

Microarray experiment

Experimental design

Platform Choice

Image analysis

Normalization

Clustering

Pattern Discovery

Sample Attributes

16-bit TIFF Files

(Rspot, Rbkg), (Gspot, Gbkg)

Data Mining

Classification

Statistical Analysis

Motivation



Data generated from Microarray experiments are inherently highly
variable.


1. First, there is the Law of Large Numbers:


Any measurement of thousands of values will find some large
differences due to chance (normal distribution)


However, the average gene does not change its expression across
experiments


Must have replication (e.g. different patients, different experiments)
and statistics to show that measured differences are real.


2. Second, there are also Systematic Sources of Variability:


e.g. errors in scanning microarray images, differences between the
properties of the Cy3 and Cy5 channels, etc.


Must have systematic methods for addressing such errors.
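
The first point can be illustrated with a small simulation. This is only a sketch under assumed values: the number of genes, the noise level and the two-fold threshold are hypothetical choices, and no gene truly changes between the two simulated arrays.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

N_GENES = 10_000   # hypothetical number of probes on the array
NOISE_SD = 0.5     # assumed measurement noise, in log2 units


def noisy_log_ratios(n_genes, noise_sd):
    """Log2 ratios between two arrays when no gene truly changes."""
    ratios = []
    for _ in range(n_genes):
        array_a = random.gauss(0.0, noise_sd)  # noise on array A
        array_b = random.gauss(0.0, noise_sd)  # noise on array B
        ratios.append(array_a - array_b)
    return ratios


log_ratios = noisy_log_ratios(N_GENES, NOISE_SD)
# Count genes that appear to change at least two-fold (|log2 ratio| >= 1).
false_calls = sum(1 for m in log_ratios if abs(m) >= 1.0)
print(f"{false_calls} of {N_GENES} unchanged genes look at least 2-fold different")
```

Even with no real change, some genes cross the threshold purely by chance, which is why replication and statistics are needed.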

Motivation



Normalisation is a general term for a collection of methods that are
directed at reasoning about and resolving the systematic errors
and bias introduced by microarray experimental platforms



Normalisation methods stand in contrast with the data analysis
methods described in other lectures (e.g. differential gene
expression analysis, classification and clustering).



Our overall aim is to be able to quantify measured/calculated
variability, differentials and similarity:


Are they biologically significant or just side effects of the experimental
platforms and conditions?


Introduction

Sources of Microarray Data Variability


There are several levels of variability in
measured gene expression of a feature.



At the highest level, there is biological
variability in the population from which
the sample derives.



At an experimental level, there is


variability between preparations and
labelling of the sample,


variability between hybridisations of the
same sample to different arrays, and


variability between the signal on replicate
features on the same array.



[Figure: true gene expression of individual, plus variability between individuals, between sample preparations, between arrays and hybridisations, and between replicate features, gives measured gene expression]

The measured gene expression in any experiment
includes true gene expression, together with
contributions from many sources of variability

Introduction

Sources of Microarray Data Variability


Population Variation


Whose mRNA are we using? May need to test different samples in parallel.


May need many replicates to study biological variation



Sample Treatment


Experimental conditions



Tissue preparation



Target Preparation


RNA isolation: need to use identical amounts of tissue and identical extraction
methods; use the minimum number of steps; measure the amount of RNA and
normalize the concentration


Labelling: need to account for and measure incorporation of label and
normalize samples to same concentration


Amount: Need to add same amount of label to each hybridization

There are many standard experimental protocols
that biologists need to follow when conducting
their experiments to minimize variability

Introduction

Sources of Microarray Data Variability


Arrays


Same sample may be hybridized to different arrays in different labs


PCR product probes: prepared through amplification directly from cells; must
add the same amount of product to each spot on the filter


Uniformity of spotting: must use arraying tool for filter arrays or robot for
microarrays.


Treatment and handling of filters or slides



Hybridization and washing


Time: long hybridization ensures that hybridization goes to completion.


Temperature: most hybridisations are performed between 45 and 65 °C



Data acquisition


Image acquisition


Spot and background detection


Oligos reduce variability of probes compared to
PCR products.

In-situ synthesis standardises probe production,
produces better spot quality and reduces errors
in image acquisition.

Introduction

Biological and Technical Variability


Biological Variability is variation between individuals in the
population and is independent of the microarray process itself.


Population variability can be measured with pilot studies



Technical Variability is dependent on the microarray process itself.


Technical variability is measured in calibration experiments.



In good experiments, technical variation should be much less than
biological variation


Introduction

Experimental Design

[Figure: tree of replicate experiments. Node labels: Experiment; Replicate 1, Replicate 2; Extract 1, Extract 2; Label Cy3, Label Cy5. Annotations: Biological Replicates, Technical Replicates]


Tree representation of replicate
experiments:


The first level is at the level of
biological replicates


This is followed by two independent
mRNA extractions, and reciprocal
Cy3 and Cy5 labelling


Finally on each array, each probe is
printed in triplicate.



In this example, each data point in
the experiment is replicated a total of
24 times.



Furthermore, in each microarray
experiment, each gene (each probe
or probe set) is really a separate
experiment in its own right
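
The figure of 24 replicates follows from multiplying the number of branches at each level of the tree. A minimal sketch of that arithmetic, assuming two biological replicates as drawn in the tree (the dictionary keys are illustrative names only):

```python
import math

# Branches at each level of the replicate tree described above.
levels = {
    "biological replicates": 2,       # Replicate 1, Replicate 2 (assumed from the tree)
    "mRNA extractions": 2,            # two independent extractions
    "dye labellings": 2,              # reciprocal Cy3 and Cy5 labelling
    "printed spots per probe": 3,     # each probe printed in triplicate
}

total_replicates = math.prod(levels.values())
print(total_replicates)  # 24
```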


Introduction

Conducting Good Experiments

Experimental Design: Requirements

Sample Handling: Identical or comparable
RNA Extraction: Matched Methods
RNA Quantity/Quality: Comparable
Labelling: Matched Methods
Array production: Platform Matched, Batch Matched
Hybridisation: Matched Method, Comparable time
Intensity-dependent bias: Comparable & Normalised
Biological variation: Comparable or Matched

Introduction

Gene Expression Matrices

[Figure: from raw data (array scans: images), through intermediate data (spot/image quantitations: spots), to final data (gene expression matrix: samples by genes, holding gene expression levels)]

Normalisation Examples

Probe Intensity Value


The raw intensities of signal from each spot on the array are not directly
comparable. Depending on the types of experiments done, a number of
different approaches to normalization may be needed. Not all types of
normalization are appropriate in all experiments. Some experiments may
use more than one type of normalization.



Reasonable Assumption: intensities of fluorescent molecules reflect the
abundance of the mRNA molecules


generally true but could be
problematic



Example:


intensity of gene A spot is 100 units in normal-tissue array


intensity of gene A spot is 50 units in cancer-tissue array


Conclusion: gene A’s expression level in normal tissue is significantly
higher than in cancer tissue


Typical Problem: Usually more variability at low intensity

Normalisation Examples

Probe Intensity Value


Problem? What if the overall background intensity of the normal-tissue
array is 95 units while the background intensity of the cancer-tissue
array is 10 units?



Solutions:


Subtract background intensity value


Take ratio of spot intensity to background intensity (preferable)


In both cases have to decide where to measure background intensity
(e.g. local to spot or globally per chip)



In general, there could be many factors contributing to the background
intensity of a microarray chip


To compare microarray data across different chips, data (intensity levels) need
to be normalized to the “same” level
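
The two background corrections can be tried on the numbers from the gene A example (spot intensities of 100 and 50, backgrounds of 95 and 10). This is a minimal sketch; the function and variable names are illustrative only.

```python
def background_subtract(spot, background):
    """Net signal: spot intensity minus background intensity."""
    return spot - background


def signal_to_background(spot, background):
    """Ratio of spot intensity to background intensity."""
    return spot / background


# Gene A on the two arrays, using the values from the example above.
normal = {"spot": 100.0, "background": 95.0}
cancer = {"spot": 50.0, "background": 10.0}

normal_net = background_subtract(normal["spot"], normal["background"])
cancer_net = background_subtract(cancer["spot"], cancer["background"])
normal_ratio = signal_to_background(normal["spot"], normal["background"])
cancer_ratio = signal_to_background(cancer["spot"], cancer["background"])

print(normal_net, cancer_net)  # 5.0 40.0
print(round(normal_ratio, 2), cancer_ratio)  # 1.05 5.0
```

With either correction the conclusion reverses: once background is taken into account, gene A appears more highly expressed in the cancer tissue.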

[Figure: images showing examples of how background intensity can be calculated]

Normalisation Examples

Two Colour Arrays


Reasonable Assumption: For two colour arrays, in a self-self
hybridisation, we expect Red = Green for each spot



Problem: This is not necessarily true due to labelling effects,
chemistry (dye properties), scanner properties, etc.


Dye bias in two-channel microarrays: intensity in one channel may
be higher than the other



Solutions:


Dye swapping experiments: in first replicate label control with red and
experiment with green; in second replicate swap colours


Calibration Experiments (Self vs. self Hybridisation): label the same
extract with both colours and calculate the variation
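
One common way to use a dye-swap pair is sketched below. It assumes a simple additive model on the log2 scale, in which each spot's observed log ratio is the true log ratio plus a dye bias, and the true log ratio changes sign when the labels are swapped. The model and the function name are assumptions for illustration, not a method prescribed by these slides.

```python
def combine_dye_swap(m_forward, m_swapped):
    """Combine log2(R/G) values from a dye-swap pair, spot by spot.

    Assumed model: forward = true + bias, swapped = -true + bias.
    Returns (estimated true log ratios, estimated dye biases).
    """
    true_log_ratios = [(f - s) / 2 for f, s in zip(m_forward, m_swapped)]
    dye_biases = [(f + s) / 2 for f, s in zip(m_forward, m_swapped)]
    return true_log_ratios, dye_biases


# Hypothetical spots with true log ratios 1.0, -0.5, 0.0 and a dye bias of 0.3.
m_forward = [1.3, -0.2, 0.3]
m_swapped = [-0.7, 0.8, 0.3]

estimated_true, estimated_bias = combine_dye_swap(m_forward, m_swapped)
print([round(v, 3) for v in estimated_true])  # [1.0, -0.5, 0.0]
print([round(v, 3) for v in estimated_bias])  # [0.3, 0.3, 0.3]
```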

CONTROL

SAMPLE

Normalisation Examples

Two Colour Arrays


“Error” correction

y = ax

y = x

Possible ways of “correction”:



1. multiplying all x by a; 2. dividing all y by a

This can easily be extended to the case where the regression line is y = ax + b
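
As a rough sketch of this correction: if the fitted line through the origin is y = ax, then dividing every y by a (equivalently, multiplying every x by a) brings the points back onto y = x. The least-squares slope estimate and the sample intensities below are illustrative assumptions.

```python
def fit_slope_through_origin(x, y):
    """Least-squares slope a for the model y = a * x (no intercept)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def correct_y(y, a):
    """Rescale the y channel so that the fitted line becomes y = x."""
    return [yi / a for yi in y]


# Hypothetical self-vs-self intensities where one channel reads 1.5 times higher.
x = [100.0, 200.0, 400.0, 800.0]
y = [150.0, 300.0, 600.0, 1200.0]

a = fit_slope_through_origin(x, y)
y_corrected = correct_y(y, a)
print(a)            # 1.5
print(y_corrected)  # [100.0, 200.0, 400.0, 800.0]
```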

Normalisation Examples

Ratio of Signal to Positive Control


Problem:

Is there any cross hybridisation?


Solution:

It is often useful to spike the labelling reaction with
some foreign RNA or DNA that is not normally in the RNA
population.


The signal s_i for gene i would therefore be the raw counts g_i divided
by the median of the counts for the vector spots.
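
A minimal sketch of this calculation, with made-up counts for two genes and three spiked control spots (all names and values are hypothetical):

```python
from statistics import median


def normalise_to_control(raw_counts, control_counts):
    """s_i = g_i / median(control counts) for every gene i."""
    control_median = median(control_counts)
    return {gene: count / control_median for gene, count in raw_counts.items()}


raw_counts = {"geneA": 1200.0, "geneB": 300.0}  # raw counts g_i
control_counts = [580.0, 600.0, 640.0]          # counts on the spiked control spots

signals = normalise_to_control(raw_counts, control_counts)
print(signals)  # {'geneA': 2.0, 'geneB': 0.5}
```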


How does this approach compare to the
Affymetrix PM/MM probes?

Normalisation Examples

Ratio of Signal to Positive Control


Normalization of signal for each gene to a ratio makes it possible to
compare ratios between experiments, provided that the spiked controls
are the same in all experiments.



Normalization to a positive control is typically used in single-label
experiments. Comparison of one experiment to another can either be
done by plotting signal s_i directly on a graph, or signals from two
experiments can be converted into a ratio, usually by choosing one
treatment as a control.



For example, in a time course, a 0 hour time point might be chosen, and
signal from all other time points divided by the signal for the 0 hour time
point, to give a ratio.
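
The time-course example might look like the following sketch, where each gene's control-normalised signal at every time point is divided by its 0 hour signal. The time points and signal values are invented for illustration.

```python
def ratios_to_baseline(signals_by_time, baseline_index=0):
    """Divide each gene's signal at every time point by its baseline signal."""
    return {
        gene: [s / series[baseline_index] for s in series]
        for gene, series in signals_by_time.items()
    }


hours = [0, 2, 4]  # hypothetical sampling times; index 0 is the 0 hour point
signals_by_time = {"geneA": [2.0, 4.0, 1.0], "geneB": [0.5, 0.5, 2.0]}

ratios = ratios_to_baseline(signals_by_time)
print(ratios)  # {'geneA': [1.0, 2.0, 0.5], 'geneB': [1.0, 1.0, 4.0]}
```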



Normalisation Examples

Spatial variation within array


Problem:

Signal varies according to spot location


Particularly corners: Less hybridization solution


Also because of print-tip group of robot


Solutions:


Calculate ratio to mean or total intensity value


Use Locally Weighted Regression (Lowess)


Use Block-Block Lowess
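
The first solution can be sketched as follows, under the assumption that the ratio is taken within each block (for example each print-tip group), so that every block ends up with a mean of 1. The block identifiers and intensities are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def normalise_by_block(spots):
    """spots: list of (block_id, intensity) pairs.

    Returns each intensity divided by the mean intensity of its block,
    in the same order as the input.
    """
    by_block = defaultdict(list)
    for block_id, intensity in spots:
        by_block[block_id].append(intensity)
    block_means = {block_id: mean(values) for block_id, values in by_block.items()}
    return [intensity / block_means[block_id] for block_id, intensity in spots]


# Two blocks; block 1 reads half as bright overall (e.g. a corner of the array).
spots = [(0, 100.0), (0, 200.0), (0, 300.0), (1, 50.0), (1, 100.0), (1, 150.0)]

normalised = normalise_by_block(spots)
print(normalised)  # [0.5, 1.0, 1.5, 0.5, 1.0, 1.5]
```

After the correction the two blocks are directly comparable, at the cost of assuming that the blocks should have similar average intensity.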

Normalisation Examples

Between Array Normalisation


Assumption: the overall intensities across two arrays should be
similar


Problem:

Not always the case


Solution 1: Ensure that data points in the two-intensity coordinate
system are roughly centred around the diagonal

Solution 2: Use total intensity
normalization for large number

Normalisation Methods

Between Array Normalisation


Mean/Median centering


mean/median intensity of every chip
brought to same level



Total intensity normalization


scaling factor determined by
summing intensities







Spiked-control, housekeeping normalization (Positive Controls)
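
The first two methods can be sketched as simple rescalings of one array against a reference array. The function names and the sample intensities are illustrative assumptions.

```python
from statistics import median


def scale_to_total(intensities, target_total):
    """Total intensity normalisation: rescale so the sum equals target_total."""
    factor = target_total / sum(intensities)
    return [value * factor for value in intensities]


def scale_to_median(intensities, target_median):
    """Median centring by scaling: rescale so the median equals target_median."""
    factor = target_median / median(intensities)
    return [value * factor for value in intensities]


array_1 = [100.0, 200.0, 300.0]  # hypothetical reference array
array_2 = [50.0, 100.0, 150.0]   # same sample, half as bright overall

array_2_by_total = scale_to_total(array_2, sum(array_1))
array_2_by_median = scale_to_median(array_2, median(array_1))
print(array_2_by_total)   # [100.0, 200.0, 300.0]
print(array_2_by_median)  # [100.0, 200.0, 300.0]
```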


Normalisation Methods

Centring and Scaling


Data are scaled to ensure that the means and the standard deviations of all
of the distributions are equal. For each measurement on the array,
subtract the mean measurement of the array and divide by the standard
deviation. Following centring and scaling, the mean measurement on each
array will be 0, and the standard deviation will be 1.
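
A minimal sketch of centring and scaling for one array. It uses the population standard deviation; using the sample standard deviation instead is an equally reasonable choice and changes the result only slightly.

```python
from statistics import mean, pstdev


def centre_and_scale(measurements):
    """Subtract the array mean and divide by the array standard deviation."""
    array_mean = mean(measurements)
    array_sd = pstdev(measurements)  # population standard deviation
    return [(value - array_mean) / array_sd for value in measurements]


array = [2.0, 4.0, 6.0]  # hypothetical (log) intensities on one array
scaled = centre_and_scale(array)
print(round(mean(scaled), 6), round(pstdev(scaled), 6))  # 0.0 1.0
```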



Normalisation Methods

Normalized ratios usually expressed as logs


To facilitate easier mathematical handling of the data, as well as
comparisons over a wide range of expression levels, ratios are usually
expressed as logs.



For example, if a gene is expressed at 4 times the level in the control as
in the mutant, the mutant-to-control ratio is 1/4 and log2(1/4) = -2.
A log ratio of 0 indicates a gene whose expression is the same in both
conditions or treatments.



Ratio: $T_g = \frac{R_g}{G_g}$

Log Ratio: $\log_2(T_g) = \log_2\left(\frac{R_g}{G_g}\right)$
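
A short sketch of why logs are convenient: a 4-fold increase and a 4-fold decrease become +2 and -2, symmetric around 0. The intensities are made-up values.

```python
import math


def log_ratio(red, green):
    """log2 of the ratio T_g = R_g / G_g for one gene."""
    return math.log2(red / green)


up = log_ratio(400.0, 100.0)         # 4-fold higher in the red channel
down = log_ratio(100.0, 400.0)       # 4-fold lower in the red channel
unchanged = log_ratio(250.0, 250.0)  # same in both channels

print(up, down, unchanged)  # 2.0 -2.0 0.0
```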

Normalisation Methods

Regression Normalisation


Regression normalization:


Fit the linear regression model: y = ax + b


Test the significance of the intercept b. Fit a linear
regression without b if it is insignificant.


Transform the data



Problem: the linearity assumption may not
hold due to a nonlinear trend
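
The fit and transform steps can be sketched with ordinary least squares. The significance test on the intercept is omitted here, and the transform y' = (y - b) / a is one plausible reading of "transform the data"; both the names and the sample values are assumptions.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = a * x + b. Returns (a, b)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = y_mean - a * x_mean
    return a, b


def regression_normalise(y, a, b):
    """Map y back onto the y = x line using the fitted slope and intercept."""
    return [(yi - b) / a for yi in y]


x = [1.0, 2.0, 3.0, 4.0]  # hypothetical intensities, channel or array 1
y = [3.0, 5.0, 7.0, 9.0]  # channel or array 2, here exactly y = 2x + 1

a, b = fit_line(x, y)
y_normalised = regression_normalise(y, a, b)
print(a, b)          # 2.0 1.0
print(y_normalised)  # [1.0, 2.0, 3.0, 4.0]
```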




Normalisation Methods

From Scatter Plot to MA Plot


Instead of plotting the two intensity values against one another
(Scatter plot), it is common to use an MA plot


M = log2(R/G): log ratio of the two intensities


A = log2(SQRT(R*G)) = ½ log2(R*G): mean log intensity of the two values
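
Computing the MA coordinates from a pair of intensities is a direct application of the two definitions. A minimal sketch with made-up intensities:

```python
import math


def ma_transform(red, green):
    """Return (M, A) for one spot: M = log2(R/G), A = 0.5 * log2(R*G)."""
    m = math.log2(red / green)
    a = 0.5 * math.log2(red * green)
    return m, a


# Hypothetical (R, G) intensity pairs for three spots.
spots = [(8.0, 2.0), (4.0, 4.0), (2.0, 8.0)]
ma_points = [ma_transform(r, g) for r, g in spots]
print(ma_points)  # [(2.0, 2.0), (0.0, 2.0), (-2.0, 2.0)]
```

All three spots share the same mean log intensity A, so they line up vertically on the MA plot and differ only in M.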

Normalisation Methods

Lowess Normalization


Locally Weighted Least Squares
Regression


Assumption: Variation in data is
intensity-dependent


Smoothes the intensity function


Lowess is typically applied to M-A plots
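
The idea can be sketched in plain Python as below. This is a simplified lowess (local linear fits with tricube weights, no robustness iterations), not a replacement for a library implementation; the `frac` parameter and the sample data are assumptions. The fitted curve estimates the intensity-dependent bias in M, which is then subtracted.

```python
import math


def tricube(u):
    """Tricube weight: largest at u = 0, falling to 0 at |u| >= 1."""
    return (1 - abs(u) ** 3) ** 3 if abs(u) < 1 else 0.0


def lowess_fit(a_values, m_values, frac=0.5):
    """Fitted M at each A using weighted local linear regression."""
    n = len(a_values)
    k = max(2, math.ceil(frac * n))  # number of neighbours in each local window
    fitted = []
    for a0 in a_values:
        distances = sorted(abs(a - a0) for a in a_values)
        bandwidth = distances[k - 1] or 1e-12  # avoid division by zero
        sw = swx = swy = swxx = swxy = 0.0
        for a, m in zip(a_values, m_values):
            w = tricube((a - a0) / bandwidth)
            sw += w
            swx += w * a
            swy += w * m
            swxx += w * a * a
            swxy += w * a * m
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # fall back to a weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * a0)
    return fitted


# Hypothetical MA data with a purely intensity-dependent bias M = 0.1 * A - 1.
a_values = [float(i) for i in range(6, 16)]
m_values = [0.1 * a - 1.0 for a in a_values]

trend = lowess_fit(a_values, m_values)
m_normalised = [m - t for m, t in zip(m_values, trend)]
print([round(v, 6) for v in m_normalised])
```

In this artificial case the bias is exactly linear in A, so the local fits recover it and the normalised M values are all close to 0.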


Summary


Normalisation is used to identify and correct variation that is due to the
experimental platform and conditions rather than to biology.



Typical sources of variation are


Population, Sample, Target, Array (Probe), Hybridisation, Data Acquisition



Different Normalisation Examples


Probe intensity values


Two colour arrays


Positive controls


Spatial normalisation within array


Between array normalisation



Common Normalisation Methods


Mainly scaling factors and regression