QC and pre-processing - BiGCaT - Data Server


QC and pre-processing of microarray data

Lars Eijssen - BiGCaT Bioinformatics

Bioinformatics to understand studies in genomics, São Paulo, June 9-11 2011

Contents

Background on quality control (QC) and (further) data pre-processing
Application of an automated workflow for Affymetrix data
  Settings
  Illustration on data sets
  Interpretation of outcome
Introduction to the afternoon session and the data set to be used



BACKGROUND


Proper quality control (QC)

Ensures validity of study results
Is pivotal in omics research
  Hard to judge quality by eye
Several tables and images assist in judging quality
Here we focus on QC of gene expression arrays


Data analysis overview

[Workflow diagram: samples (untreated control, exposed to compound) -> microarray scans -> image analysis -> raw data -> quality control and further pre-processing (background correction, normalisation) -> normalised data -> statistical analysis -> list of regulated genes -> pattern analysis, pathway analysis, literature data -> results]

Slide based on a slide from J. Pennings, RIVM, NL


QC and pre-processing

Ensure signal comparability within each array
  Stains on the array
  Gradient over the array
Ensure comparable signals between all arrays
  Degraded / low quality sample
  Failed hybridisation
  Too low or high overall intensity
Some effects can be corrected for, others require removal of data from the set



QC for one and two channel microarrays

The principles are similar for both types of arrays
But the details are different
In two channel arrays QC is a bit more complex
  Each spot consists of two measurements, not one
  Dye effect
I will further discuss QC later in this talk, focusing on one channel arrays (Affymetrix chips)




Dye bias

[Figure panels: foreground intensity; background intensity]


Red and green foreground intensity

For two channel arrays, it is relevant to check whether effects cancel out between channels
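
One common way to inspect whether red and green signals balance out is to compute, per spot, the log ratio M = log2(R/G) and the average log intensity A = 0.5 * log2(R * G). The sketch below is only an illustration under that assumption: the function name `ma_values` and the intensities are made up for this example and are not taken from the talk.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for paired red/green foreground intensities.

    M = log2(R) - log2(G), the log ratio between channels.
    A = 0.5 * (log2(R) + log2(G)), the average log intensity.
    """
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a

# Hypothetical foreground intensities for four spots.
red = [1024.0, 512.0, 256.0, 2048.0]
green = [512.0, 512.0, 1024.0, 2048.0]

m, a = ma_values(red, green)
print(m)  # [1.0, 0.0, -2.0, 0.0]
print(a)  # [9.5, 9.0, 9.0, 11.0]
```

If M values are centred around zero over the whole range of A, the two channels are roughly balanced; a trend of M with A would suggest an intensity-dependent dye effect.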


Pre-processing: background correction

Background signal needs to be corrected for
  For example signal of remaining non-hybridised mRNA
Three types of background
  Overall slide background
  Local slide background
  Specific background
    For example cross-hybridisation, which can be corrected for by mismatch probes (in case of Affymetrix chips)
    Also used to make present/marginal/absent calls


Pre-processing: normalisation

After discarding bad arrays and spots, remaining within- and between-array differences not related to the biology need to be corrected for
The procedure is cyclic
  Several QC plots are made before and after normalisation
  Whether normalisation can correct an artifact may influence the decision to discard or not
  After data selection, the complete QC should be run again
    Some aberrations may have been masked by larger ones



Log transformation

Generally, the intensities are first log2-transformed
The distribution of the logged intensities is more ‘normal’ than on the original scale
Log-transformed data are easier to handle statistically
  This will be explained more in the lecture on statistics
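
As a small illustration of why the log scale is easier to work with, the sketch below log2-transforms a handful of intensities and compares mean and median before and after. The intensities are hypothetical, and `log2_transform` is a helper name chosen for this example only.

```python
import math
from statistics import mean, median

def log2_transform(intensities):
    """Return the log2 of each raw intensity; all values must be positive."""
    if any(x <= 0 for x in intensities):
        raise ValueError("intensities must be positive before taking logs")
    return [math.log2(x) for x in intensities]

# Hypothetical raw intensities for one array (not from the course data set).
raw = [60.0, 120.0, 250.0, 500.0, 1000.0, 4000.0, 16000.0]
logged = log2_transform(raw)

# On the raw scale a few bright probes pull the mean far above the median.
# On the log2 scale mean and median lie much closer together.
print(mean(raw) / median(raw))
print(mean(logged) / median(logged))
```

A mean far above the median points to a right-skewed distribution; after the transformation the two are close, which is what the slide means by a more ‘normal’ distribution.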





Spotted and Affymetrix arrays

Spotted arrays
  Either one or two channel
  Spot-level QC often included
  Also often parts of arrays are flagged
  Each gene is measured by only one or two probes on the array

Affymetrix chips (main focus in remainder of talk)
  Always one channel, so no dye effect
  No spot-level QC is taken into account
  No flagging of local aberrations
  Each gene is measured by a probeset of probes spread randomly over the array


Pre-processing for Affymetrix chips

A specific extra step is summarisation of probe values into one value for each probeset
Well-known methods for pre-processing Affymetrix chips
  MAS5.0 (uses mismatch intensities)
  RMA (Robust Multiarray Average, does not use mismatches)
    Includes both background correction and (quantile) normalisation
  GC-RMA (like RMA, but also takes into account GC content)
  dChip (model-based)
For exonST and geneST arrays, of these methods only RMA can be used (another option is PLIER, which is error-model based)
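
To make the idea of quantile normalisation concrete, here is a minimal sketch in plain Python. It is not the RMA implementation: it only shows the core step of giving every array the same intensity distribution by replacing each value with the mean of the values that share its rank across arrays. The function name, the toy data, and the tie handling (ties broken by position) are assumptions of this example.

```python
def quantile_normalise(arrays):
    """Quantile-normalise equal-length arrays (one list of values per array).

    Each value is replaced by the mean, across arrays, of the values that
    share its rank. Ties are broken by position, which is a simplification.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value over all arrays.
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)
    ]
    normalised = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalised.append(out)
    return normalised

# Two hypothetical arrays measuring the same four probes.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [40.0, 10.0, 60.0, 20.0],
]
normalised = quantile_normalise(arrays)
print(normalised[0])  # [32.5, 6.0, 11.5, 22.0]
print(normalised[1])  # [22.0, 6.0, 32.5, 11.5]
```

After normalisation both arrays contain exactly the same set of values, while the rank order of the probes within each array is unchanged.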



Custom CDF files

Affymetrix provides annotations for their probesets (CDF file)
When these get outdated, one can of course update probeset annotations
But it may be even better to:
  disassemble these sets into the separate probes
  reannotate probes
  reassemble these into new, different probesets
This is exactly what custom CDF files do
Note that reassembled probesets do not necessarily contain the same number of probes anymore
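
The disassemble, reannotate, reassemble idea can be sketched with plain dictionaries. This is a toy illustration only: real custom CDF construction relies on mapping probe sequences to a genome database, which is not shown here. The function name, probe IDs, and gene IDs are all hypothetical.

```python
from collections import defaultdict

def reassemble_probesets(original_probesets, probe_to_gene):
    """Regroup probes by their updated gene annotation.

    original_probesets: {probeset_id: [probe_id, ...]}
    probe_to_gene: {probe_id: gene_id}; probes without an entry are dropped.
    """
    new_sets = defaultdict(list)
    for probes in original_probesets.values():  # disassemble
        for probe in probes:
            gene = probe_to_gene.get(probe)     # reannotate
            if gene is not None:
                new_sets[gene].append(probe)    # reassemble
    return dict(new_sets)

# Hypothetical original probesets of three probes each.
original = {
    "set_1": ["p1", "p2", "p3"],
    "set_2": ["p4", "p5", "p6"],
}
# Hypothetical updated annotation: p3 now maps to gene_B, p6 maps nowhere.
probe_to_gene = {
    "p1": "gene_A", "p2": "gene_A",
    "p3": "gene_B", "p4": "gene_B", "p5": "gene_B",
}

new_sets = reassemble_probesets(original, probe_to_gene)
print(new_sets)
```

In this toy case the new probesets hold two and three probes, which illustrates the note that reassembled probesets need not keep their original size.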


BrainArray CDF files [1]

Reannotation based on one of several genome databases
IDs are created as follows: ID from the gene the probeset refers to followed by ‘_at’ to resemble an Affymetrix ID
  For example: ENSG00000139618_at
When using these annotations in other tools, you have to remove the ‘_at’ additions in order to get recognisable IDs
Note that when using Entrez gene this means that the ID is composed of a number (Entrez gene ID) followed by ‘_at’, and as such looks exactly like a normal Affymetrix ID, but IT IS NOT

[1] http://arrayanalysis.mbni.med.umich.edu/arrayanalysis.html
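
Removing the ‘_at’ addition before passing IDs to other tools is a one-line string operation. The helper below is a minimal sketch; the name `strip_at_suffix` is made up for this example, and the numeric ID is only an illustration of an Entrez-style ID.

```python
def strip_at_suffix(probeset_id):
    """Remove a trailing '_at' from a custom-CDF probeset ID, if present."""
    suffix = "_at"
    if probeset_id.endswith(suffix):
        return probeset_id[: -len(suffix)]
    return probeset_id

print(strip_at_suffix("ENSG00000139618_at"))  # ENSG00000139618
print(strip_at_suffix("675_at"))              # 675 (illustrative Entrez-style ID)
print(strip_at_suffix("ENSG00000139618"))     # unchanged: ENSG00000139618
```

Only apply this to IDs that come from a custom CDF: an original Affymetrix ID stripped the same way would yield a number that is not an Entrez gene ID.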


Low intensity filtering

[Figure: plots before and after filtering; axes: average intensity, difference between groups]

Low intensity spots are more subject to noise
Filtering can be done at a later stage
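
A simple form of low intensity filtering keeps only probesets whose average log2 intensity over all arrays reaches a chosen threshold. The sketch below assumes that approach; the threshold of 6.0, the probeset names, and the values are hypothetical, and the talk does not prescribe a particular cut-off.

```python
from statistics import mean

def filter_low_intensity(expression, threshold):
    """Keep probesets whose mean log2 intensity is at least `threshold`.

    expression: {probeset_id: [log2 intensity per array, ...]}
    """
    return {
        probeset: values
        for probeset, values in expression.items()
        if mean(values) >= threshold
    }

# Hypothetical log2 intensities over four arrays.
expression = {
    "probeset_a": [4.1, 3.9, 4.3, 4.0],    # low signal, likely dominated by noise
    "probeset_b": [9.8, 10.1, 11.9, 12.2],
    "probeset_c": [6.9, 7.2, 7.0, 7.1],
}

kept = filter_low_intensity(expression, threshold=6.0)
print(sorted(kept))  # ['probeset_b', 'probeset_c']
```

Because the filter only needs normalised values, it can be applied after normalisation or even after the statistical analysis.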


AN AUTOMATED WORKFLOW


ArrayAnalysis.org

[Diagram: local machine, web server, calculation server]


http://www.arrayanalysis.org


Outcome of the workflow

Table and images of QC statistics

Affymetrix criteria:
  Sample prep controls Lys < Phe < Thr < Dap
  Lys present
  Beta actin 3’/5’ ≤ 3
  GAPDH 3’/5’ ≤ 1.25
  Hybridisation controls BioB < BioC < BioD < CreX
  BioB present
  Percentage present within 10%
  Background within 20 units
  Scaling factors within 3-fold from the average

In the table, red and blue indicate whether criteria are fulfilled
The images are taken from other data sets than the one you will be using
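
The criteria above can be turned into simple checks. The sketch below is one possible reading, not the workflow's own code: it treats "within 10%" and "within 20 units" as the maximum spread (max minus min) across arrays, and "within 3-fold from the average" as lying between average/3 and average*3. All function names, dictionary keys, and numbers are hypothetical.

```python
def strictly_increasing(values):
    """True if every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

def check_array(qc):
    """Per-array checks; the dictionary keys are hypothetical names."""
    return {
        # Intensities listed in the order Lys, Phe, Thr, Dap.
        "sample_prep_order": strictly_increasing(qc["sample_prep"]),
        # Intensities listed in the order BioB, BioC, BioD, CreX.
        "hyb_order": strictly_increasing(qc["hyb"]),
        "actin_ratio": qc["actin_3_5"] <= 3.0,
        "gapdh_ratio": qc["gapdh_3_5"] <= 1.25,
    }

def within_spread(values, max_spread):
    """True if max minus min over all arrays does not exceed max_spread."""
    return max(values) - min(values) <= max_spread

def scaling_factors_ok(factors, fold=3.0):
    """True if every factor lies within `fold` of the average factor."""
    average = sum(factors) / len(factors)
    return all(average / fold <= f <= average * fold for f in factors)

# Hypothetical QC values for a single array.
array_qc = {
    "sample_prep": [7.1, 8.4, 9.6, 11.2],
    "hyb": [8.0, 9.9, 11.5, 13.0],
    "actin_3_5": 1.6,
    "gapdh_3_5": 1.4,
}
print(check_array(array_qc))                     # only gapdh_ratio is False
print(within_spread([45.0, 48.5, 41.0], 10.0))   # percentage present: True
print(within_spread([52.0, 60.0, 75.0], 20.0))   # background: False
print(scaling_factors_ok([0.9, 1.2, 4.8]))       # True
```

The "present" calls for Lys and BioB are not covered here, since they depend on detection calls computed from the probe-level data.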


RNA degradation plot

Density plot


Boxplots


Virtual (spatial) images

MA plots


NUSE and RLE plot


Array correlation plot


Clustering and PCA plots


Perspectives

Future relevance of Affymetrix chips?
Data repositories / comparative research
The workflow is also available for local install in R
We will soon include a module for statistical analysis (and processing of other data types)





Quality Control (QC) of Microarrays


Nature, 2005


Project members

Lars Eijssen

Magali Jaillard

Michiel Adriaens

Philip de Groot

Chris Evelo

Thanks to:


THE AFTERNOON SESSION AND THE DATA SET


The afternoon session

In the afternoon session, you will be performing QC and pre-processing yourself
You will follow a stepwise guide available online at the course wiki
You will use an Affymetrix data set and make use of arrayanalysis.org


Short description of the data set (1)

Microarray experiments have to be uploaded to online repositories such as Gene Expression Omnibus (GEO, NCBI) or ArrayExpress (AE, EBI) upon publication
We will use a published [1] dataset available from AE

[1] Toxicogenomics of subchronic hexachlorobenzene exposure in Brown Norway rats. Ezendam J, Staedtler F, Pennings J, et al. Environ Health Perspect 112(7):782-91


Short description of the data set (2)

Hexachlorobenzene (HCB) is a persistent pollutant that is toxic for liver, neurons and the reproductive and immune systems
In this study, Brown Norway rats were fed a diet supplemented with HCB doses of 0, 150, or 450 mg/kg
Spleen, mesenteric lymph nodes (MLN), thymus, blood, liver, and kidney were analyzed using the Affymetrix rat RGU-34A GeneChip microarray
  13-17 arrays per tissue, max 6 per concentration
We will be primarily considering the liver data (17 arrays)