# R packages proposed for the differential analysis of high-throughput sequencing data involving replicates

Remarks:

1) Some packages build in a data normalization step that precedes the differential analysis and that can be difficult to bypass.

2) The main normalization techniques are detailed in the accompanying .ppt document.

______________________________
Package edgeR
______________________________

Principle:

edgeR assumes that the number $X$ of reads associated with gene $i$ and sample $j$ (of cDNA, RNA, …) under condition $k$ follows a negative binomial distribution, which can be written:

$$X_{ijk} \sim \mathrm{NB}(\mu_{ijk}, \phi)$$

where the $X_{ijk}$ are assumed independent, with:

$$E(X_{ijk}) = \mu_{ijk}$$

$$\mathrm{Var}(X_{ijk}) = \mu_{ijk}\,(1 + \phi\,\mu_{ijk})$$

where $\phi$ is an overdispersion parameter to be estimated. When $\phi = 0$, the model reduces to a Poisson distribution.
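To make the mean-variance relationship concrete, here is a minimal Python sketch. It assumes the usual mean/dispersion parameterization of the negative binomial, in which the variance equals μ(1 + φμ) and φ → 0 recovers the Poisson case. The function names (`nb_log_pmf`, `nb_moments`) and the numerical values are illustrative and are not part of edgeR.

```python
import math

def nb_log_pmf(x, mu, phi):
    """Log-probability of count x under an NB with mean mu and dispersion phi > 0."""
    r = 1.0 / phi  # "size" parameter of the negative binomial
    return (math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
            + r * math.log(r / (r + mu)) + x * math.log(mu / (r + mu)))

def nb_moments(mu, phi, upper=2000):
    """Recover mean and variance numerically by summing the pmf over 0..upper-1."""
    probs = [math.exp(nb_log_pmf(x, mu, phi)) for x in range(upper)]
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return mean, var

# Hypothetical values: mu = 20 reads, dispersion phi = 0.1
mean, var = nb_moments(mu=20.0, phi=0.1)
expected_var = 20.0 * (1 + 0.1 * 20.0)  # mu * (1 + phi * mu)
```

With a very small dispersion (for example `phi=1e-4`), the computed variance comes close to the mean, which is the Poisson behaviour.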

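We assume that $\mu_{ijk}$ can be written in the form:

$$\mu_{ijk} = \lambda_{ik}\, N_j$$

where $N_j$ is the library size (total number of reads) of sample $j$ and $\lambda_{ik}$ is the relative abundance of gene $i$ under condition $k$.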
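The decomposition of the mean into an abundance term and a library-size term can be illustrated as follows. This is a sketch under stated assumptions: the pooled estimator of λ (total reads for the gene divided by total library size) is a simple illustrative choice, and the counts and library sizes are made up.

```python
def pooled_abundance(counts, lib_sizes):
    """Estimate lambda_ik as total reads for the gene over total library size."""
    return sum(counts) / sum(lib_sizes)

def expected_counts(lam, lib_sizes):
    """mu_ijk = lambda_ik * N_j for every sample j of condition k."""
    return [lam * n for n in lib_sizes]

lib_sizes = [1_000_000, 2_000_000, 1_500_000]  # hypothetical N_j
counts = [18, 45, 27]                          # hypothetical reads for one gene
lam = pooled_abundance(counts, lib_sizes)      # 90 reads over 4.5 million
mu = expected_counts(lam, lib_sizes)           # expected reads per sample
```

Samples with deeper sequencing get proportionally larger expected counts, even though the underlying abundance is the same.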

The estimation of $\phi$ relies on two log-likelihoods:

- the log-likelihood of $\phi$ for gene $i$: $l_i(\phi)$
- the common log-likelihood of $\phi$: $l_c(\phi) = \sum_i l_i(\phi)$

A weighted version of the likelihood of $\phi$ is then defined as:

$$WL(\phi_i) = l_i(\phi_i) + \alpha\, l_c(\phi_i)$$

Two options are implemented (the choice is left to the user) for this estimation:

- "tag-wise dispersion": $\alpha = 0$
- "tag-wise + common dispersion": $\alpha > 0$
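The role of the weight α can be sketched with a small grid search. This is a simplified illustration, not edgeR's implementation: it assumes equal library sizes, fixes each gene's mean at its sample mean, uses a plain (unconditional) negative binomial log-likelihood, and requires every gene to have a positive mean. The names `nb_loglik` and `weighted_dispersions` and the data are hypothetical.

```python
import math

def nb_loglik(counts, phi):
    """NB log-likelihood l_i(phi) of one gene, mean fixed at the sample mean."""
    mu = sum(counts) / len(counts)
    r = 1.0 / phi
    return sum(
        math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
        + r * math.log(r / (r + mu)) + x * math.log(mu / (r + mu))
        for x in counts
    )

def weighted_dispersions(genes, alpha, grid):
    """For each gene, return the phi on `grid` maximizing l_i(phi) + alpha * l_c(phi)."""
    per_gene = [[nb_loglik(g, phi) for phi in grid] for g in genes]
    common = [sum(col) for col in zip(*per_gene)]  # l_c(phi) = sum_i l_i(phi)
    estimates = []
    for ll in per_gene:
        wl = [l + alpha * c for l, c in zip(ll, common)]
        estimates.append(grid[wl.index(max(wl))])
    return estimates

genes = [[10, 12, 9, 11], [5, 30, 2, 18], [100, 80, 150, 60]]  # hypothetical counts
grid = [0.01 * k for k in range(1, 201)]
tagwise = weighted_dispersions(genes, alpha=0.0, grid=grid)   # one estimate per gene
shared = weighted_dispersions(genes, alpha=1e9, grid=grid)    # dominated by l_c
```

With α = 0 each gene keeps its own estimate; as α grows, the common term dominates and all genes are pulled toward the same value.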

References:

Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics (2007) 23(21): 2881-2887.

Robinson MD, Smyth GK. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics (2008) 9(2): 321-332.

Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (2009).

Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology (2010) 11: R25.

Remarks:

1) edgeR is the counterpart of the limma test adapted to sequencing data (cf. the moderated t-test by the same author).

2) TMM normalization is implemented in the package.

______________________________
Package DESeq

_______________________________________

Principle: algorithm of Anders & Huber

Let $X_{ij}$ be the number of reads corresponding to gene $i$ and sample $j$. We assume:

$$X_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma^2_{ij})$$

Idea:

The biological variance associated with a gene is a smooth function of its expression level in the condition considered.

Three assumptions:

1) The expected read count depends on a term specific to the gene and to the associated condition, and can be written as follows:

$$E[X_{ij}] = \mu_{ij} = q_{i,\rho(j)}\, s_j$$

where $\rho(j)$ denotes the condition of sample $j$ and $s_j$ represents the coverage, or sequencing depth, of library $j$.

2) Overall variance of the gene = shot noise (technical variance) + raw variance (biological replicates):

$$\sigma^2_{ij} = \mu_{ij} + s_j^2\, \nu_{i,\rho(j)}$$

3) The parameter $\nu_{i,\rho(j)}$ associated with the variance of the gene is a smooth function of the form:

$$\nu_{i,\rho(j)} = \nu_\rho(q_{i,\rho(j)})$$

It is estimated by pooling the data from genes with similar expression levels (shrinkage).
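The first two assumptions can be sketched in a few lines of Python. The text does not say how $s_j$ is estimated; the median-of-ratios estimator used here is the one described by Anders & Huber and should be read as an assumption brought in from that paper. The function names and the count matrix are hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors; counts is a list of gene rows, one column per sample.

    Genes with any zero count are skipped because their geometric mean is zero.
    """
    rows = [row for row in counts if all(x > 0 for x in row)]
    n = len(counts[0])
    log_geo = [sum(math.log(x) for x in row) / n for row in rows]
    return [
        math.exp(median(math.log(row[j]) - g for row, g in zip(rows, log_geo)))
        for j in range(n)
    ]

def model_variance(q, s, nu):
    """sigma^2_ij = mu_ij + s_j^2 * nu, with mu_ij = q * s_j (shot noise + raw variance)."""
    mu = q * s
    return mu + s ** 2 * nu

counts = [[10, 20], [30, 60], [5, 10]]   # hypothetical: sample 2 sequenced twice as deep
s = size_factors(counts)
sigma2 = model_variance(q=10.0, s=2.0, nu=3.0)   # 20 (shot noise) + 12 (raw variance)
```

Because the second sample has exactly twice the counts of the first, its size factor is twice as large.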

In practice, with $m$ the number of samples:

Overall expression (baseMean): [formula missing]

Fold change: [formula missing]
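The formulas for these two quantities are not given in the text. The sketch below assumes the convention commonly associated with DESeq output: baseMean is the mean over the $m$ samples of counts divided by their size factors, and the fold change is the ratio of the per-condition normalized means. Both definitions, the function names and the data are assumptions for illustration.

```python
def base_mean(counts_row, size_factors):
    """Mean of size-factor-normalized counts over the m samples."""
    m = len(counts_row)
    return sum(x / s for x, s in zip(counts_row, size_factors)) / m

def fold_change(counts_row, size_factors, conditions, a, b):
    """Ratio of normalized means: condition b over condition a."""
    def cond_mean(c):
        vals = [x / s for x, s, k in zip(counts_row, size_factors, conditions) if k == c]
        return sum(vals) / len(vals)
    return cond_mean(b) / cond_mean(a)

row = [10, 20, 40, 80]               # hypothetical counts for one gene
sf = [1.0, 2.0, 1.0, 2.0]            # hypothetical size factors
cond = ["A", "A", "B", "B"]
bm = base_mean(row, sf)
fc = fold_change(row, sf, cond, "A", "B")
```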

Variance estimation

1) All computations are carried out after excluding genes with at least one zero expression value.

2) The model estimates one function $\nu_\rho$ for each condition $\rho$.

3) The function varianceFitDiagnostics is used to check that the estimated variance is not too far from the empirical variance of the $q_{i,\rho}$.
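The steps above can be sketched for one condition as follows. This is a simplified illustration, not the package's code: it computes, per gene, the normalized mean and an estimate of the raw variance (sample variance of normalized counts minus a shot-noise term, following the estimator described by Anders & Huber), and it does not reproduce the smoothing step that turns these points into the function $\nu_\rho$. The names and data are hypothetical.

```python
def raw_variance_estimates(counts, size_factors):
    """Per-gene (q_hat, raw variance estimate) for the samples of one condition.

    Genes with any zero count are excluded, as in step 1 of the text.
    """
    m = len(size_factors)
    mean_inv_s = sum(1.0 / s for s in size_factors) / m
    out = []
    for row in counts:
        if any(x == 0 for x in row):
            continue
        norm = [x / s for x, s in zip(row, size_factors)]
        q_hat = sum(norm) / m
        w = sum((v - q_hat) ** 2 for v in norm) / (m - 1)  # sample variance
        z = q_hat * mean_inv_s                             # shot-noise contribution
        out.append((q_hat, w - z))
    return out

counts = [[10, 20, 30], [0, 5, 7], [100, 100, 100]]  # hypothetical counts
estimates = raw_variance_estimates(counts, [1.0, 1.0, 1.0])
```

Individual estimates can be negative, as for the third gene here, which is one reason the per-gene values are smoothed across genes of similar expression rather than used directly.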

References:

Anders S, Huber W. Differential expression analysis for sequence count data. Nature Precedings (2010).

In Bioconductor: "Analysing RNA-Seq data with the 'DESeq' package" (Anders S.).

___________________________________________________________________________________________

The other packages in brief

GPseq

“Using the generalized Poisson distribution to model sequence read counts from high throughput sequencing experiments”

Abstract

Deep sequencing of RNAs (RNA-seq) has been a useful tool to characterize and quantify transcriptomes. However, there are significant challenges in the analysis of RNA-seq data, such as how to separate signals from sequencing bias and how to perform reasonable normalization. Here, we focus on a fundamental question in RNA-seq analysis: the distribution of the position-level read counts. Specifically, we propose a two-parameter generalized Poisson (GP) model to the position-level read counts. We show that the GP model fits the data much better than the traditional Poisson model. Based on the GP model, we can better estimate gene or exon expression, perform a more reasonable normalization across different samples, and improve the identification of differentially expressed genes and the identification of differentially spliced exons. The usefulness of the GP model is demonstrated by applications to multiple RNA-seq data sets.
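For readers unfamiliar with the two-parameter generalized Poisson distribution, here is a minimal sketch of its probability mass function in Consul's form, with parameters θ > 0 and 0 ≤ λ < 1. The function names and parameter values are illustrative; this is not GPseq's code, and the package's own parameterization may differ.

```python
import math

def gp_log_pmf(x, theta, lam):
    """Log-pmf of the generalized Poisson: theta*(theta+x*lam)^(x-1)*exp(-theta-x*lam)/x!"""
    return (math.log(theta) + (x - 1) * math.log(theta + x * lam)
            - (theta + x * lam) - math.lgamma(x + 1))

def gp_moments(theta, lam, upper=2000):
    """Total probability, mean and variance obtained by summing the pmf over 0..upper-1."""
    probs = [math.exp(gp_log_pmf(x, theta, lam)) for x in range(upper)]
    total = sum(probs)
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return total, mean, var

total, mean, var = gp_moments(theta=5.0, lam=0.3)  # hypothetical parameters
# Theory: mean = theta / (1 - lam), variance = theta / (1 - lam)**3
```

Setting `lam=0` gives the ordinary Poisson (mean equal to variance); a positive `lam` inflates the variance relative to the mean, which is what lets the model absorb overdispersed position-level counts.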

References

Consul PC (1989). Generalized Poisson Distributions: Properties and Applications. New York: Marcel Dekker.

Srivastava S, Chen L. A two-parameter generalized Poisson model to improve the analysis of RNA-Seq data. Nucleic Acids Research, Advance Access published July 29, 2010. doi:10.1093/nar/gkq670

Remark

The package proposes a specific normalization based on per-position read counts.

baySeq (included in Bioconductor)

“Empirical Bayesian analysis of patterns of differential expression in count data”

Introduction

We assume that we have discrete data from a set of sequencing or other high-throughput experiments, arranged in a matrix such that each column describes a sample and each row describes some entity for which counts exist. For example, the rows may correspond to the different sequences observed in a sequencing experiment. The data then consists of the number of times each sequence is observed in each sample. We wish to determine which, if any, rows of the data correspond to some patterns of differential expression across the samples. This problem has been addressed for pairwise differential expression by the edgeR package. However, baySeq takes an alternative approach to analysis that allows more complicated patterns of differential expression than simple pairwise comparison, and thus is able to cope with more complex experimental designs. We also observe that the methods implemented in baySeq perform at least as well, and in some circumstances considerably better than those implemented in edgeR.

baySeq uses empirical Bayesian methods to estimate the posterior likelihoods of each of a set of models that define patterns of differential expression for each row. This approach begins by considering a distribution for the row defined by a set of underlying parameters for which some prior distribution exists. By estimating this prior distribution from the data, we are able to assess, for a given model about the relatedness of our underlying parameters for multiple libraries, the posterior likelihood of the model.

In forming a set of models upon the data, we consider which patterns are biologically likely to occur in the data. For example, suppose we have count data from some organism in condition A and condition B. Suppose further that we have two biological replicates for each condition, and hence four libraries A1, A2, B1, B2, where A1, A2 and B1, B2 are the replicates. It is reasonable to suppose that at least some of the rows may be unaffected by our experimental conditions A and B, and the count data for each sample in these rows will be equivalent. These data need not in general be identical across each sample due to random effects and different library sizes, but they will share the same underlying parameters. However, some of the rows may be influenced by the different experimental conditions A and B. The count data for the samples A1 and A2 will then be equivalent, as will the count data for the samples B1 and B2. However, the count data between samples A1, A2, B1, B2 will not be equivalent. For such a row, the data from samples A1 and A2 will then share the same set of underlying parameters, the data from samples B1 and B2 will share the same set of underlying parameters, but, crucially, the two sets will not be identical.

Our task is thus to determine the posterior likelihood of each model for each row of the data. We can do this by considering either a Poisson or negative-binomial distribution upon the sequencing count data. The Poisson method is considerably faster as a closed form conjugate prior exists for this distribution. The negative-binomial solution is slower as it requires a numerical solution for the prior, but is probably a better fit for most data. In experimental data, we have found that the Poisson method is likely to give poor results if true biological replicates are not available; in most human studies, for example. In general, therefore, the use of the negative-binomial methods is recommended.
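The closed-form Poisson case mentioned above can be illustrated with a Gamma prior on the Poisson rate. The sketch below compares the two models of the A/B example (all four libraries share one rate, versus A and B each have their own rate) and returns the posterior probability of the second. It is a simplified illustration: the prior parameters `a`, `b` and the prior model probability `prior_de` are fixed by hand here, whereas baySeq estimates its priors empirically from the data; the function names and counts are hypothetical.

```python
import math

def log_marginal(counts, lib_sizes, a, b):
    """Log marginal likelihood of Poisson counts sharing one Gamma(shape=a, rate=b) rate.

    Each count x_j is Poisson with mean theta * lib_sizes[j]; theta is integrated out.
    """
    s, l = sum(counts), sum(lib_sizes)
    const = sum(x * math.log(n) - math.lgamma(x + 1) for x, n in zip(counts, lib_sizes))
    return (const + a * math.log(b) - math.lgamma(a)
            + math.lgamma(a + s) - (a + s) * math.log(b + l))

def posterior_de(counts, lib_sizes, groups, a=2.0, b=0.1, prior_de=0.5):
    """Posterior probability that the groups have different rates (versus one shared rate)."""
    null = log_marginal(counts, lib_sizes, a, b)
    alt = 0.0
    for g in sorted(set(groups)):
        idx = [j for j, k in enumerate(groups) if k == g]
        alt += log_marginal([counts[j] for j in idx], [lib_sizes[j] for j in idx], a, b)
    log_odds = math.log(prior_de / (1 - prior_de)) + alt - null
    return 1.0 / (1.0 + math.exp(-log_odds))

lib_sizes = [1.0, 1.2, 0.9, 1.1]   # hypothetical library sizes, in millions of reads
groups = ["A", "A", "B", "B"]
p_same = posterior_de([20, 24, 18, 22], lib_sizes, groups)  # counts track library size
p_diff = posterior_de([10, 12, 50, 60], lib_sizes, groups)  # B much higher than A
```

When the counts simply follow the library sizes, the simpler shared-rate model is favoured; a clear shift between A and B pushes the posterior toward the differential-expression model.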

Reference

Hardcastle TJ, Kelly KA. baySeq: Empirical Bayesian Methods For Identifying Differential Expression In Sequence Count Data. BMC Bioinformatics (2010).

Remark

No built-in normalization.

NBPseq

(release in progress)

Details

Overview:
For assessing evidence for differential gene expression from RNA-Seq read counts, it is critical to adequately model the count variability between independent biological replicates.

Negative binomial (NB) distribution offers a more realistic model for RNA-Seq count variability than Poisson distribution and still permits an exact (non-asymptotic) test for comparing two groups.

For each individual gene, a NB distribution uses a dispersion parameter $\phi_i$ to model the extra-Poisson variation between biological replicates. Across all genes, the NBP parameterization of the NB distribution (the NBP model) uses two parameters $(\phi, \alpha)$ to model extra-Poisson variation over the entire range of expression levels. The NBP model allows the NB dispersion parameter to be an arbitrary power function of the mean. The NBP model includes the Poisson model as a limiting case (as $\phi$ tends to 0) and the NB2 model as a special case (when $\alpha = 2$).

Under the NB2 model, the dispersion parameter is a constant and does not vary with the mean expression levels. NBP model is more flexible and is the recommended default option.
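The relationship between the NBP, NB2 and Poisson models can be sketched numerically. The code assumes a variance function of the form μ + φ·μ^α, which is consistent with the description above (dispersion as a power function of the mean, constant when α = 2, Poisson as φ tends to 0); the exact scale on which NBPseq expresses the mean is not stated here, so treat the formula and the function names as assumptions.

```python
def nbp_variance(mu, phi, alpha):
    """Variance under the assumed NBP form: mu + phi * mu**alpha."""
    return mu + phi * mu ** alpha

def nbp_dispersion(mu, phi, alpha):
    """Implied NB dispersion, a power function of the mean: phi * mu**(alpha - 2)."""
    return phi * mu ** (alpha - 2)

# Hypothetical values
poisson_like = nbp_variance(10.0, 0.0, 1.5)          # phi = 0: variance equals the mean
nb2_low = nbp_dispersion(10.0, 0.1, 2)               # alpha = 2: dispersion is constant
nb2_high = nbp_dispersion(1000.0, 0.1, 2)
nbp_low = nbp_dispersion(10.0, 0.1, 1.5)             # alpha < 2: dispersion falls as mean rises
nbp_high = nbp_dispersion(1000.0, 0.1, 1.5)
```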

Count Normalization:
We take gene expression to be indicated by relative frequency of RNA-Seq reads mapped to a gene, relative to library sizes (column sums of the count matrix). Since the relative frequencies sum to 1 in each library (one column of the count matrix), the increased relative frequencies of truly over-expressed genes in each column must be accompanied by decreased relative frequencies of other genes, even when those others are not truly differentially expressed. Robinson and Oshlack (2010) presented examples where this problem is noticeable.

A simple fix is to compute the relative frequencies relative to effective library sizes: library sizes multiplied by normalization factors. By default, nbp.test assumes the normalization factors are 1 (i.e. no normalization is needed). Users can specify normalization factors through the argument norm.factors. Many authors (Robinson and Oshlack (2010), Anders and Huber (2010)) propose to estimate the normalization factors based on the assumption that most genes are NOT differentially expressed.
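The notion of effective library size can be sketched as follows. The helper name `relative_frequencies` and the data are hypothetical; only the idea (column sums multiplied by normalization factors, defaulting to 1) comes from the text.

```python
def relative_frequencies(counts, norm_factors=None):
    """counts: rows = genes, columns = libraries.

    Returns (frequencies, effective library sizes), where the effective size of a
    library is its column sum multiplied by its normalization factor.
    """
    n_libs = len(counts[0])
    if norm_factors is None:
        norm_factors = [1.0] * n_libs  # default described in the text: no normalization
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_libs)]
    eff = [n * f for n, f in zip(lib_sizes, norm_factors)]
    freqs = [[row[j] / eff[j] for j in range(n_libs)] for row in counts]
    return freqs, eff

counts = [[10, 30], [90, 70]]                         # hypothetical 2 genes x 2 libraries
freqs, eff = relative_frequencies(counts)             # columns sum to 1
freqs2, eff2 = relative_frequencies(counts, [1.0, 2.0])
```

With a normalization factor of 2 for the second library, its relative frequencies are halved and no longer sum to 1, which is how the factor compensates for a few highly expressed genes dominating a library.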

The exact test requires that the effective library sizes (column sums of the count matrix multiplied by normalization factors) are approximately equal. By default, nbp.test will thin (downsample) the counts to make the effective library sizes equal. Thinning may lose statistical efficiency, but is unlikely to introduce bias.
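Thinning can be pictured as random downsampling of reads. The sketch below keeps each read of library $j$ independently with probability (smallest effective size) / (effective size of $j$). This binomial scheme is an illustrative assumption and may differ from what nbp.test does internally; `thin_counts` and the data are hypothetical.

```python
import random

def thin_counts(counts, eff_sizes, seed=0):
    """Binomially downsample each column so expected effective sizes match the smallest."""
    rng = random.Random(seed)
    target = min(eff_sizes)
    keep = [target / e for e in eff_sizes]  # per-library retention probability
    return [
        [sum(rng.random() < keep[j] for _ in range(row[j])) for j in range(len(row))]
        for row in counts
    ]

counts = [[100, 200], [50, 100]]   # hypothetical 2 genes x 2 libraries
eff = [150.0, 300.0]               # effective library sizes
thinned = thin_counts(counts, eff, seed=1)
```

The library with the smallest effective size is left untouched; the others lose reads at random, which costs some information but does not favour any gene.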

Reference:

Di Y, Schafer DW, Cumbie JS, Chang JH. "The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNA-Seq", SAGMB, accepted.

Samseqr

(release in progress: the counterpart of SAM for microarrays, by the same authors …)

Description

This package implements a method for normalization, testing, and false discovery rate estimation for RNA-sequencing data. We estimate the sequencing depths of the experiments using a new method based on Poisson goodness-of-fit statistic, calculate a score statistic on the basis of a Poisson log-linear model, and then estimate the false discovery rate using a modified version of permutation plug-in method.
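As background for the last step, here is a sketch of the standard (unmodified) permutation plug-in estimate of the false discovery rate: the average number of permuted statistics exceeding a cutoff, divided by the number of observed statistics exceeding it. The modification used by the package is not described here and is not reproduced; the function name and the statistics are hypothetical, and the permuted statistics are assumed to have been computed beforehand.

```python
def plugin_fdr(observed, permuted, cutoff):
    """Plug-in FDR estimate at `cutoff`.

    observed: list of per-gene statistics.
    permuted: list of lists, one list of per-gene statistics per permutation.
    """
    called = sum(abs(t) >= cutoff for t in observed)
    false_calls = sum(
        sum(abs(t) >= cutoff for t in perm) for perm in permuted
    ) / len(permuted)
    return min(1.0, false_calls / max(called, 1))

observed = [5.0, 0.5, 3.0, 0.1]                            # hypothetical score statistics
permuted = [[1.0, 0.2, 2.5, 0.3], [0.4, 2.1, 0.1, 0.9]]    # two hypothetical permutations
fdr = plugin_fdr(observed, permuted, cutoff=2.0)           # 1 expected false call / 2 calls
```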

Reference:

Li J, Witten DM, Johnstone I, Tibshirani R (2011). Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Submitted.