Analysis of SAGE Data: An Introduction - MD Anderson Bioinformatics

Oct 2, 2013

Analysis of SAGE Data:

An Introduction

Kevin R. Coombes

Section of Bioinformatics

Outline


Description of SAGE method


Preliminary bioinformatics issues


Description of analysis methods
introduced in an early paper


Review of literature: statistics and SAGE

What is SAGE?


Serial Analysis of Gene Expression


Method to quantify gene expression
levels in samples of cells


Open system


Can potentially reveal expression levels of
all genes: “unbiased” and “comprehensive”


Microarrays are closed, since they only tell
you about the genes spotted on the array

Ref: Velculescu et al., Science 1995; 270:484-487

How does SAGE work?

1. Isolate mRNA.

2.(a) Add biotin-labeled dT primer.

2.(b) Synthesize ds cDNA.

3.(a) Bind to streptavidin-coated beads.

3.(b) Cleave with “anchoring enzyme”.

3.(c) Discard loose fragments.

4.(a) Divide into two pools and add linker sequences.

4.(b) Ligate.

5. Cleave with “tagging enzyme”.

6. Combine pools and ligate.

7. Amplify ditags, then cleave with anchoring enzyme.

8. Ligate ditags.

9. Sequence and record the tags and frequencies.

From ditags to counts


Locate the punctuation “CATG”


Extract ditags of length 20-26 between the
punctuation


Discard duplicate ditags (including in reverse
direction), which are probably PCR artifacts


Take extreme 10 bases as the two tags,
reversing the right-hand tag


Discard linker sequences


Count occurrences of each tag

SAGE software available at http://www.sagenet.org
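
The extraction steps above can be sketched in Python. This is a minimal illustration, not the SAGE software itself: the function names and the `linker_tags` parameter are hypothetical, and "reversing" the right-hand tag is assumed here to mean taking its reverse complement.

```python
from collections import Counter

ANCHOR = "CATG"  # the "punctuation" separating ditags


def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def extract_ditags(concatemer, min_len=20, max_len=26):
    """Return the pieces flanked by CATG on both sides with ditag length."""
    pieces = concatemer.split(ANCHOR)
    # The first and last pieces are not flanked on both sides, so skip them.
    return [p for p in pieces[1:-1] if min_len <= len(p) <= max_len]


def count_tags(concatemers, tag_len=10, linker_tags=frozenset()):
    """Count tags, discarding duplicate ditags and linker-derived tags."""
    seen = set()
    counts = Counter()
    for concatemer in concatemers:
        for ditag in extract_ditags(concatemer):
            # A ditag and its reverse complement count as the same ditag.
            key = min(ditag, revcomp(ditag))
            if key in seen:
                continue  # assumed PCR artifact
            seen.add(key)
            left = ditag[:tag_len]
            right = revcomp(ditag[-tag_len:])  # assumption: reverse complement
            for tag in (left, right):
                if tag not in linker_tags:
                    counts[tag] += 1
    return counts
```

Calling `count_tags` on a list of sequenced concatemers returns a `Counter` keyed by 10-base tag, which is the tag/frequency table that the later analysis methods start from.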

What does the data look like?

From tags to genes


Collect sequence records from GenBank that
are represented in UniGene


Assign sequence orientation (by finding poly-A
tail or poly-A signal or from annotations)


Extract the 10 bases 3’-adjacent to the 3’-most CATG


Assign UniGene identifier to each sequence
with a SAGE tag


Record (for each tag-gene pair)


#sequences with this tag


#sequences in gene cluster with this tag

Maps available at http://www.ncbi.nlm.nih.gov/SAGE
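
The tag-extraction and counting steps can be sketched as follows. This is an illustration under stated assumptions, not the NCBI pipeline: sequences are assumed to be already oriented 5' to 3', and the function names and cluster identifiers such as "Hs.1" are hypothetical.

```python
from collections import Counter


def virtual_tag(seq, anchor="CATG", tag_len=10):
    """Return the tag_len bases 3'-adjacent to the 3'-most anchor, or None."""
    seq = seq.upper()
    pos = seq.rfind(anchor)  # 3'-most occurrence in a 5'->3' sequence
    if pos == -1:
        return None
    start = pos + len(anchor)
    tag = seq[start:start + tag_len]
    # Skip sequences too short to give a full-length tag.
    return tag if len(tag) == tag_len else None


def tag_gene_counts(records):
    """records: iterable of (cluster_id, oriented_sequence) pairs.

    Returns (tag_counts, pair_counts): the number of sequences with each tag,
    and the number of sequences in each gene cluster with that tag.
    """
    tag_counts = Counter()
    pair_counts = Counter()
    for cluster_id, seq in records:
        tag = virtual_tag(seq)
        if tag is None:
            continue
        tag_counts[tag] += 1
        pair_counts[(tag, cluster_id)] += 1
    return tag_counts, pair_counts
```

Comparing the two counts for a tag shows how much of its support comes from one cluster, which is one way to judge whether a tag-gene assignment is ambiguous.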

From tags to genes


Ideal situation:


one gene = one tag


True situation


one gene = many tags (alternative
splicing; alternative polyadenylation)


one tag = many genes (conserved 3’
regions)

Sequencing Errors


Estimated sequencing error rate:


0.7% per base (range 0.2%-1%)


Affect


ditags in a SAGE experiment


can improve by using phred scores and
discarding ambiguous sequences


tag-gene mappings from GenBank


RNA better than EST
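
To see what a per-base error rate means for a whole tag, here is a small worked calculation. It assumes errors at different bases are independent, which the slides do not state; the function name is hypothetical.

```python
def tag_error_probability(per_base_error, tag_len=10):
    """Chance that a tag of tag_len bases contains at least one error,
    assuming independent errors at each base."""
    return 1.0 - (1.0 - per_base_error) ** tag_len


for rate in (0.002, 0.007, 0.01):
    print(f"{rate:.1%} per base -> {tag_error_probability(rate):.1%} per 10-base tag")
```

Under this assumption, a 0.7% per-base rate corresponds to roughly 7% of 10-base tags carrying at least one wrong base.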

Reliable tag-gene assignments

SAGE and cancer


Ten SAGE libraries, two each from


normal colon


colon tumors


colon cancer cell lines


pancreatic tumors


pancreatic cell lines


Pooled each pair

Ref: Zhang et al., Science 1997; 276:1268-1272

Variability in SAGE libraries

Distribution of tags


303,706 total tags


48,471 distinct tags


Distribution


85.9% seen up to 5 times (25% of mass)


12.7% between 5 and 50 times (30%)


1.3% between 50 and 500 times (26%)


0.1% more than 500 times (19%)

Ref: Zhang et al., Science 1997; 276:1268-1272

How many tags were missed?


They simulated to find 92% chance of
detecting tags at 3 copies/cell


Using binomial approximation


Get 95% chance for 3 copies/cell


Only get 63% chance for 1 copy/cell


Most of what they saw occurred at 1-5
copies per cell
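
The binomial figures above can be reproduced with a short calculation. The total of 303,706 tags comes from the tag distribution slide; the figure of 300,000 transcripts per cell is an assumption made for this sketch and does not appear in the slides.

```python
def detection_probability(copies_per_cell,
                          tags_sequenced=303_706,
                          transcripts_per_cell=300_000):  # assumed value
    """Chance of seeing a transcript at least once among the sequenced tags."""
    p = copies_per_cell / transcripts_per_cell  # chance a given tag is this one
    return 1.0 - (1.0 - p) ** tags_sequenced


for copies in (1, 3):
    print(copies, "copies/cell:", round(detection_probability(copies), 3))
```

With these inputs the chance comes out near 64% at 1 copy per cell and 95% at 3 copies per cell, in line with the 63% and 95% quoted above.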

Differential Expression


Found 289 tags differentially expressed
between normal colon and colon cancer
(181 decreased; 108 increased)


Method: Monte Carlo simulation.


100,000 simulations per transcript to estimate the
relative likelihood of seeing the observed difference


Used observed distribution of transcripts to
simulate 40 experiments.

Ref: Zhang et al., Science 1997; 276:1268-1272

Sensitivity


Claim: 95% chance of detecting 6-fold
difference


Method: Monte Carlo


200 simulations, assuming abundance of
0.0001 in first sample and 0.0006 in
second sample

Ref: Zhang et al., Science 1997; 276:1268-1272

Weaknesses in Analysis


Failed to account for intrinsic variability
in samples (which changes depending
on abundance) in assessing significance


Monte Carlo used the observed distribution,
which is definitely not the true distribution.


Sensitivity only measured at one
abundance level.

Alternative Analytic Methods


Audic and Claverie, Genome Res 1997; 7:986-995


Chen et al., J Exp Med 1998; 188:1657-1668


Kal et al., Mol Biol of Cell 1999; 10:1859-1872


Michiels et al., Physiol Genomics 1999; 1:83-91


Stollberg et al., Genome Res 2000; 10:1241-1248


Man et al., Bioinformatics 2000; 16:953-959

Audic and Claverie


Main goal: confidence limits for
differential expression


Use Poisson approximation for the number
of times x you see the same tag.


Put a uniform prior on the Poisson
parameter; get the posterior probability of
seeing the tag y times in a new experiment

p(y | x) = (x + y)! / [x! y! 2^(x + y + 1)]


Generalizes to unequal sample sizes
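
The formula above is easy to evaluate directly. The sketch below also includes the unequal-size form, p(y | x) = (N2/N1)^y (x + y)! / [x! y! (1 + N2/N1)^(x + y + 1)], written from memory of the Audic and Claverie paper, so it should be checked against the original. The function name and the parameters `n1` and `n2` (the library sizes) are hypothetical.

```python
import math


def audic_claverie(x, y, n1=1, n2=1):
    """Probability of seeing a tag y times in a library of size n2,
    given it was seen x times in a library of size n1."""
    r = n2 / n1
    # Work in logs so that large counts do not overflow the factorials.
    log_p = (y * math.log(r)
             + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
             - (x + y + 1) * math.log(1 + r))
    return math.exp(log_p)


# Equal library sizes: chance of seeing a tag 10 times after seeing it twice.
print(audic_claverie(2, 10))
```

Summing `audic_claverie(x, y)` over a range of `y` values gives the tail probabilities from which confidence limits can be read off.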

Chen et al.


Assume


equal sample sizes


tag has concentration X, Y in two samples


Look at W = X/(X+Y)


Use a symmetric Beta prior distribution with a
peak near 0.5 (since most genes don’t
change)


Use Bayes’ theorem to compute posterior
probability of threefold difference in
expression
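
A rough sketch of this kind of calculation follows. It is not Chen et al.'s exact procedure: it assumes that, given observed counts x and y in equal-size libraries, x is binomial with parameter W, so a symmetric Beta(a, a) prior gives a Beta(a + x, a + y) posterior. The prior parameter `a = 3`, the function names, and the simple midpoint-rule integration are all choices made for this example.

```python
import math


def beta_pdf(w, a, b):
    """Density of the Beta(a, b) distribution at 0 < w < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(w) + (b - 1) * math.log(1 - w))


def integrate(f, lo, hi, steps=20000):
    """Midpoint rule; never evaluates f at the endpoints 0 or 1."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


def prob_fold_change(x, y, fold=3.0, a=3.0):
    """Posterior probability that expression differs by at least `fold`
    in either direction, given counts x and y (equal sample sizes)."""
    cut = fold / (fold + 1.0)  # X >= 3Y is the same as W >= 0.75

    def posterior(w):
        return beta_pdf(w, a + x, a + y)

    return integrate(posterior, cut, 1.0) + integrate(posterior, 0.0, 1.0 - cut)


print(prob_fold_change(30, 3))   # counts that differ tenfold
print(prob_fold_change(10, 10))  # identical counts
```

A larger `a` concentrates the prior more tightly around W = 0.5 and so demands stronger evidence before reporting a threefold change.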

Unequal sample sizes


This analysis generalizes easily to the
case of unequal size SAGE libraries


Lal et al., Cancer Res 1999; 59:5403-5407


This method is used at the NCBI
SAGEmap web site for online differential
expression queries


http://www.ncbi.nlm.nih.gov/SAGE

Kal et al.


Assume the proportion of times you see
a tag has binomial distribution


Replace with a normal approximation to
compute confidence limits


Used at http://www.cmbi.kun.nl/usage


Equivalent to chi-square test on the 2x2
table of counts (this tag vs. all other tags,
in each of the two libraries)
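
The equivalence can be checked numerically. The sketch below uses the standard two-proportion z-test with a pooled proportion and the Pearson chi-square statistic without continuity correction; the function names and the example counts are made up for illustration.

```python
import math


def z_statistic(x1, n1, x2, n2):
    """Normal-approximation test for tag counts x1 of n1 versus x2 of n2."""
    p1, p2 = x1 / n1, x2 / n2
    p0 = (x1 + x2) / (n1 + n2)  # pooled proportion under the null hypothesis
    se = math.sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def chi_square_2x2(x1, n1, x2, n2):
    """Pearson chi-square statistic (no continuity correction) for the table
    [[x1, n1 - x1], [x2, n2 - x2]]."""
    a, b, c, d = x1, n1 - x1, x2, n2 - x2
    n = n1 + n2
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def two_sided_p(z):
    """Two-sided p-value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


z = z_statistic(30, 50_000, 10, 50_000)
print(z ** 2, chi_square_2x2(30, 50_000, 10, 50_000), two_sided_p(z))
```

The squared z statistic and the chi-square statistic agree, so the two tests give the same p-value for a two-sided comparison.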

Michiels et al.


First perform overall chi-square test to
decide if the two SAGE libraries being
compared are different.


Get significance by Monte Carlo
simulation


Perform gene-by-gene chi-square tests
and use them to rank genes in order of
“most likely to be different”

Stollberg et al.


Assume binomial distributions


Model the binomial parameters as a
sum of two exponentials


fit to the Zhang step function data


Simulate from this model, adding


sequencing errors


nonuniqueness of tags


nonrandomness of DNA sequences

Stollberg et al.


Key finding:


Naively using observed data to fit model
parameters cannot recover the observed
data by simulation


Maximum likelihood estimates of the
parameters that do recover the observed data
look very different

Stollberg et al.

Man et al.


Compares specificity and sensitivity of
different tests for differential expression


Audic and Claverie


Kal


Fisher’s exact test


Monte Carlo simulation of experiments


Findings


Similar power at high abundance


Kal has highest power at low abundance
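
Fisher's exact test, one of the tests compared above, can be sketched for a single tag as follows. This is a generic two-sided version that sums the probabilities of all tables no more likely than the observed one; it is not taken from Man et al., and the function names are hypothetical.

```python
import math


def log_comb(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(x1, n1, x2, n2):
    """Two-sided Fisher exact p-value for a tag seen x1 times among n1 tags
    in one library and x2 times among n2 tags in the other."""
    total = x1 + x2
    lo, hi = max(0, total - n2), min(total, n1)
    log_denom = log_comb(n1 + n2, total)

    def prob(k):
        # Hypergeometric probability of k occurrences in library 1.
        return math.exp(log_comb(n1, k) + log_comb(n2, total - k) - log_denom)

    p_obs = prob(x1)
    # Small tolerance so ties are not lost to floating-point rounding.
    p_value = sum(p for p in map(prob, range(lo, hi + 1))
                  if p <= p_obs * (1 + 1e-9))
    return min(1.0, p_value)


print(fisher_exact_two_sided(30, 50_000, 10, 50_000))
```

Because the test conditions on the total count for the tag, its cost depends on that total rather than on the library sizes, so it stays cheap even for libraries of hundreds of thousands of tags.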


Questions


Sample size computations:


How many tags should we sequence if we
want to see tags of a given frequency?


How many tags should we sequence if we
want to see a given percentage of tags?


How many tags are expressed in a
sample?


Best method for identifying differential
expression?

Additional SAGE references


Review


Madden et al., Drug Disc Today 2000; 5:415-425


Online Tools


Lash et al., Genome Res 2000; 10:1051-1060


van Kampen et al., Bioinformatics 2000; 16:899-905


Comparison of SAGE and Affymetrix


Ishii et al., Genomics 2000; 68:136-143


Combine SAGE and custom microarrays


Nacht et al., Cancer Res 1999; 59:5464-5470


Mapping SAGE data onto genome


Caron et al., Science 2001; 291:1289-1292


Data mining the public SAGE libraries


Argani et al., Cancer Res 2001; 61:4320-4324