A Look into Bioinformatics

Oct 4, 2013

RNA-seq: Quantifying the Transcriptome

Alisha Holloway, PhD

Gladstone Bioinformatics Core Director

What is RNA-seq?

Use of high-throughput sequencing technologies to assess the RNA content of a sample.

Why do an RNA-seq experiment?

- Detect differential expression
- Assess allele-specific expression
- Quantify alternative transcript usage
- Discover novel genes/transcripts, gene fusions
- Profile transcriptome
- Ribosome profiling to measure translation



Allele-specific expression: Skelly et al. 2011

[Figure: differentiation stages: Pluripotent Stem Cell, Cardiogenic Mesoderm, Cardiac Precursors, Cardiomyocytes]

Ribosome profiling: more tomorrow!

Ingolia et al. 2009, Weissman Lab

RNA-seq:

- ID novel genes, transcripts, & exons
- Greater dynamic range
- Less bias due to genetic variation
- Repeatable
- No species-specific primer/probe design
- More accurate relative to qPCR
- Many more applications

Microarray:

- Well vetted QC and analysis methods
- Well characterized biases
- Quick turnaround from established core facilities
- Currently less expensive

RNA-seq vs. Affy

RNA-seq vs. Taqman

Marioni et al. 2008

© 2010 NuGen

                Illumina                                            Pac-Bio
Read length     100 bp paired end                                   2500 bp avg
Throughput      200 million read pairs/lane                         1 million reads/SMRT cell
Error rate      <1%                                                 15% total, most are indels, 4% SNP
Cost            $600/sample                                         $7-8k/sample
Accessibility   UCSF, UC-Davis, BGI                                 No commercially available protocols
Uses            DE, ASE, quantifying alternative transcript usage   Characterize transcriptome

When to use Pac-Bio

Plan it well.

- Experimental design
- Biological replicates
- Reference genome?
- Good gene annotation?
- Read depth
- Barcoding
- Read length
- Paired vs. single-end

Technical variation

Biological variation

How much data do we need?

- ~15-20K genes expressed in a tissue or cell line.
- Genes are on average 3 kb.
- For 1x coverage using 100 bp reads, we would need 600K sequence reads (20K genes x 3 kb / 100 bp).
- In reality, we need MUCH higher coverage to accurately estimate gene expression levels.
- 50 million reads per sample
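
The 1x figure above is simple arithmetic and can be checked in a few lines. This is a minimal sketch: the function name `reads_for_coverage` is hypothetical, and the inputs are the slide's round numbers (20K genes, 3 kb average length, 100 bp reads). It assumes every gene is expressed at the same level, which is exactly why real experiments need far more reads.

```python
import math

def reads_for_coverage(n_genes, mean_gene_len_bp, read_len_bp, coverage=1.0):
    """Reads needed to cover the expressed transcriptome `coverage` times.

    Assumes uniform expression across genes (an idealization).
    """
    transcriptome_bp = n_genes * mean_gene_len_bp
    return math.ceil(coverage * transcriptome_bp / read_len_bp)

reads_1x = reads_for_coverage(20_000, 3_000, 100)
print(reads_1x)  # 600000

# Naive average fold coverage if 50 million reads were spread evenly.
print(50_000_000 / reads_1x)
```

Because expression levels span several orders of magnitude, the average fold coverage says little about lowly expressed genes, which is the motivation for sequencing much deeper than 1x.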

200 million reads / lane

Run 4 samples / lane
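
The lane-sharing arithmetic can be written down directly. A small sketch, assuming reads are split evenly across barcodes (in practice the split is never perfectly even); `samples_per_lane` is a hypothetical helper name.

```python
def samples_per_lane(reads_per_lane, target_reads_per_sample):
    """How many barcoded samples fit in one lane at the target depth."""
    return reads_per_lane // target_reads_per_sample

print(samples_per_lane(200_000_000, 50_000_000))  # 4
```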

Unique seq = 4^(read length)

Read length    Unique seq
25             1.1x10^15
50             1.3x10^30
100            1.6x10^60

~60 million coding bases in vertebrate genome
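
The table values follow from four possible bases at each read position. A short sketch that recomputes them; `unique_sequences` and `CODING_BASES` are illustrative names, and the 60 million figure is the slide's approximation.

```python
CODING_BASES = 60_000_000  # slide's approximate figure for a vertebrate genome

def unique_sequences(read_length):
    """Number of distinct DNA sequences of a given length (4 bases per position)."""
    return 4 ** read_length

for read_length in (25, 50, 100):
    possible = unique_sequences(read_length)
    print(f"{read_length:>3} bp: {possible:.1e} possible sequences, "
          f"{possible / CODING_BASES:.1e} per coding base")
```

Even at 25 bp the sequence space is vastly larger than the coding genome; this counting argument ignores repeats and gene families, which are what make short reads ambiguous in practice.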

Paired-end!

- Effectively doubles read length
  - huge impact on read mapping
- Increases number of splice junction spanning reads
- Critical for estimating transcript-level abundance

The wet lab side…briefly

How do you make sense of this pile of data?

- QC
- Alignment
- Expt: Compare two groups
- Transcript Assignment & Abundance
- Differential Expression
- Expt: Allele-specific expression

QC

- FastQC - http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- Proportion of reads that mapped uniquely
- Remove duplicates; likely due to PCR amplification
- Assess ribosomal RNA content
- Assess content of possible contaminants
  - human RNA (if not human samples), Mycoplasma (if cell lines)
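
Two of the checks above, the uniquely mapped proportion and duplicate removal, can be sketched on toy data. The `Alignment` record below is a hypothetical, simplified structure with one record per aligned read; real pipelines read SAM/BAM files and use dedicated tools. Treating reads with the same start position and strand as duplicates is an assumption of this sketch.

```python
from collections import namedtuple

# Hypothetical, simplified alignment record (one per aligned read).
Alignment = namedtuple("Alignment", "read_id chrom pos strand n_hits")

def unique_fraction(alignments, total_reads):
    """Fraction of all sequenced reads that aligned to exactly one location."""
    unique = sum(1 for a in alignments if a.n_hits == 1)
    return unique / total_reads

def remove_duplicates(alignments):
    """Keep the first read seen at each (chrom, pos, strand) start position."""
    seen = set()
    kept = []
    for a in alignments:
        key = (a.chrom, a.pos, a.strand)
        if key not in seen:
            seen.add(key)
            kept.append(a)
    return kept

alignments = [
    Alignment("r1", "chr1", 100, "+", 1),
    Alignment("r2", "chr1", 100, "+", 1),  # same start as r1: possible PCR duplicate
    Alignment("r3", "chr1", 250, "-", 1),
    Alignment("r4", "chr2", 40, "+", 3),   # multi-mapped read
]

# Five reads were sequenced; one did not align at all.
print(unique_fraction(alignments, total_reads=5))  # 0.6
print([a.read_id for a in remove_duplicates(alignments)])  # ['r1', 'r3', 'r4']
```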



Then what?

- Align reads to the genome
  - Easy(ish) for genomic sequence
  - Difficult for transcripts with splice junctions


Alignment Algorithms

- Burrows-Wheeler Transform
  - Bowtie (Langmead et al. 2009)
  - BWA (Li and Durbin 2009)
  - SOAP2 (Li et al. 2009)
- Smith-Waterman
  - BFAST (Homer et al. 2009, based on BLAT): multiple indexes, finds candidate alignment locations using seed and extend, followed by a gapped Smith-Waterman local alignment for each candidate

http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
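
For readers who have not seen the Burrows-Wheeler transform, here is a toy version built from sorted rotations, with its inverse. This is a sketch for intuition only: it is quadratic in time and memory, and it is not how Bowtie, BWA, or SOAP2 are implemented (they build compressed indexes on top of the transform). The `$` terminator is a conventional choice assumed here.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (toy sizes only)."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(transformed, terminator="$"):
    """Rebuild the text by repeatedly prepending the transform and sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]

encoded = bwt("GATTACA")
print(encoded)
print(inverse_bwt(encoded))  # GATTACA
```

The transform is reversible and tends to group identical characters together, which is what makes it useful for compact, searchable genome indexes.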

Alignment tools for splice junction mapping

- Tophat
- MapSplice
- SpliceMap
- HMMsplicer

Tophat

- Map reads to transcriptome using Bowtie
- Map to genome to discover novel exons
  - or start here if no annotation available
- Split reads to smaller segments; map to genome to discover novel splice junctions
- Report best alignment for each read

Trapnell et al. Bioinformatics 2009; Trapnell et al. Nature Protocols 2012
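
The split-read idea (map short segments of a read separately, then look for a gap between them) can be illustrated on a toy genome. This is a simplified sketch, not Tophat's actual algorithm: it uses exact string matching, takes the first hit for each segment, and assumes a single junction per read. The function name and the sequences are invented for illustration.

```python
def find_junction(genome, read, seg_len):
    """Toy split-read search for one splice junction.

    Maps the first and last `seg_len` bases of the read by exact match,
    then extends from the left anchor to locate the splice point.
    Returns (intron_start, intron_end) as a 0-based half-open interval,
    or None if the read maps contiguously or cannot be placed.
    """
    left = genome.find(read[:seg_len])
    right = genome.find(read[-seg_len:])
    if left == -1 or right == -1:
        return None
    gap = (right + seg_len) - (left + len(read))  # implied intron length
    if gap <= 0:
        return None  # contiguous alignment, no junction needed
    # Extend the left anchor until the read and genome disagree.
    i = seg_len
    while i < len(read) - seg_len and genome[left + i] == read[i]:
        i += 1
    # The rest of the read must match on the far side of the gap.
    if genome[left + i + gap:right + seg_len] != read[i:]:
        return None
    return left + i, left + i + gap

exon1 = "ACGTACGTTGCA"
intron = "GTAAGTCCCCCTTTTTAG"
exon2 = "CATGGCTAGCTA"
genome = "TTTT" + exon1 + intron + exon2 + "GGGG"
read = exon1[-8:] + exon2[:8]  # a read spanning the exon1/exon2 junction

print(find_junction(genome, read, seg_len=6))  # (16, 34)
```

In this toy case `genome[16:34]` is exactly the intron. Real aligners must also handle mismatches, repeated segments, and ambiguity when intron bases happen to match the read.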

MapSplice & SpliceMap

Wang et al. NAR 2010, Au et al. NAR 2010

- Tag alignment (user chooses aligner)
- Break reads into segments
- Map reads
- Unmapped segments considered for splice junction mapping based on location of partner segment
- Merge segments from read for final alignment
- Assess splice junction quality

HMMsplicer

- Remove reads that map contiguously
- Hidden Markov model to detect exon boundary of remaining reads
- Compute intensive
- Reference annotation not used
- Best for compact genomes
- User sets threshold for accepting splice junction.

Dimon et al. PLoS One 2010

HMMsplicer

Martin & Wang, Nature Reviews Genetics 2011

Transcript Assignment/Abundance

Transcript Assignment and/or Abundance Tools

- For DE:
  - Cufflinks
  - MISO
  - Scripture (not maintained)
- De novo assembly
  - Cufflinks
  - Trans-ABySS
  - Trinity
  - Maker

Cufflinks

- Constructs the parsimonious set of transcripts that explain the reads observed. Basically, finds a minimum path cover on the DAG.
- Derives a likelihood for the abundances of a set of transcripts given a set of fragments.
- FPKM: fragments per kb of exon per million fragments mapped.

Trapnell, Pachter
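
The FPKM unit itself is just a double normalization, by transcript length and by sequencing depth. The sketch below only shows that unit conversion for a known fragment count; it does not reproduce Cufflinks' likelihood-based assignment of fragments to isoforms. The function name and the example numbers are hypothetical.

```python
def fpkm(fragments, transcript_len_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million fragments mapped."""
    kilobases = transcript_len_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / kilobases / millions

# 600 fragments on a 3 kb transcript in a library of 50 million mapped fragments.
print(fpkm(fragments=600, transcript_len_bp=3_000, total_mapped_fragments=50_000_000))  # 4.0
```

Dividing by length makes long and short transcripts comparable within a sample; dividing by library size makes samples of different depth comparable.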

MISO

- Mixture of Isoforms
- Bayesian: treats expression level of set of isoforms as random variable and estimates a distribution over the values of this variable.
- Gives confidence intervals for expression estimates and measures of DE as Bayes factors

Burge Lab @ MIT

Bias Correction and Normalization

- Random hexamer bias (Hansen et al. 2010)
  - From PCR or RT primers
  - Reestimate FPKM or read counts based on bias
- Upper quartile normalization (Bullard et al. 2010)
  - excellent resource for comparison to qPCR and microarray as well as methods of normalization of RNA-seq data
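
Upper quartile normalization can be sketched as scaling each sample by the 75th percentile of its counts. The version below is a simplification and an assumption of this sketch: it takes the quartile over each sample's own nonzero counts and rescales the factors to average 1, which may differ in detail from the procedure in Bullard et al. The function names and toy counts are invented.

```python
from statistics import quantiles

def upper_quartile_factors(counts_by_sample):
    """Per-sample upper quartile of nonzero counts, rescaled to average 1."""
    uq = {}
    for sample, counts in counts_by_sample.items():
        nonzero = [c for c in counts if c > 0]
        uq[sample] = quantiles(nonzero, n=4, method="inclusive")[2]
    mean_uq = sum(uq.values()) / len(uq)
    return {sample: value / mean_uq for sample, value in uq.items()}

def normalize(counts_by_sample):
    """Divide every count by its sample's scaling factor."""
    factors = upper_quartile_factors(counts_by_sample)
    return {s: [c / factors[s] for c in counts]
            for s, counts in counts_by_sample.items()}

# Toy counts: sample B was sequenced twice as deeply as sample A.
counts = {"A": [0, 10, 20, 30, 40], "B": [0, 20, 40, 60, 80]}
print(normalize(counts))
```

After scaling, the two toy samples have matching profiles, so any remaining differences between real samples would reflect expression rather than depth. Using the upper quartile instead of the total count makes the factor less sensitive to a handful of very highly expressed genes.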

Differential Expression

- Goal: determine whether observed difference in read counts is greater than would be expected due to random variation.
- If reads were independently sampled from the population, read counts would follow a multinomial distribution, approximated by the Poisson.

Differential Expression

- BUT! We know that the count data show more variance than expected.
- Overdispersion problem mitigated by using the negative binomial distribution, which is determined by mean and variance.

Sample j, gene i
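
Overdispersion is easy to see on replicate counts for a single gene: under a Poisson model the variance equals the mean, so a much larger sample variance signals extra variation. The sketch below assumes the common parameterization var = mu + alpha * mu^2 for the negative binomial (the slide's own formula is not recoverable here) and uses a simple method-of-moments estimate. The function name and the counts are hypothetical.

```python
from statistics import mean, variance

def dispersion_estimate(counts):
    """Method-of-moments dispersion under var = mu + alpha * mu**2.

    Returns 0.0 when the sample variance does not exceed the mean,
    i.e. when a Poisson model (var = mu) already fits.
    """
    mu = mean(counts)
    var = variance(counts)  # sample variance across replicates
    return max(0.0, (var - mu) / mu ** 2)

replicate_counts = [80, 100, 120, 140, 60]  # hypothetical counts for one gene
print(mean(replicate_counts), variance(replicate_counts))
print(dispersion_estimate(replicate_counts))
```

Here the mean is 100 but the sample variance is 1000, ten times what a Poisson model would predict. With only a few replicates per gene such per-gene estimates are noisy, which is why the tools on the next slide share information across genes.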

Differential Expression

- Binomial test
  - Old Cuffdiff
- Negative binomial
  - DESeq: estimate variance using all genes with similar expression levels
  - Cuffdiff: similar to DESeq, but incorporates fragment assignment uncertainty simultaneously
  - edgeR: moderate variance over all genes
- T-test

Differential Expression

Old Cuffdiff

Some biology, finally?

- How have gene expression patterns changed during the course of differentiation?
- Which genes are specific to certain cell types?
- What can we learn about what those co-expressed genes do?


Clusters of co-expressed genes

- Use unsupervised clustering to group genes by expression pattern
- Use gene ontology information to determine which kinds of genes are in each group
- Reveal novel associations and gene types
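
The slide does not say which clustering method was used, so the sketch below uses plain k-means as one possible example of unsupervised clustering. Gene names and expression values are invented placeholders for profiles across four differentiation stages; the starting centroids are supplied explicitly to keep the run deterministic.

```python
def kmeans(profiles, init_centers, n_iter=20):
    """Plain k-means on expression profiles (lists of equal length).

    `init_centers` are the starting centroids; returns one cluster
    index per profile.
    """
    centers = [list(c) for c in init_centers]
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centers)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centers[k])))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(profiles, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical expression across four stages (stem cell to cardiomyocyte).
genes = {
    "early_1": [9, 3, 1, 0],
    "early_2": [8, 4, 1, 1],
    "late_1": [0, 1, 5, 9],
    "late_2": [1, 1, 6, 8],
}
profiles = list(genes.values())
labels = kmeans(profiles, init_centers=[profiles[0], profiles[2]])
print(dict(zip(genes, labels)))  # {'early_1': 0, 'early_2': 0, 'late_1': 1, 'late_2': 1}
```

Genes that switch off early land in one cluster and genes that switch on late land in the other; each cluster's gene list could then be checked against gene ontology annotations.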

Clusters of co-expressed genes

Pluripotency/stem cell: Nanog, Oct4

Mesoderm/cell fate commitment: Mesp1, Eomes

Cardiac precursors: Isl1, Mef2c, Wnt2

Cardiac structure/function: Actc1, Ryr2, Tnni3

Thanks for listening!

Alisha Holloway

Gladstone Institutes

Bioinformatics Core


alisha.holloway@gladstone.ucsf.edu