ppt - University of Connecticut



RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes

Serghei Mangul*, Adrian Caciula*, Ion Mandoiu** and Alexander Zelikovsky*

*Georgia State University, **University of Connecticut

Expectation Maximization (EM)

Maximum Likelihood (ML) Model

Introduction


Alternative Splicing

Simulation Setup:
- Human genome data (UCSC hg18)
- UCSC database: 66,803 isoforms, 19,372 genes
- Single error-free reads: 60M of length 100bp
- Partially annotated genome: exactly one isoform is removed from every gene
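The construction of the partially annotated genome can be sketched as follows. This is a minimal illustration, not the authors' script: the poster does not say how the removed isoform is chosen, so the uniformly random choice, the function name `make_partial_annotation`, and the toy gene and isoform ids are all assumptions.

```python
import random


def make_partial_annotation(gene_to_isoforms, seed=0):
    """Hide exactly one isoform of every gene.

    gene_to_isoforms: dict mapping a gene id to a list of isoform ids.
    Returns (kept, removed): kept maps each gene to its remaining
    isoforms, removed maps each gene to the single hidden isoform.
    The random choice is an assumption; the poster only says that
    exactly one isoform per gene is removed.
    """
    rng = random.Random(seed)
    kept, removed = {}, {}
    for gene, isoforms in gene_to_isoforms.items():
        hidden = rng.choice(isoforms)
        removed[gene] = hidden
        kept[gene] = [t for t in isoforms if t != hidden]
    return kept, removed


# Toy annotation (hypothetical ids).
annotation = {"geneA": ["A.1", "A.2", "A.3"], "geneB": ["B.1", "B.2"]}
kept, removed = make_partial_annotation(annotation)
```

Under this sketch a gene with a single isoform ends up with no annotated transcript at all; whether such genes were treated that way in the simulation is not stated.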



Fig. 9(a) shows that it is more difficult to correctly reconstruct all transcripts in genes with more transcripts. As a result, Cufflinks performs better on genes with few transcripts, since annotations are not used in its standard settings. DRUT has higher sensitivity on genes with 2 and 3 transcripts, but RABT is better on genes with 4 transcripts. For genes with more than 4 transcripts, the performance of annotation-guided methods is equal to the "existing annotations ratio", which means that these methods are unable to reconstruct the unannotated transcript.
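For readers unfamiliar with the two measures reported in Fig. 9, the sketch below computes sensitivity and PPV for one gene using the standard definitions (true positives over true transcripts, and true positives over predicted transcripts). The poster does not state its matching criterion, so treating a prediction as correct only when it equals a true transcript exactly is an assumption, as are the function name and the toy ids.

```python
def sensitivity_and_ppv(true_transcripts, predicted_transcripts):
    """Return (sensitivity, PPV) for sets of transcript identifiers.

    A predicted transcript counts as a true positive only if it is
    identical to a true transcript (an assumed matching criterion).
    """
    true_set = set(true_transcripts)
    pred_set = set(predicted_transcripts)
    tp = len(true_set & pred_set)
    sensitivity = tp / len(true_set) if true_set else 0.0
    ppv = tp / len(pred_set) if pred_set else 0.0
    return sensitivity, ppv


# Toy gene with 4 true transcripts; the method reports 3, of which 2 are correct.
truth = ["t1", "t2", "t3", "t4"]
predicted = ["t1", "t2", "t5"]
sens, ppv = sensitivity_and_ppv(truth, predicted)
```

In this toy case sensitivity is 2/4 and PPV is 2/3.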



Genes, Exons, Introns, and Splicing

Fig. 4. Transcripts - Exons - Reads Relation.

References

1. S. Mangul, I. Astrovskaya, M. Nicolae, B. Tork, I. Mandoiu, and A. Zelikovsky, “Maximum likelihood estimation of incomplete genomic spectrum from HTS data,” in Proc. 11th Workshop on Algorithms in Bioinformatics, 2011.

2. C. Trapnell, B. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. van Baren, S. Salzberg, B. Wold, and L. Pachter, “Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation,” Nature Biotechnology, vol. 28, no. 5, pp. 511-515, 2010.


Fig. 9. a) Sensitivity and PPV of the methods grouped by the number of transcripts per gene. Here, 60M single reads of length 100bp are simulated.

* Cufflinks is a well-known tool for transcriptome reconstruction [2].

Discovery and Reconstruction of Unannotated Transcripts

Virtual Transcript Expectation Maximization (VTEM)

Experimental Results

Fig. 3. Panel: a bipartite graph consisting of transcripts with unknown frequencies and reads with observed frequencies (o_j)


Gene - a segment of DNA or RNA that carries genetic information.

Exon - a region of a gene which is translated into protein.

Intron - a region of a gene which is not translated into protein.

Splicing - a process in which the introns are removed and the exons are joined to be translated into a single protein.

Alternative splicing - the process in which exons can be spliced out in different combinations, named transcripts, to generate the mature RNA.



Alternative splicing is a common mode of gene regulation within cells, being used by 90-95% of human genes.

It can drastically alter the function of a gene in different tissue types or environmental conditions, or even inactivate the gene completely.

Alternative splicing is implicated in many diseases.

Input data of EM is a panel, a bipartite graph given by:
- a set of candidate transcripts that are believed to emit the set of reads
- a weighted match based on the mapping of read i to transcript j (h_{Tj,i})


FIND: ML estimate of transcript frequencies

SUBPROBLEMS:
- Decide if the panel is likely to be incomplete
- Estimate total frequency of missing transcripts
- Identify read spectrum emitted by missing transcripts
- Assemble missing transcripts from read spectrum emitted by missing transcripts


ML Estimates of Transcript Frequencies

- The probability that a read is sampled from transcript j is proportional to f(j)
- f(j) is the (unknown) frequency of transcript j
- The ML estimate for f(j) is given by n(j)/(n(1) + . . . + n(N))
- n(j) denotes the number of reads sampled from transcript j


INITIALIZATION: Uniform transcript frequencies f(j)

E STEP: Compute the expected number n(j) of reads sampled from transcript j (assuming the current transcript frequencies f(j))

M STEP: For each transcript j, set f(j) = portion of reads emitted by transcript j among all reads in the sample
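The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the panel is stored as a dense matrix `weights[j][i]` holding the match weight between transcript j and read i (zero when the read does not map to the transcript), reads are given as observed counts, and a fixed iteration count replaces a convergence test. The function name and the toy panel are hypothetical.

```python
def em_transcript_frequencies(weights, read_counts, n_iter=200):
    """Estimate transcript frequencies f(j) by EM.

    weights[j][i]: weight of the match between transcript j and read i
                   (0.0 if read i does not map to transcript j).
    read_counts[i]: observed count of read i.
    """
    n_transcripts = len(weights)
    n_reads = len(read_counts)

    # INITIALIZATION: uniform transcript frequencies.
    f = [1.0 / n_transcripts] * n_transcripts

    for _ in range(n_iter):
        # E STEP: expected number n(j) of reads sampled from each
        # transcript, given the current frequencies.
        n = [0.0] * n_transcripts
        for i in range(n_reads):
            denom = sum(f[j] * weights[j][i] for j in range(n_transcripts))
            if denom == 0.0:
                continue  # read is not explained by any candidate transcript
            for j in range(n_transcripts):
                n[j] += read_counts[i] * f[j] * weights[j][i] / denom

        # M STEP: f(j) = n(j) / (n(1) + ... + n(N)).
        assigned = sum(n)
        if assigned == 0.0:
            break
        f = [nj / assigned for nj in n]
    return f


# Toy panel: read 0 maps only to transcript 0, read 2 only to
# transcript 1, and read 1 maps to both.
weights = [[0.5, 0.5, 0.0],
           [0.0, 0.5, 0.5]]
read_counts = [30, 40, 10]
freqs = em_transcript_frequencies(weights, read_counts)
```

For this toy panel the estimates approach f(0) = 0.75 and f(1) = 0.25: the 40 shared reads are split in proportion to the frequencies implied by the uniquely mapped reads.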

Quality of ML Model

The possible gaps in the ML model include:
- erroneous reads caused by genotyping errors
- missing and/or chimeric candidate transcripts
- an inaccurate read-to-transcript match (caused by genotyping errors)
- non-uniform emission of reads by transcripts

Measure the quality of the ML model by the deviation D of observed read frequencies (o_j) from expected read frequencies (e_j).

Expected read frequencies (e_j) are calculated based on:
- the weighted match between reads and transcripts
- the maximum likelihood estimates of transcript frequencies
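The formula for D did not survive in this text, so the sketch below shows one plausible reading rather than the poster's exact definition. It assumes that the expected frequency of read i is the sum over transcripts j of f(j) times the weight of the match, normalized over all reads of transcript j, and that D is the mean absolute difference between observed and expected read frequencies, averaged over the number of reads |R|. Reads are indexed by i here, and all names and the toy panel are hypothetical.

```python
def expected_read_frequencies(weights, f):
    """e[i] = sum_j f[j] * weights[j][i] / sum_l weights[j][l] (assumed form)."""
    n_reads = len(weights[0])
    e = [0.0] * n_reads
    for j, row in enumerate(weights):
        row_sum = sum(row)
        if row_sum == 0.0:
            continue  # transcript emits no read in this panel
        for i, h in enumerate(row):
            e[i] += f[j] * h / row_sum
    return e


def deviation(observed, expected):
    """Mean absolute deviation of observed from expected read frequencies."""
    return sum(abs(o - e) for o, e in zip(observed, expected)) / len(observed)


# Observed read frequencies (counts 30, 40, 10 out of 80 reads).
observed = [0.375, 0.5, 0.125]

# Complete panel with its ML frequencies: the model explains the reads.
full_panel = [[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]]
d_full = deviation(observed, expected_read_frequencies(full_panel, [0.75, 0.25]))

# Incomplete panel: the second transcript is missing from the annotation.
partial_panel = [[0.5, 0.5, 0.0]]
d_partial = deviation(observed, expected_read_frequencies(partial_panel, [1.0]))
```

In this toy case the complete panel gives a deviation of zero, while dropping a transcript raises it, which is the kind of signal the first subproblem (deciding whether the panel is incomplete) relies on.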

Fig. 2. Alternative Splicing Process

Fig. 1. Chromosome with its DNA

|R| is the number of reads


Fig. 4 shows the relation between transcripts, exons and reads:
- LEFT: transcripts -> unknown frequencies
- RIGHT: reads -> observed frequencies
- EDGES: weights ~ probability of the read to be emitted by the transcript

ML Problem:

GIVEN: Annotations (transcripts) and frequencies of the reads.


a) Map reads to annotated transcripts (using Bowtie)
b) VTEM: identify overexpressed exons (possibly from unannotated transcripts)
c) Assemble transcripts (e.g., Cufflinks) using reads from overexpressed exons and unmapped reads
d) Output: annotated transcripts + novel transcripts


DRUT (Discovery and Reconstruction of Unannotated Transcripts):

GIVEN: A set of transcripts and frequencies for the reads.

FIND: Transcripts missing from the set.

Fig. 8. An example of VTEM estimation

Fig. 7. VTEM