Bioinformatic discovery of microRNA precursors from human ESTs ...


SUNG-CHOU LI, CHAO-YU PAN, WEN-CHANG LIN

2006

Bioinformatic discovery of microRNA precursors from human ESTs and introns

Outline

  Introduction
  Scanning method
  Result
  Specificity assessment
  Discussion
  Conclusion
  References

Introduction

Intron
  DNA region that is not translated into protein
  Present in pre-mRNA and other RNAs
  Removed by splicing
  Introns contain 60 known pre-miRNAs (release 5)

Introduction

Expressed Sequence Tag (EST)
  Used for gene transcript identification, gene discovery and gene sequence determination
  ESTs contain 26 known pre-miRNAs (release 5)

Introduction

microRNA (miRNA)
  Non-protein-coding RNA
  Down-regulates gene expression
  Implicated in disease progression

Scanning Method

Direct cloning
  Traditional biological method for miRNA discovery
  Limited by the need for RNA starting material
  A few highly expressed miRNAs make up most of the cloned product
  miRNAs are unstable and easily degraded

Scanning Method

Bioinformatic discovery
  [Pipeline diagram: Srnaloop, Sequence & Structural features filter, RefSeq filter, Conservation examination]

Srnaloop

Srnaloop is a BLAST-like algorithm that looks for short complementary words within a specified distance.

It uses dynamic programming to determine a complete alignment. Compared to BLAST, srnaloop supports shorter word lengths and aligns complementary base pairs.

Result: 1,350,168 candidate hairpins (359,360 from ESTs, 990,808 from introns).
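The word-search step described on the Srnaloop slide can be sketched as follows. This is a minimal illustration of the idea only (finding short words whose reverse complement occurs downstream within a given distance); it is not srnaloop's code, it omits the dynamic-programming alignment, and the function names and the `word_len` and `max_distance` values are assumptions chosen for the example.

```python
# Watson-Crick pairs only; G-U wobble pairs are ignored in this sketch.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(word):
    """Return the reverse complement of an RNA word."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def find_complementary_words(seq, word_len=4, max_distance=60):
    """Return (i, j) start positions where the word at j can pair with the word at i.

    word_len and max_distance are illustrative, not srnaloop defaults.
    """
    seq = seq.upper().replace("T", "U")
    hits = []
    for i in range(len(seq) - word_len + 1):
        word = seq[i:i + word_len]
        if any(base not in COMPLEMENT for base in word):
            continue  # skip words containing ambiguous bases such as N
        target = reverse_complement(word)
        # Search downstream only, starting after the first word ends,
        # and no further than max_distance from position i.
        start = i + word_len
        stop = min(len(seq) - word_len, i + max_distance)
        for j in range(start, stop + 1):
            if seq[j:j + word_len] == target:
                hits.append((i, j))
    return hits


hits = find_complementary_words("GGGGAAAACCCC")
```

In a full hairpin finder, each seed pair found this way would then be extended into a complete stem alignment.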

Sequence & Structural features filter

GC content
  Different types of genomic sequence have different ranges of GC content, and pre-miRNAs are expected to have a characteristic range as well.

Core mfe and hairpin mfe (Minimum Free Energy)
  A hairpin structure results from intramolecular base pairing through hydrogen bonding. Different pairing patterns result in distinct stabilities and minimum free energies.
  A greater number of paired bases within a hairpin implies greater stability and a lower mfe.
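As a rough illustration of the two features on this slide, the sketch below computes GC content and counts paired bases. The `low` and `high` bounds are placeholders, not the thresholds used in the study, and the dot-bracket structure string is an assumed input format (the slides do not say how structures were represented).

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_gc_filter(seq, low=0.30, high=0.70):
    """Keep a candidate whose GC content lies in [low, high].

    low and high are placeholder bounds, not values from the study.
    """
    return low <= gc_content(seq) <= high


def paired_bases(dot_bracket):
    """Number of paired positions in a dot-bracket structure string.

    Assumes '(' and ')' mark paired bases and '.' marks unpaired ones.
    """
    return dot_bracket.count("(") + dot_bracket.count(")")


example_gc = gc_content("GGCCAAUU")
example_pairs = paired_bases("((((....))))")
```

The mfe values themselves would come from an RNA folding program; they are not computed here.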

Sequence & Structural features filter

ch_ratio
  ch_ratio = core mfe / hairpin mfe; ch_ratio has a fixed distribution range.

After this filter: 113,484 candidate hairpins from introns and 67,215 from ESTs.
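A minimal sketch of a ch_ratio filter, assuming both energies are given in the same unit (typically kcal/mol). The acceptance range `low` to `high` is hypothetical; the slide states only that ch_ratio falls in a fixed range, not what that range is.

```python
def ch_ratio(core_mfe, hairpin_mfe):
    """Ratio of the core mfe to the whole-hairpin mfe."""
    if hairpin_mfe == 0:
        raise ValueError("hairpin mfe must be non-zero")
    return core_mfe / hairpin_mfe


def passes_ch_ratio(core_mfe, hairpin_mfe, low=0.5, high=1.0):
    """Keep a candidate whose ch_ratio lies in [low, high].

    low and high are hypothetical bounds, not values from the study.
    """
    return low <= ch_ratio(core_mfe, hairpin_mfe) <= high


ratio = ch_ratio(-30.0, -40.0)
```

Because both energies are negative for a stable hairpin, the ratio is positive and grows as the core accounts for more of the total folding energy.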

RefSeq filter

The Reference Sequence (RefSeq) database is a non-redundant collection of richly annotated DNA, RNA, and protein sequences.

RefSeq biological sequences are derived from GenBank records but differ in that each RefSeq is a synthesis of information.

NM accession numbers are used as the tag to extract protein-coding mRNA sequences; candidates matching these sequences are removed.

After this filter: 66,109 candidates from ESTs and 113,124 from introns.
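The removal step can be sketched as below, assuming the sequence search has already been run and its results are available as a mapping from candidate ID to matched RefSeq accessions. That mapping, the function name and the accession strings are illustrative assumptions; only the use of the NM prefix comes from the slide.

```python
def remove_mrna_matches(candidates, hits):
    """Drop candidates that match any RefSeq mRNA (accession prefix NM_).

    candidates: iterable of candidate IDs.
    hits: dict mapping candidate ID -> list of matched RefSeq accessions
          (an assumed data layout for this sketch).
    """
    kept = []
    for cand in candidates:
        accessions = hits.get(cand, [])
        if any(acc.startswith("NM_") for acc in accessions):
            continue  # matches a protein-coding mRNA, so it is removed
        kept.append(cand)
    return kept


# Illustrative IDs and accession strings only.
example_hits = {"hp1": ["NM_000001.1"], "hp2": ["NR_000002.1"]}
kept = remove_mrna_matches(["hp1", "hp2", "hp3"], example_hits)
```

Candidates with no hits, or with hits only to non-NM records, pass through unchanged.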

Conservation Examination

miRNAs have been conserved among phylogenetically close species during evolution.

Putative pre-miRNAs were searched against other published mammalian genomic sequences, namely mouse, rat and dog.

Conserved hairpin definition: the putative miRNA of a hairpin has a contiguous fragment of >= 20 nt that is identical to a subject sequence.
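The conserved-hairpin definition above amounts to a shared exact substring of at least 20 nt. A minimal sketch follows, using a set of fixed-length words rather than an alignment tool; a real pipeline would search whole genomes with a dedicated search program, and the function names here are assumptions.

```python
def has_conserved_fragment(query, subject, min_len=20):
    """True if query and subject share an identical contiguous fragment
    of at least min_len nucleotides.

    Any shared fragment of min_len or more contains a shared word of
    exactly min_len, so comparing min_len-mers is sufficient.
    """
    query, subject = query.upper(), subject.upper()
    if len(query) < min_len or len(subject) < min_len:
        return False
    subject_words = {
        subject[i:i + min_len] for i in range(len(subject) - min_len + 1)
    }
    return any(
        query[i:i + min_len] in subject_words
        for i in range(len(query) - min_len + 1)
    )


def conserved_in_all(query, subjects, min_len=20):
    """True if the query has a conserved fragment in every subject sequence."""
    return all(has_conserved_fragment(query, s, min_len) for s in subjects)


mirna = "UGAGGUAGUAGGUUGUAUAGUU"
same = has_conserved_fragment(mirna, "AAAC" + mirna + "GGG")
```

A single mismatch in the middle of a 22-nt miRNA is enough to fail the 20-nt criterion, which makes the definition fairly strict.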

Conservation Examination

HMDR denotes the pre-miRNA candidates conserved in all four genomes (human, mouse, dog and rat).

After applying the final conservation filters, there were 208 qualified candidate hairpins in the HMDR set, 52 of which were known pre-miRNAs, resulting in 60.5% (52/86) sensitivity and 25.0% (52/208) specificity.
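The two percentages on this slide can be reproduced directly from the counts. In this sketch the denominator 86 is taken to be the 60 intronic plus 26 EST known pre-miRNAs from the introduction, and "specificity" follows the slide's usage (known pre-miRNAs divided by all predicted candidates), a ratio more commonly called precision. The function name is an assumption.

```python
def hmdr_metrics(known_recovered, known_in_input, candidates):
    """Return (sensitivity, specificity) as fractions.

    sensitivity = recovered known pre-miRNAs / known pre-miRNAs in the input
    specificity = recovered known pre-miRNAs / all predicted candidates
                  (the slide's definition)
    """
    sensitivity = known_recovered / known_in_input
    specificity = known_recovered / candidates
    return sensitivity, specificity


known_in_input = 60 + 26  # known pre-miRNAs in introns and ESTs
sens, spec = hmdr_metrics(52, known_in_input, 208)
print(f"sensitivity {sens:.1%}, specificity {spec:.1%}")
```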

Specificity assessment (1)

The 130 newly added (release 8) pre-miRNA sequences were used as a validation test dataset.

The initial hairpin-finding procedure detected 116 of the 130 input pre-miRNAs. Applying the Sequence & Structural features filter then gave 85% sensitivity (previous sensitivity: 86.5%).

Specificity assessment (2)

This procedure is based on the fact that the fraction of miRNA-encoding sequences in the human genome is very small; therefore, randomly extracted sequences are extremely unlikely to code for miRNA.

99,600 sequence fragments were randomly extracted from intronic sequences (33,200 fragments), ESTs (33,200 fragments) and genomic sequences (33,200 fragments).

The 332 known pre-miRNA sequences and the 99,600 random sequences (11 Mbp in total) were run through the discovery pipeline under the same hairpin-finding parameters.
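Random extraction of fixed-length fragments can be sketched as below. The slides do not state the fragment length; the default of 110 nt is only an inference from 11 Mbp spread over 99,600 fragments, and the function name and seed handling are assumptions made so the example is reproducible.

```python
import random


def sample_fragments(sequence, n_fragments, fragment_len=110, seed=0):
    """Draw n_fragments random substrings of length fragment_len.

    fragment_len=110 is inferred (11 Mbp / 99,600 fragments), not stated
    in the source. Fragments may overlap one another.
    """
    max_start = len(sequence) - fragment_len
    if max_start < 0:
        raise ValueError("sequence is shorter than fragment_len")
    rng = random.Random(seed)  # fixed seed for reproducibility
    starts = [rng.randint(0, max_start) for _ in range(n_fragments)]
    return [sequence[s:s + fragment_len] for s in starts]


# Toy source sequence standing in for an intron, EST or genomic region.
source = "ACGU" * 100
fragments = sample_fragments(source, n_fragments=5)
```

In the assessment described above, the same sampling would be applied separately to each of the three sequence sources.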

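Specificity assessment (2)

[Flow diagram: ESTs, genome sequences and introns (1,440M) are sampled into 33,200 fragments each (11M) and passed through the scanning method, giving 5 outputs (2, 1, 2). The 332 known pre-miRNA sequences are passed through the same scanning method, giving 210 outputs.]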

Specificity assessment (2)

Of the 332 known pre-miRNAs, 210 survived the discovery pipeline and were counted as true positives.

The random sequences gave 5 false positives over three independent experiments (2, 1 and 2 predicted candidates respectively), corresponding to an average of 1.67 false positives per 11 Mbp. Because the initial input sequences are about 1,440 Mbp in length, a dataset of that size could theoretically generate 212 false positive candidates.

Specificity = TP / (TP + FP) = 210 / (210 + 212) = 49.7%, where TP denotes true positives and FP denotes false positives.
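The extrapolation on this slide is a proportional scaling followed by a ratio, sketched below with hypothetical function names. With the rounded figures quoted on the slide, the plain scaling gives about 218 false positives rather than 212, and 210 / (210 + 212) evaluates to roughly 49.76%, so the slide values presumably rest on unrounded sequence lengths and truncation.

```python
def extrapolate_false_positives(fp_counts, sampled_mbp, total_mbp):
    """Scale the mean false-positive count from the sampled size to the full size."""
    mean_fp = sum(fp_counts) / len(fp_counts)
    return mean_fp * total_mbp / sampled_mbp


def positive_fraction(tp, fp):
    """TP / (TP + FP), the quantity the slide calls specificity."""
    return tp / (tp + fp)


# Figures as quoted on the slide (rounded).
expected_fp = extrapolate_false_positives([2, 1, 2], sampled_mbp=11, total_mbp=1440)
reported = positive_fraction(210, 212)        # using the slide's FP estimate
recomputed = positive_fraction(210, expected_fp)  # using the scaled estimate
```

Either way the estimate stays close to one half, so the conclusion drawn on the slide does not depend on which false-positive figure is used.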

Discussion

Intronic and EST sequences were used as raw data.

Advantage: most ESTs and introns are well annotated, making it easy to acquire information about their expression patterns and levels.

Disadvantage: all of the 207 known pre-miRNAs should have matches when searched against the human genome; in our data, however, only 60 and 26 pre-miRNAs matched introns and ESTs, respectively, implying only 41.5% [(60 + 26)/207] coverage.

Conclusion

This paper developed a new scanning method that uses criteria based on the features of 207 known pre-miRNAs to predict miRNAs from expressed sequences (ESTs and introns). It achieves good sensitivity and specificity compared with other published work.

References

Lai EC, Tomancak P, Williams RW, Rubin GM: Computational identification of Drosophila microRNA genes. Genome Biol 2003.

Grad Y, Aach J, Hayes GD, Reinhart BJ, Church GM, Ruvkun G, Kim J: Computational and experimental identification of C. elegans microRNAs. Mol Cell 2003.

Lee Y, Tsai J, Sunkara S, Karamycheva S, Pertea G, Sultana R, Antonescu V, Chan A, Cheung F, Quackenbush J: The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes.