Chapter 3 Ying Xu

Oct 20, 2013


DNA sequence of a genome encodes the
,


Millions (microbes) to billions (human),

What information is encoded in a continuous string of A’s, C’s, G’s and T’s?

Where is it located?


What information is directly ?

How should the directly identified information be presented?


Two approaches,


1. Ab initio approach,


Ab initio -> predicts functional elements by statistical features and is used to identify novel functional elements,

-> sequence similarity to previously known ones.



Single largest set of functional elements in a genome consists of genes,


75-90% of a microbial genome consists of gene-coding regions,


Sequence fragment between two stop codons of the same reading frame is called an open reading frame (ORF),



Ab initio prediction - based on di-codons, or six-mers,


E.g., a di-codon can occur far more often in noncoding regions than in coding regions in Shewanella oneidensis,


4,096 different di-codons in a genome (4^6 = 4,096),



Total numbers of occurrences of X in coding and noncoding regions.


Relative frequency (RF) of X in coding regions = number of occurrences of X / total number of di-codons in coding regions


Estimate the RF of X in noncoding regions in a similar fashion.


X’s relative frequency in a coding region


X’s relative frequency in a noncoding region,



If X has the same RF in coding and noncoding regions - preference value is zero.

- The preference value is positive if X has a higher RF in coding than in a noncoding region; otherwise, it will be negative.

Overall preference value = sum of all preference values of the di-codons.


Positive preference value -> coding region


Negative preference value -> noncoding region.
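The di-codon preference scheme above can be sketched in a few lines. This is a minimal illustration, not the method of any particular gene finder. It assumes the preference value is the logarithm of the ratio of the two relative frequencies (which gives zero for equal RFs, positive for coding-enriched di-codons and negative otherwise). The function names, the pseudocount, and the choice to read noncoding sequence in steps of three are all assumptions made for the example.

```python
import math
from collections import Counter


def dicodon_counts(seqs):
    """Count di-codons (6-mers read in steps of 3) over a list of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, 3):
            counts[s[i:i + 6]] += 1
    return counts


def preference_table(coding, noncoding, pseudo=1.0):
    """Return {di-codon: log(RF_coding / RF_noncoding)}.

    `pseudo` is an assumed pseudocount so that a di-codon missing from
    one set does not produce log(0) or a division by zero.
    """
    c = dicodon_counts(coding)
    n = dicodon_counts(noncoding)
    keys = set(c) | set(n)
    c_total = sum(c.values()) + pseudo * len(keys)
    n_total = sum(n.values()) + pseudo * len(keys)
    table = {}
    for k in keys:
        rf_coding = (c[k] + pseudo) / c_total
        rf_noncoding = (n[k] + pseudo) / n_total
        table[k] = math.log(rf_coding / rf_noncoding)
    return table


def overall_preference(seq, table):
    """Sum the preference values of all di-codons in seq (unknown ones count 0)."""
    seq = seq.upper()
    return sum(table.get(seq[i:i + 6], 0.0) for i in range(0, len(seq) - 5, 3))
```

A region whose overall preference is positive would be called coding under this scheme, and a negative total would point to noncoding sequence.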


GRAIL AND SORFIND,


HIDDEN MARKOV MODELS,



Consecutive 6-mers or di-codons are independent,


Modeling dependence relationships among consecutive di-codons,

Bayesian formula



Similar sequence patterns around the translation starts,


Predict new translation starts based on previously known ones,


Weight matrix,
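A weight matrix for translation starts can be sketched as follows. The slides do not give the exact formula, so this example assumes a log-odds matrix against a uniform background (0.25 per base) with a pseudocount. The function names and the training sites used in the usage note are hypothetical.

```python
import math


def build_weight_matrix(sites, pseudo=1.0):
    """Build a log-odds weight matrix from equal-length aligned sites.

    Assumes a uniform background of 0.25 per base; `pseudo` is an
    assumed pseudocount that keeps unseen bases from scoring -infinity.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for pos in range(length):
        column = [s[pos].upper() for s in sites]
        total = len(column) + 4 * pseudo
        matrix.append({
            base: math.log(((column.count(base) + pseudo) / total) / 0.25)
            for base in "ACGT"
        })
    return matrix


def score_window(window, matrix):
    """Sum the per-position weights; non-ACGT characters contribute 0."""
    return sum(col.get(base, 0.0) for base, col in zip(window.upper(), matrix))


def best_start(seq, matrix):
    """Return (score, position) of the highest-scoring window in seq.

    Raises ValueError if seq is shorter than the matrix width.
    """
    width = len(matrix)
    scores = [(score_window(seq[i:i + width], matrix), i)
              for i in range(len(seq) - width + 1)]
    return max(scores)
```

For instance, training on a few known start regions such as `["AGGAGGATG", "AGGAGCATG", "AGGTGGATG"]` and calling `best_start` on a new sequence returns the offset of the window that most resembles the known starts.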



Identify all ORFs in six reading frames,


Measure the coding potential,


High translation-start score and the whole region has high coding potential


Strong coding potential on right and low coding potential on left.
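The first step above, identifying all ORFs in six reading frames, can be sketched as below. It follows the definition given earlier (a fragment between two stop codons of the same reading frame) and assumes the standard stop codons TAA, TAG and TGA. The function names and the `min_codons` filter are illustrative choices. Coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons=1):
    """Return (start, end) spans lying between consecutive in-frame stop codons."""
    orfs = []
    prev_stop = None
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            if prev_stop is not None and (i - prev_stop - 3) // 3 >= min_codons:
                orfs.append((prev_stop + 3, i))
            prev_stop = i
    return orfs


def six_frame_orfs(seq, min_codons=1):
    """Return (strand, frame, start, end, orf_sequence) for all six frames."""
    seq = seq.upper()
    result = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                result.append((strand, frame, start, end, s[start:end]))
    return result
```

Each ORF returned here would then be scored for coding potential and translation-start strength in the later steps.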



Length distribution of all known genes is not uniform.


Asymmetric and heavy tail on the right side.


Different G+C compositions have different di-codon frequencies,


One set of di-codon RFs leads to incorrect predictions.


Normalization factor.
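The G+C observation above suggests keeping more than one di-codon frequency table and picking one according to the G+C content of the region being scored. The sketch below shows one possible way to do that. The binning scheme, the function names, and the shape of the `tables` argument are assumptions for illustration, not something the slides specify.

```python
def gc_content(seq):
    """Fraction of G and C among the A/C/G/T bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("no A/C/G/T bases in sequence")
    return (seq.count("G") + seq.count("C")) / acgt


def pick_table(seq, tables):
    """Choose a frequency table by G+C content.

    `tables` is a hypothetical list of (low, high, table) entries, each
    covering G+C fractions in the half-open interval [low, high).
    """
    gc = gc_content(seq)
    for low, high, table in tables:
        if low <= gc < high:
            return table
    raise LookupError("no table covers G+C fraction %.2f" % gc)
```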


Do not overlap with any genes,


Reliable prediction software programs,


These regions are masked out before running a gene-finding program.


A non-gene is a region in an ORF that does not overlap any coding regions


Set A contains only genes and set B contains only non-genes,


Examine the common features of sets A & B

consists of a list of vectors for each gene

consists of a list of for each nongene.

- one set consists of all genes and the other set all nongenes.




are connected with



-

main prediction framework.



[Figure: network with input nodes, a hidden layer, and an output node]




BLAST


First to find a subset of genes


Ab initio method to find the rest of the genes in the genome.


EST-based Gene Predictions


Conserved (long) regions across multiple genomes,

(a) megaBLAST

(b) SENSEI

(c) MUMmer

Very long sequence comparisons.


First find short (size of 8) ungapped sequence matches.


Sequences to be aligned are closely related.


Speed up computational time and reduce the memory requirement.


Extend them into longer gapped alignments.
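The seed step described above (short exact matches of size 8) can be sketched with a simple k-mer index. This is an illustration of the general seed-and-extend idea, not the implementation of megaBLAST or any other tool. The function names are made up for the example, and the extension shown is ungapped only. A real tool would follow it with a gapped alignment step, which is omitted here.

```python
from collections import defaultdict


def find_seeds(a, b, k=8):
    """Return (i, j) pairs where a[i:i+k] and b[j:j+k] match exactly."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    return [(i, j)
            for j in range(len(b) - k + 1)
            for i in index.get(b[j:j + k], [])]


def extend_ungapped(a, b, i, j, k=8):
    """Grow a seed left and right while bases match.

    Returns (start_in_a, start_in_b, length) of the extended exact match.
    """
    left = 0
    while i - left > 0 and j - left > 0 and a[i - left - 1] == b[j - left - 1]:
        left += 1
    right = k
    while i + right < len(a) and j + right < len(b) and a[i + right] == b[j + right]:
        right += 1
    return (i - left, j - left, left + right)
```

Indexing one sequence once and looking up the k-mers of the other is what keeps the seed step fast when the two sequences are closely related.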


Utilizes a suffix tree data structure.


Non-contiguous sequence matches.


Much lower time and memory requirements than BLAST.


- predicts genes through genome-scale sequence comparison


[Figure: Genome A, Genome B, genes]


GRAIL:


All predictions divide into .


Genes with scores between are put into the


All genes with scores between in the , etc.




Cont.,


Different reliability thresholds applied for different purposes.

,




for a regular gene prediction program.

,


Mycobacterium leprae has

tRNA (transfer RNA), rRNA (ribosomal RNA), sRNA (small RNA), srpRNA (signal recognition particle RNA), etc.


tRNAs - adapter molecules that decode the genetic code.


rRNAs - catalyze the synthesis of proteins.


Cont.,


(1) RNA signals are a combination of


for example, tRNA genes


designed to recognize particular types of RNA genes.

Cont.,


(2)


Accuracy greater than 99%,


False positive rate at one false prediction per 15 gigabases.







and
,


Transcription process is initiated by
.



Hidden Markov model (HMM)
-

,


Promoter sequences have than that of nonpromoter sequences.


Consensus matrix.

of the conserved k-mers -


Signal Scan and NNPP


Promoter-gene structure or the more general structure of promoter-gene-gene- . . . -gene


(1) Predicted promoter region and a terminator,


(2) Set of genes arranged in tandem on the same strand,


(3) Functional information of the genes involved.
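Criterion (2) above, genes arranged in tandem on the same strand, can be sketched as a simple grouping pass over genes sorted by position. This covers only that one criterion. The `Gene` record, the function name, and the optional `max_gap` cutoff on intergenic distance are assumptions for illustration, and the promoter, terminator and functional evidence from (1) and (3) are not modeled.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record used only for this example."""
    name: str
    start: int
    end: int
    strand: str


def candidate_operons(genes, max_gap=None):
    """Group position-sorted genes into runs of neighbours on the same strand.

    If `max_gap` is given (an assumed cutoff), a run is also broken when the
    distance between the end of one gene and the start of the next exceeds it.
    """
    groups = []
    for gene in sorted(genes, key=lambda g: g.start):
        prev = groups[-1][-1] if groups else None
        same_run = (
            prev is not None
            and gene.strand == prev.strand
            and (max_gap is None or gene.start - prev.end <= max_gap)
        )
        if same_run:
            groups[-1].append(gene)
        else:
            groups.append([gene])
    return groups
```

Each resulting group is only a candidate. The other two criteria would be needed to decide whether it is a real operon.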


Identify transcriptional regulatory networks


rho-dependent and rho-independent,

Three nucleic acid binding sites:


Finds rho-independent transcription terminators (bacterial genomes).


Catalyze successive reactions in


http://genomics4.bu.edu/operons/
,

Cont.,




biosynthesis of tryptophan



phenylpropionate catabolic pathway


Using these known operons,


(1) within an operon vs. between operons,


(2)


EC classes for enzymes,


An ad hoc way,


If of gene is known, its functional category will be labeled.



Correlates with density of genes,


Transcriptional starts of genes.


Commonly used threshold is 0.6.


Human genome threshold is 0.8,


Genome annotation process.

per fixed length of genomic sequence.


Cont.,

Exact and approximate string matching.

Matching all the repeat sequences in its database against the DNA sequence.

Either exact or approximate match, using a clustering technique.

of genes in a genome are unique.

One gene’s location differs from that of its corresponding genes

Cont.,

Defined from ), where b1, b2, . . . , bn is a permutation of a1, a2, . . . , an.



Proteins are annotated in terms of

I. Physical attributes,

II. Molecular weight,

III. Membrane spanning regions,

IV. Structural domains, or three-dimensional structure.


FASTA including sequential positions, methods used for prediction, BLAST hits, etc.



I. KEGG pathways,

II. Pfam families,

III. EC classes,

IV. COG groups.

Cont.,


Modeled Genes,


Functional Assignments,


RNA genes,


Repeats,


General Sequence Features.



An environment for each annotation.



MAGPIE - containing associated with a .


Is an open source annotation tool for microbial genomes,


Ab initio and computational approach,


Models for prediction,


Evaluation,


Large-scale annotation efforts,


RNA-coding genes and their prediction,


Promoter


Structure and function of each gene


Operon


Basic unit of genes,


Genome-scale gene mapping and pathway analysis