Parsing A Bacterial Genome
Mark Craven
Department of Biostatistics & Medical Informatics
University of Wisconsin
U.S.A.
craven@biostat.wisc.edu
www.biostat.wisc.edu/~craven
The Task
Given: a bacterial genome
Do: use computational methods to predict a “parts list” of regulatory elements
Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training with “weakly” labeled examples
5. refining the structure of a stochastic context free grammar
The Central Dogma of Molecular Biology
Transcription in Bacteria
Operons in Bacteria
• operon: sequence of one or more genes transcribed as a unit under some conditions
• promoter: “signal” in DNA indicating where to start transcription
• terminator: “signal” indicating where to stop transcription
[Figure: an operon consisting of a promoter, several genes, and a terminator, transcribed into a single mRNA.]
The Task Revisited
Given:
– DNA sequence of the E. coli genome
– coordinates of known/predicted genes
– known instances of operons, promoters, terminators
Do:
– learn models from known instances
– predict a complete catalog of operons, promoters, and terminators for the genome
Our Approach: Probabilistic Language Models
1. write down a “grammar” for elements of interest (operons, promoters, terminators, etc.) and relations among them
2. learn probability parameters from known instances of these elements
3. predict new elements by “parsing” uncharacterized DNA sequence
Transformational Grammars
• a transformational grammar characterizes a set of legal strings
• the grammar consists of
– a set of abstract nonterminal symbols
– a set of terminal symbols (those that actually appear in strings)
– a set of productions
A Grammar for Stop Codons
• this grammar can generate the 3 stop codons: taa, tag, tga
• with a grammar we can ask questions like
– what strings are derivable from the grammar?
– can a particular string be derived from the grammar?
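The productions themselves appear only in the slide graphic. As a rough illustration, the sketch below encodes a hypothetical set of productions (S → t W, W → a X | g a, X → a | g) that generates exactly taa, tag, and tga, and answers the second question by brute-force expansion; the names S, W, and X are made up for this example.

# A minimal sketch (Python), assuming hypothetical productions that generate
# exactly the three stop codons taa, tag, tga.
GRAMMAR = {
    "S": [["t", "W"]],                 # S -> t W
    "W": [["a", "X"], ["g", "a"]],     # W -> a X | g a
    "X": [["a"], ["g"]],               # X -> a | g
}

def derivable(symbols, target):
    """Can the string `target` be derived from the symbol sequence `symbols`?"""
    if not symbols:
        return target == ""
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:                # nonterminal: try each of its productions
        return any(derivable(rhs + rest, target) for rhs in GRAMMAR[head])
    # terminal: it must match the next character of the target
    return target.startswith(head) and derivable(rest, target[len(head):])

print([s for s in ("taa", "tag", "tga", "tgg") if derivable(["S"], s)])
# -> ['taa', 'tag', 'tga']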
The Parse Tree for tag
A Probabilistic Version of the Grammar
• each production has an associated probability
• the probabilities for productions with the same left-hand side sum to 1
• this grammar has a corresponding Markov chain model
[Figure: the stop-codon grammar with probabilities attached to its productions (values such as 1.0, 0.7/0.3, and 0.2/0.8 on alternative right-hand sides) and the corresponding Markov chain.]
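Continuing the hypothetical grammar above, the sketch below attaches a probability to each production (reusing the 0.7/0.3 and 0.2/0.8 values from the slide, though which productions they belong to is an assumption) and computes the total probability of deriving each stop codon; note that productions sharing a left-hand side sum to 1, and so do the probabilities of the three codons.

# A sketch of the probabilistic version of the same toy grammar.
PGRAMMAR = {
    "S": [(1.0, ["t", "W"])],
    "W": [(0.7, ["a", "X"]), (0.3, ["g", "a"])],
    "X": [(0.8, ["a"]), (0.2, ["g"])],
}

def string_probability(symbols, target, p=1.0):
    """Total probability of deriving `target` from the symbol sequence `symbols`."""
    if not symbols:
        return p if target == "" else 0.0
    head, rest = symbols[0], symbols[1:]
    if head in PGRAMMAR:
        return sum(string_probability(rhs + rest, target, p * q)
                   for q, rhs in PGRAMMAR[head])
    if target.startswith(head):
        return string_probability(rest, target[len(head):], p)
    return 0.0

for s in ("taa", "tag", "tga"):
    print(s, string_probability(["S"], s))   # 0.56, 0.14, 0.3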
A Probabilistic Context Free Grammar for Terminators
START → PREFIX STEM_BOT1 SUFFIX
PREFIX → B B B B B B B B B
STEM_BOT1 → t_l STEM_BOT2 t_r
STEM_BOT2 → t_l* STEM_MID t_r* | t_l* STEM_TOP2 t_r*
STEM_MID → t_l* STEM_MID t_r* | t_l* STEM_TOP2 t_r*
STEM_TOP2 → t_l* STEM_TOP1 t_r*
STEM_TOP1 → t_l LOOP t_r
LOOP → B B LOOP_MID B B
LOOP_MID → B LOOP_MID | ε
SUFFIX → B B B B B B B B B
B → a | c | g | u
where t = {a, c, g, u} and t* = {a, c, g, u, ε}
[Figure: an example terminator sequence parsed by the grammar into a prefix, a base-paired stem, a loop, and a u-rich suffix.]
Inference with Probabilistic Grammars
• for a given string there may be many parses, but some are more probable than others
• we can do prediction by finding relatively high probability parses
• there are dynamic programming algorithms for finding the most probable parse efficiently
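The talk does not spell out the algorithm, but a standard example of such a dynamic program is a Viterbi-style CYK parser for a PCFG in Chomsky normal form; the sketch below is that generic algorithm applied to a CNF version of the toy stop-codon grammar, not the parser actually used for the terminator grammar.

import math
from collections import defaultdict

def viterbi_cyk(tokens, unary, binary, start="S"):
    """Log-probability of the most probable parse of `tokens` under a CNF PCFG.

    unary:  {(A, terminal): prob}   for rules A -> terminal
    binary: {(A, B, C): prob}       for rules A -> B C
    """
    n = len(tokens)
    best = defaultdict(lambda: float("-inf"))       # best[(i, j, A)]

    for i, tok in enumerate(tokens):                # spans of length 1
        for (A, term), p in unary.items():
            if term == tok:
                best[(i, i + 1, A)] = max(best[(i, i + 1, A)], math.log(p))

    for length in range(2, n + 1):                  # longer spans, shortest first
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):               # split point
                for (A, B, C), p in binary.items():
                    score = math.log(p) + best[(i, k, B)] + best[(k, j, C)]
                    best[(i, j, A)] = max(best[(i, j, A)], score)
    return best[(0, n, start)]

# CNF version of the toy stop-codon grammar
unary = {("T", "t"): 1.0, ("A", "a"): 1.0, ("G", "g"): 1.0,
         ("X", "a"): 0.8, ("X", "g"): 0.2}
binary = {("S", "T", "W"): 1.0, ("W", "A", "X"): 0.7, ("W", "G", "A"): 0.3}
print(viterbi_cyk(list("tag"), unary, binary))      # log(0.14)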
Learning with Probabilistic Grammars
• in this work, we write down the productions by hand, but learn the probability parameters
• to learn the probability parameters, we align sequences of a given class (e.g. terminators) with the relevant part of the grammar
• when there is hidden state (i.e. the correct parse is not known), we use Expectation-Maximization (EM) algorithms
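When the parses of the training sequences are fully observed (the aligned case above), the probability parameters are simply normalized production counts; the sketch below shows that maximum-likelihood step on the toy grammar. With hidden parses, the same counts would instead be expected counts computed by an EM procedure such as Inside-Outside.

from collections import Counter

def estimate_production_probs(parsed_examples):
    """Maximum-likelihood production probabilities from fully observed parses.

    parsed_examples: list of parses, each a list of (lhs, rhs) production uses.
    """
    counts, lhs_totals = Counter(), Counter()
    for parse in parsed_examples:
        for lhs, rhs in parse:
            counts[(lhs, rhs)] += 1
            lhs_totals[lhs] += 1
    # normalize so that productions with the same left-hand side sum to 1
    return {(lhs, rhs): c / lhs_totals[lhs] for (lhs, rhs), c in counts.items()}

# two toy parses ("taa" and "tga") of the hypothetical stop-codon grammar
parses = [
    [("S", ("t", "W")), ("W", ("a", "X")), ("X", ("a",))],
    [("S", ("t", "W")), ("W", ("g", "a"))],
]
print(estimate_production_probs(parses))
# S -> t W gets 1.0; W -> a X and W -> g a each get 0.5; X -> a gets 1.0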
Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models [Bockhorst et al., ISMB/Bioinformatics ’03]
4. augmenting training with “weakly” labeled examples
5. refining the structure of a stochastic context free grammar
A Model for Transcription Units
[Figure: the transcription unit model. Untranscribed and transcribed regions are divided into submodels for the promoter (-35 box, -10 box, TSS), spacers, ORFs (including a last ORF), UTRs, and rho-independent (RIT) and rho-dependent (RDT) terminators with prefix, stem, loop, and suffix parts; the submodels are built from SCFGs, position-specific Markov models, and semi-Markov models.]
The Components of the Model
• stochastic context free grammars (SCFGs) represent variable-length sequences with long-range dependencies
• semi-Markov models represent variable-length sequences
• position-specific Markov models represent fixed-length sequence motifs (see the sketch below)
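As one concrete (and simplified) reading of the third component, the sketch below scores a fixed-length motif with a first-order, position-specific Markov model, i.e. each motif position has its own conditional distribution over bases given the previous base; the order and parameter values are assumptions, not the exact parameterization used for the promoter boxes in the talk.

import math

class PositionSpecificMarkovModel:
    """Fixed-length motif model with position-specific conditional probabilities.

    probs[i] maps (previous_base, base) -> probability at motif position i;
    position 0 conditions on the dummy start symbol "^".
    """
    def __init__(self, probs):
        self.probs = probs
        self.length = len(probs)

    def log_prob(self, motif):
        assert len(motif) == self.length
        prev, total = "^", 0.0
        for i, base in enumerate(motif):
            total += math.log(self.probs[i][(prev, base)])
            prev = base
        return total

# a toy 2-position motif that strongly prefers "ta"
uniform = {(p, b): 0.25 for p in "acgt" for b in "acgt"}
toy = PositionSpecificMarkovModel([
    {("^", "t"): 0.7, ("^", "a"): 0.1, ("^", "c"): 0.1, ("^", "g"): 0.1},
    {**uniform, ("t", "a"): 0.8, ("t", "t"): 0.1, ("t", "c"): 0.05, ("t", "g"): 0.05},
])
print(toy.log_prob("ta"))   # log(0.7 * 0.8)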
Gene Expression Data
• in addition to DNA sequence data, we also use expression data to make our parses
• microarrays enable the simultaneous measurement of the transcription levels of thousands of genes
[Figure: a matrix of microarray measurements indexed by genes/sequence positions and by experimental conditions.]
Incorporating Expression Data
[Figure: a genomic DNA sequence with an associated sequence of expression measurements.]
• our models parse two sequences simultaneously
– the DNA sequence of the genome
– a sequence of expression measurements associated with particular sequence positions
• the expression data is useful because it provides information about which subsequences look like they are transcribed together (see the sketch below)
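The talk does not give the exact form of the joint model, but one simple way a state can emit both a base and an expression measurement is to assume they are conditionally independent given the state, with a categorical distribution over bases and a Gaussian over the expression level; the sketch below is only that assumption, with made-up parameter values.

import math

def joint_log_emission(base, expr, base_probs, expr_mean, expr_std):
    """Log-probability of a (base, expression) pair emitted by one model state."""
    log_base = math.log(base_probs[base])
    log_expr = (-0.5 * ((expr - expr_mean) / expr_std) ** 2
                - math.log(expr_std * math.sqrt(2 * math.pi)))
    return log_base + log_expr

# e.g. a "transcribed" state expects higher expression than an "untranscribed" one
transcribed = dict(base_probs={"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},
                   expr_mean=2.0, expr_std=1.0)
untranscribed = dict(base_probs={"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3},
                     expr_mean=0.0, expr_std=1.0)
print(joint_log_emission("a", 1.8, **transcribed))
print(joint_log_emission("a", 1.8, **untranscribed))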
Predictive Accuracy for Operons
Predictive Accuracy for Promoters
Predictive Accuracy for Terminators
Accuracy of Promoter & Terminator Localization
Terminator Predictive Accuracy
Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training data with “weakly” labeled examples [Bockhorst & Craven, ICML ’02]
5. refining the structure of a stochastic context free grammar
Key Idea: Weakly Labeled Examples
• regulatory elements are inter-related
– promoters precede operons
– terminators follow operons
– etc.
• relationships such as these can be exploited to augment training sets with “weakly labeled” examples
Inferring “Weakly” Labeled Examples
[Figure: a stretch of double-stranded genomic DNA containing genes g1 through g5.]
• if we know that an operon ends at g4, then there must be a terminator shortly downstream
• if we know that an operon begins at g2, then there must be a promoter shortly upstream
• we can exploit relations such as this to augment our training sets (see the sketch below)
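A minimal sketch of that augmentation step, assuming forward-strand coordinates and an arbitrary 100 bp window (the window size used in the talk is not stated here): given a known operon, the region shortly upstream of its first gene becomes a weakly labeled promoter example and the region shortly downstream of its last gene becomes a weakly labeled terminator example.

def weakly_labeled_windows(genome, operon, window=100):
    """Extract weakly labeled promoter/terminator regions around a known operon.

    `operon` is a (start, end) coordinate pair on the forward strand.
    """
    start, end = operon
    promoter_region = genome[max(0, start - window):start]    # shortly upstream
    terminator_region = genome[end:end + window]               # shortly downstream
    return promoter_region, terminator_region

# usage: weak examples from an operon spanning positions 400-1200
genome = "acgt" * 500
weak_promoter, weak_terminator = weakly_labeled_windows(genome, (400, 1200))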
Strongly vs. Weakly Labeled Terminator Examples
strongly labeled terminator: the extent of the terminator and the end of its stem-loop are marked, and its sub-class (rho-independent) is known
gtccgttccgccactattcactcatgaaaatgag [ttcagagagccgcaagatttttaattttgcggtttttttgtatttgaatt] ccaccatttctctgttcaatg
weakly labeled terminator: only a sequence known to contain a terminator somewhere
gtccgttccgccactattcactcatgaaaatgagttcagagagccgcaagatttttaattttgcggtttttttgtatttgaattccaccatttctctgttcaatg
Training the Terminator Models: Strongly Labeled Examples
[Figure: with strongly labeled data, rho-independent examples train the rho-independent terminator model, rho-dependent examples train the rho-dependent terminator model, and negative examples train the negative model.]
Training the Terminator Models: Weakly Labeled Examples
[Figure: with weakly labeled data, negative examples train the negative model, and the weakly labeled examples train a combined terminator model built from the rho-independent and rho-dependent terminator models.]
Do Weakly Labeled Terminator Examples Help?
• task: classification of terminators (both sub-classes) in E. coli K-12
• train SCFG terminator model using:
– S strongly labeled examples and
– W weakly labeled examples
• evaluate using area under ROC curves (see the sketch below)
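A small sketch of the evaluation step, assuming each candidate sequence gets a real-valued score from the terminator model (e.g. a log-likelihood ratio against the negative model) and that scikit-learn is available; the scores and labels below are invented for illustration.

from sklearn.metrics import roc_auc_score

y_true = [1, 1, 1, 0, 0, 0, 0]                       # 1 = known terminator
y_score = [2.3, 1.1, 0.4, 0.7, -0.2, -1.5, 0.1]      # model scores
print(roc_auc_score(y_true, y_score))                # area under the ROC curve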
Learning Curves using Weakly Labeled Terminators
[Figure: area under the ROC curve vs. number of strong positive examples, for models trained with 0, 25, or 250 weakly labeled examples.]
Are Weakly Labeled Examples Better than Unlabeled Examples?
• train SCFG terminator model using:
– S strongly labeled examples and
– U unlabeled examples
• vary S and U to obtain learning curves
Training the Terminator Models: Unlabeled Examples
[Figure: with unlabeled data, the unlabeled examples train a combined model built from the rho-dependent terminator model, the rho-independent terminator model, and the negative model.]
Learning Curves: Weak vs. Unlabeled
[Figure: area under the ROC curve vs. number of strong positive examples, comparing models trained with 0, 25, or 250 weakly labeled examples against models trained with 0, 25, or 250 unlabeled examples.]
Are Weakly Labeled Terminators from Predicted Operons Useful?
• train operon model with S labeled operons
• predict operons
• generate W weakly labeled terminators from the W most confident predictions
• vary S and W
Learning Curves using Weakly Labeled Terminators
[Figure: area under the ROC curve vs. number of strong positive examples, for models trained with 0, 25, 100, or 200 weakly labeled examples derived from predicted operons.]
Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training with “weakly” labeled examples
5. refining the structure of a stochastic context free grammar [Bockhorst & Craven, IJCAI ’01]
Learning SCFGs
• given the productions of a grammar, we can learn the probabilities using the Inside-Outside algorithm
• we have developed an algorithm that can add new nonterminals & productions to a grammar during learning
• basic idea:
– identify nonterminals that seem to be “overloaded”
– split these nonterminals in two; allow each to specialize
Refining the Grammar in a SCFG
• there are various “contexts” in which each grammar nonterminal may be used
• consider two contexts for a given nonterminal
• if the nonterminal’s production probabilities look very different depending on its context, we add a new nonterminal and let each copy specialize
[Figure: example production probability distributions for the same nonterminal in two contexts, roughly (0.4, 0.4, 0.1, 0.1) in one context and (0.1, 0.1, 0.4, 0.4) in the other.]
Refining the Grammar in a SCFG
• we can compare two probability distributions P and Q using Kullback-Leibler divergence (see the sketch below)
[Figure: the two context-specific distributions from the previous slide, (0.4, 0.4, 0.1, 0.1) and (0.1, 0.1, 0.4, 0.4), labeled P and Q.]
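A minimal sketch of the comparison, using the two distributions above; whether the method uses this direction of the divergence, a symmetrized version, or a particular splitting threshold is not specified here.

import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.4, 0.4, 0.1, 0.1]
q = [0.1, 0.1, 0.4, 0.4]
print(kl_divergence(p, q))   # ~0.83 nats; a large value suggests splitting the nonterminal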
Learning Terminator SCFGs
• extracted grammar from the literature (~120 productions)
• data set consists of 142 known E. coli terminators and 125 sequences that do not contain terminators
• learn parameters using the Inside-Outside algorithm (an EM algorithm)
• consider adding nonterminals guided by three heuristics
– KL divergence
– chi-squared
– random
SCFG Accuracy After Adding 25 New Nonterminals
SCFG Accuracy vs. Nonterminals Added
Conclusions
• summary
– we have developed an approach to predicting transcription units in bacterial genomes
– we have predicted a complete set of transcription units for the E. coli genome
• advantages of the probabilistic grammar approach
– can readily incorporate background knowledge
– can simultaneously get a coherent set of predictions for a set of related elements
– can be easily extended to incorporate other genomic elements
• current directions
– expanding the vocabulary of elements modeled (genes, transcription factor binding sites, etc.)
– handling overlapping elements
– making predictions for multiple related genomes
Acknowledgements
• Craven Lab: Joe Bockhorst, Keith Noto
• David Page, Jude Shavlik
• Blattner Lab: Fred Blattner, Jeremy Glasner, Mingzhu Liu, Yu Qiu
• funding from the National Science Foundation and the National Institutes of Health