Parsing A Bacterial Genome


Mark Craven


Department of Biostatistics & Medical Informatics

University of Wisconsin

U.S.A.


craven@biostat.wisc.edu

www.biostat.wisc.edu/~craven

The Task

Given: a bacterial genome

Do: use computational methods to predict a "parts list" of regulatory elements

Outline

1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training with "weakly" labeled examples
5. refining the structure of a stochastic context free grammar


The Central Dogma of Molecular Biology

Transcription in Bacteria

Operons in Bacteria



- operon: a sequence of one or more genes transcribed as a unit under some conditions
- promoter: "signal" in DNA indicating where to start transcription
- terminator: "signal" indicating where to stop transcription

[Figure: an operon, with a promoter, genes, and a terminator on the DNA, transcribed into a single mRNA.]

The Task Revisited

Given:
- the DNA sequence of the E. coli genome
- coordinates of known/predicted genes
- known instances of operons, promoters, terminators

Do:
- learn models from the known instances
- predict a complete catalog of operons, promoters, and terminators for the genome

Our Approach: Probabilistic Language Models

1. write down a "grammar" for the elements of interest (operons, promoters, terminators, etc.) and the relations among them
2. learn probability parameters from known instances of these elements
3. predict new elements by "parsing" uncharacterized DNA sequence

Transformational Grammars


- a transformational grammar characterizes a set of legal strings
- the grammar consists of:
  - a set of abstract nonterminal symbols
  - a set of terminal symbols (those that actually appear in strings)
  - a set of productions

A Grammar for Stop Codons


- this grammar can generate the 3 stop codons: taa, tag, tga
- with a grammar we can ask questions like:
  - what strings are derivable from the grammar?
  - can a particular string be derived from the grammar?
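The grammar itself appears on the slide only as a figure, so the sketch below uses one plausible set of productions (an assumption, not necessarily the slide's) whose language is exactly {taa, tag, tga}, and answers the two questions above by enumerating the derivable strings.

```python
# Minimal sketch: one possible CFG (assumed, not the slide's exact productions)
# whose language is exactly the three stop codons taa, tag, tga.
GRAMMAR = {
    "S": [["t", "A"]],               # S -> t A
    "A": [["a", "B"], ["g", "a"]],   # A -> a B | g a
    "B": [["a"], ["g"]],             # B -> a | g
}

def derivable(symbols):
    """Enumerate every terminal string derivable from a list of grammar symbols."""
    if not symbols:
        yield ""
        return
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:                       # nonterminal: expand each production
        for rhs in GRAMMAR[head]:
            yield from derivable(rhs + rest)
    else:                                     # terminal: keep it and recurse
        for tail in derivable(rest):
            yield head + tail

print(sorted(set(derivable(["S"]))))          # ['taa', 'tag', 'tga']
print("tag" in set(derivable(["S"])))         # True: 'tag' can be derived
```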

The Parse Tree for tag

A Probabilistic Version of the Grammar

- each production has an associated probability
- the probabilities for productions with the same left-hand side sum to 1
- this grammar has a corresponding Markov chain model

[Figure: the stop codon grammar annotated with production probabilities 1.0, 1.0, 0.7, 0.3, 1.0, 0.2, and 0.8.]
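A small sketch of the probabilistic version, reusing the assumed grammar above and the probability values listed on the slide (1.0, 0.7/0.3, 0.2/0.8); which production each value belongs to is an assumption. It checks that each left-hand side's probabilities sum to 1 and samples strings from the corresponding Markov chain.

```python
import random

# Probabilistic version of the (assumed) stop codon grammar; the assignment
# of the slide's probability values to particular productions is assumed.
PROB_GRAMMAR = {
    "S": [(1.0, ["t", "A"])],
    "A": [(0.7, ["a", "B"]), (0.3, ["g", "a"])],
    "B": [(0.2, ["a"]), (0.8, ["g"])],
}

# probabilities for productions with the same left-hand side must sum to 1
for lhs, prods in PROB_GRAMMAR.items():
    assert abs(sum(p for p, _ in prods) - 1.0) < 1e-9, lhs

def sample(symbol="S"):
    """Generate one string by walking the grammar like a Markov chain."""
    if symbol not in PROB_GRAMMAR:
        return symbol                              # terminal symbol
    probs = [p for p, _ in PROB_GRAMMAR[symbol]]
    rhss = [rhs for _, rhs in PROB_GRAMMAR[symbol]]
    rhs = random.choices(rhss, weights=probs, k=1)[0]
    return "".join(sample(s) for s in rhs)

print([sample() for _ in range(5)])                # e.g. ['tag', 'tga', 'tag', 'taa', 'tag']
```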

A Probabilistic Context Free Grammar for Terminators

START      → PREFIX STEM_BOT1 SUFFIX
PREFIX     → B B B B B B B B B
STEM_BOT1  → t_l STEM_BOT2 t_r
STEM_BOT2  → t_l* STEM_MID t_r*  |  t_l* STEM_TOP2 t_r*
STEM_MID   → t_l* STEM_MID t_r*  |  t_l* STEM_TOP2 t_r*
STEM_TOP2  → t_l* STEM_TOP1 t_r*
STEM_TOP1  → t_l LOOP t_r
LOOP       → B B LOOP_MID B B
LOOP_MID   → B LOOP_MID  |  B
SUFFIX     → B B B B B B B B B
B          → a | c | g | u

where t = {a, c, g, u} and t* = {a, c, g, u, ε}

[Figure: an example terminator sequence folded into its parse: a single-stranded prefix, a base-paired stem, a loop, and a u-rich suffix.]

Inference with Probabilistic Grammars

- for a given string there may be many parses, but some are more probable than others
- we can do prediction by finding relatively high probability parses
- there are dynamic programming algorithms for finding the most probable parse efficiently
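To illustrate the dynamic programming idea, here is a minimal Viterbi-style CYK sketch for a tiny SCFG in Chomsky normal form. The toy grammar and probabilities are assumptions for demonstration only; it is not the terminator grammar, which the actual system parses with analogous algorithms.

```python
import math
from collections import defaultdict

# Toy SCFG in Chomsky normal form (assumed for illustration, not the terminator grammar).
BINARY = [("S", "A", "B", 1.0), ("A", "A", "A", 0.3)]   # (lhs, right1, right2, prob)
UNARY = [("A", "a", 0.7), ("B", "b", 1.0)]              # (lhs, terminal, prob)

def best_parse_logprob(seq, root="S"):
    """Viterbi CYK: log-probability of the most probable parse of seq rooted at `root`."""
    n = len(seq)
    best = defaultdict(lambda: float("-inf"))   # (i, j, nonterminal) -> best log prob
    for i, sym in enumerate(seq):               # width-1 spans from terminal rules
        for lhs, term, p in UNARY:
            if term == sym:
                best[(i, i + 1, lhs)] = max(best[(i, i + 1, lhs)], math.log(p))
    for width in range(2, n + 1):               # wider spans from binary rules
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for lhs, r1, r2, p in BINARY:
                    score = math.log(p) + best[(i, k, r1)] + best[(k, j, r2)]
                    best[(i, j, lhs)] = max(best[(i, j, lhs)], score)
    return best[(0, n, root)]

print(best_parse_logprob(list("aab")))          # log-probability of the best parse of "aab"
```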

Learning with Probabilistic Grammars

- in this work, we write down the productions by hand, but learn the probability parameters
- to learn the probability parameters, we align sequences of a given class (e.g. terminators) with the relevant part of the grammar
- when there is hidden state (i.e. the correct parse is not known), we use Expectation Maximization (EM) algorithms
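A minimal sketch of the fully observed case: when the correct parse of each training sequence is known, the maximum-likelihood production probabilities are just normalized usage counts; when parses are hidden, Inside-Outside/EM replaces the counts with expected counts. The toy parses below reuse the assumed stop-codon grammar from earlier.

```python
from collections import Counter, defaultdict

def estimate_production_probs(parses):
    """parses: list of parses, each a list of (lhs, rhs) productions used.
    Returns maximum-likelihood probabilities, normalized per left-hand side."""
    counts = Counter()
    for parse in parses:
        counts.update(parse)
    totals = defaultdict(float)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {(lhs, rhs): c / totals[lhs] for (lhs, rhs), c in counts.items()}

# toy training data: two observed derivations from the assumed stop-codon grammar
parses = [
    [("S", "t A"), ("A", "a B"), ("B", "g")],   # derivation of tag
    [("S", "t A"), ("A", "g a")],               # derivation of tga
]
print(estimate_production_probs(parses))
# {('S', 't A'): 1.0, ('A', 'a B'): 0.5, ('B', 'g'): 1.0, ('A', 'g a'): 0.5}
```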

Outline

1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models [Bockhorst et al., ISMB/Bioinformatics '03]
4. augmenting training with "weakly" labeled examples
5. refining the structure of a stochastic context free grammar

A Model for Transcription Units

[Figure: the transcription unit model. Its states span an untranscribed region and a transcribed region: promoter (-35 box, -10 box, TSS), spacers, ORFs, a UTR, and rho-independent (RIT) and rho-dependent (RDT) terminators with prefix, stem, loop, and suffix components. Position-specific Markov models, semi-Markov models, and an SCFG serve as the submodels.]

The Components of the Model


- stochastic context free grammars (SCFGs) represent variable-length sequences with long-range dependencies
- semi-Markov models represent variable-length sequences
- position-specific Markov models represent fixed-length sequence motifs
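To make the last component concrete, here is a minimal position-specific model sketch: each motif position has its own base distribution, and a fixed-length window is scored by its log-odds against a background model. The frequencies (loosely resembling the -10 promoter box consensus TATAAT) are illustrative assumptions, and the per-position distributions here are zeroth-order, a simplification of a position-specific Markov model.

```python
import math

# Assumed, illustrative per-position base frequencies for a 6-base motif
# (roughly TATAAT-like); the background model is uniform.
MOTIF = [
    {"a": 0.05, "c": 0.05, "g": 0.05, "t": 0.85},
    {"a": 0.85, "c": 0.05, "g": 0.05, "t": 0.05},
    {"a": 0.05, "c": 0.05, "g": 0.05, "t": 0.85},
    {"a": 0.50, "c": 0.15, "g": 0.15, "t": 0.20},
    {"a": 0.70, "c": 0.10, "g": 0.10, "t": 0.10},
    {"a": 0.05, "c": 0.05, "g": 0.05, "t": 0.85},
]
BACKGROUND = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}

def motif_log_odds(window):
    """Log-odds score of a fixed-length window under the motif vs. background model."""
    assert len(window) == len(MOTIF)
    return sum(math.log(MOTIF[i][b] / BACKGROUND[b]) for i, b in enumerate(window))

print(motif_log_odds("tataat"))   # consensus-like window scores high
print(motif_log_odds("ggcgcg"))   # unrelated window scores low
```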

Gene Expression Data


- in addition to DNA sequence data, we also use expression data to make our parses
- microarrays enable the simultaneous measurement of the transcription levels of thousands of genes

[Figure: a microarray data matrix with genes/sequence positions as rows and experimental conditions as columns.]

Incorporating Expression Data

[Figure: a DNA sequence aligned with a parallel sequence of expression measurements.]

- our models parse two sequences simultaneously:
  - the DNA sequence of the genome
  - a sequence of expression measurements associated with particular sequence positions
- the expression data is useful because it provides information about which subsequences look like they are transcribed together
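A sketch of the joint-parsing idea at the level of a single position: each model state emits both a DNA base and an expression measurement, so a state for a transcribed region is rewarded when the observed expression level is high. The specific emission models here (uniform base probabilities, Gaussian expression levels with assumed means) are illustrative assumptions, not the talk's exact parameterization.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def joint_emission_logprob(base, expr, base_probs, expr_mean, expr_std):
    """Log-probability of one position's (base, expression level) pair for a state."""
    return math.log(base_probs[base]) + gaussian_logpdf(expr, expr_mean, expr_std)

uniform = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
# assumed expression-level distributions for two kinds of states
transcribed = dict(base_probs=uniform, expr_mean=5.0, expr_std=1.0)
untranscribed = dict(base_probs=uniform, expr_mean=0.0, expr_std=1.0)

print(joint_emission_logprob("g", 4.5, **transcribed))    # favored by a transcribed state
print(joint_emission_logprob("g", 4.5, **untranscribed))  # much lower
```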

Predictive Accuracy for Operons

Predictive Accuracy for Promoters

Predictive Accuracy for Terminators

Accuracy of Promoter & Terminator Localization

Terminator Predictive Accuracy



Outline

1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training data with "weakly" labeled examples [Bockhorst & Craven, ICML '02]
5. refining the structure of a stochastic context free grammar

Key Idea: Weakly Labeled Examples


- regulatory elements are inter-related:
  - promoters precede operons
  - terminators follow operons
  - etc.
- relationships such as these can be exploited to augment training sets with "weakly labeled" examples


Inferring “Weakly” Labeled Examples


[Figure: a stretch of double-stranded DNA with genes g1 through g5 annotated.]

- if we know that an operon ends at g4, then there must be a terminator shortly downstream
- if we know that an operon begins at g2, then there must be a promoter shortly upstream
- we can exploit relations such as these to augment our training sets (see the sketch below)
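A minimal sketch of how such weakly labeled examples could be generated from genome coordinates; the window sizes and helper names are assumptions for illustration, and the real system works from its annotated gene and operon boundaries.

```python
# Sketch of turning known operon boundaries into weakly labeled examples.
# Window sizes are assumptions; coordinates are 0-based genome positions.
def weakly_labeled_terminator(genome, operon_end, window=100):
    """The region shortly downstream of an operon's last gene must contain a terminator."""
    return genome[operon_end:operon_end + window]

def weakly_labeled_promoter(genome, operon_start, window=100):
    """The region shortly upstream of an operon's first gene must contain a promoter."""
    return genome[max(0, operon_start - window):operon_start]

genome = "acgt" * 500                      # placeholder sequence
print(weakly_labeled_terminator(genome, operon_end=1200)[:20])
print(weakly_labeled_promoter(genome, operon_start=400)[:20])
```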

Strongly vs. Weakly Labeled Terminator Examples

strongly labeled terminator (sub-class: rho-independent; the extent of the terminator, bracketed below, and the end of its stem-loop are annotated):
gtccgttccgccactattcactcatgaaaatgag[ttcagagagccgcaagatttttaattttgcggtttttttgtatttgaatt]ccaccatttctctgttcaatg

weakly labeled terminator (known only to contain a terminator somewhere):
gtccgttccgccactattcactcatgaaaatgagttcagagagccgcaagatttttaattttgcggtttttttgtatttgaattccaccatttctctgttcaatg

Training the Terminator Models: Strongly Labeled Examples

[Figure: rho-independent examples train the rho-independent terminator model, rho-dependent examples train the rho-dependent terminator model, and negative examples train the negative model.]

Training the Terminator Models: Weakly Labeled Examples

[Figure: weakly labeled examples are used to train a combined terminator model consisting of the rho-independent terminator, rho-dependent terminator, and negative models.]

Do Weakly Labeled Terminator Examples Help?

- task: classification of terminators (both sub-classes) in E. coli K-12
- train the SCFG terminator model using:
  - S strongly labeled examples and
  - W weakly labeled examples
- evaluate using area under ROC curves
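For reference, this evaluation measure can be computed directly from the model's scores with the rank-based (Mann-Whitney) formulation sketched below; the scores are made-up values for illustration.

```python
def area_under_roc(pos_scores, neg_scores):
    """AUC = probability that a random positive is scored above a random negative."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5     # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))

# illustrative terminator-model scores for positive and negative test sequences
print(area_under_roc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))   # ~0.89
```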

Learning Curves using Weakly Labeled Terminators

[Figure: area under the ROC curve (0.5 to 1) vs. number of strong positive examples (0 to 140), with curves for 0, 25, and 250 weak examples.]

Are Weakly Labeled Examples Better than Unlabeled Examples?

- train the SCFG terminator model using:
  - S strongly labeled examples and
  - U unlabeled examples
- vary S and U to obtain learning curves

Training the Terminator Models: Unlabeled Examples

[Figure: unlabeled examples are used to train a combined model consisting of the rho-independent terminator, rho-dependent terminator, and negative models.]

Learning Curves: Weak vs. Unlabeled

[Figure: two panels of area under the ROC curve (0.6 to 1) vs. number of strong positive examples (0 to 120): one panel for 0, 25, and 250 unlabeled examples, and one panel for 0, 25, and 250 weak examples.]

Are Weakly Labeled Terminators from Predicted Operons Useful?

- train the operon model with S labeled operons
- predict operons
- generate W weakly labeled terminators from the W most confident predictions
- vary S and W

Learning Curves using Weakly Labeled Terminators

[Figure: area under the ROC curve (0.5 to 1) vs. number of strong positive examples (0 to 160), with curves for 0, 25, 100, and 200 weak examples.]

Outline

1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training with "weakly" labeled examples
5. refining the structure of a stochastic context free grammar [Bockhorst & Craven, IJCAI '01]

Learning SCFGs


- given the productions of a grammar, we can learn the probabilities using the Inside-Outside algorithm
- we have developed an algorithm that can add new nonterminals & productions to a grammar during learning
- basic idea:
  - identify nonterminals that seem to be "overloaded"
  - split these nonterminals into two; allow each to specialize


Refining the Grammar in a SCFG

- there are various "contexts" in which each grammar nonterminal may be used
- consider two contexts for a given nonterminal; suppose its production probabilities are (0.4, 0.4, 0.1, 0.1) in one context
- if the probabilities for the nonterminal look very different depending on its context, say (0.1, 0.1, 0.4, 0.4) in the other context, we add a new nonterminal and let each copy specialize

Refining the Grammar in a SCFG

- we can compare two probability distributions P and Q using Kullback-Leibler divergence
- for example, P = (0.4, 0.4, 0.1, 0.1) and Q = (0.1, 0.1, 0.4, 0.4)
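The divergence referred to above is D(P || Q) = sum_i P(i) log(P(i) / Q(i)). The short sketch below applies it to the slide's two context-specific distributions; using a large divergence as the trigger for splitting a nonterminal is shown here only as an illustration of the heuristic.

```python
import math

# Kullback-Leibler divergence D(P || Q) = sum_i P(i) * log(P(i) / Q(i)),
# applied to the two context-specific production distributions from the slide.
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.4, 0.4, 0.1, 0.1]   # production probabilities in one context
Q = [0.1, 0.1, 0.4, 0.4]   # production probabilities in the other context

print(kl_divergence(P, Q))  # ~0.83 nats; a large divergence suggests splitting the nonterminal
```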

Learning Terminator SCFGs


- extracted a grammar from the literature (~120 productions)
- data set consists of 142 known E. coli terminators and 125 sequences that do not contain terminators
- learn parameters using the Inside-Outside algorithm (an EM algorithm)
- consider adding nonterminals guided by three heuristics:
  - KL divergence
  - chi-squared
  - random



SCFG Accuracy After Adding 25 New Nonterminals



SCFG Accuracy vs. Nonterminals Added



Conclusions


- summary
  - we have developed an approach to predicting transcription units in bacterial genomes
  - we have predicted a complete set of transcription units for the E. coli genome
- advantages of the probabilistic grammar approach
  - can readily incorporate background knowledge
  - can simultaneously get a coherent set of predictions for a set of related elements
  - can be easily extended to incorporate other genomic elements
- current directions
  - expanding the vocabulary of elements modeled (genes, transcription factor binding sites, etc.)
  - handling overlapping elements
  - making predictions for multiple related genomes

Acknowledgements


- Craven Lab: Joe Bockhorst, Keith Noto
- David Page, Jude Shavlik
- Blattner Lab: Fred Blattner, Jeremy Glasner, Mingzhu Liu, Yu Qiu
- funding from the National Science Foundation and the National Institutes of Health