Regular grammars, Trinucleotide repeats, PROSITE motifs, and ...

californiamandrillSoftware and s/w Development

Dec 13, 2013 (3 years and 6 months ago)


Transformational Grammars


“Colourless green ideas sleep furiously” - Noam Chomsky

We might ask “Is this novel sentence (or sequence!) grammatical?” i.e., does the language described by some grammar validly contain this sentence?

Chomsky turned this question on its head and instead asked: “Could the grammar we’re considering have possibly generated this sentence?”

He developed finite formal machines (“grammars”) that can theoretically recursively enumerate the infinitude of possible sentences of the corresponding language.


Transformational Grammars

The Chomsky hierarchy of grammars

The more deeply nested the grammar, the simpler the rules. The innermost grammars are the easiest to parse, but are also the most restricted.

[Figure: nested sets, from outermost to innermost: Unrestricted, Context-sensitive, Context-free, Regular]

Slide after Durbin et al., 1998

Regular Grammars

Symbols and Productions (A.K.A “rewriting rules”)

All transformational grammars are defined by their set of
symbols and the production rules for manipulating
strings consisting of those symbols

Only two types of symbols:

Terminals (generically represented as “a”): these actually appear in the final observed string (so imagine nucleotide or amino acid symbols)

Non-terminals (generically represented as “W”): abstract symbols, easiest to see how they are used through example. The start state (usually shown as “S”) is a commonly used non-terminal

The non-terminals are often used as place holders that disappear from the final string

Regular Grammars

Symbols and Productions (A.K.A “rewriting rules”)

Only two productions are allowed in a regular grammar!

We often also use a special terminal symbol “e” which is used to denote the null string and to end a production…

W → aW

W → a

W → e

Don’t freak out! It’s easier to demonstrate how this
all works than it is to describe!

Regular Grammars

Symbols and Productions (A.K.A “rewriting rules”)

Here’s a trivial regular grammar that can produce all possible nucleotide sequences:

W = {S = “Start”}
a = {A,G,C,T,e}

S → AS
S → CS
S → GS
S → TS
S → e

Imagine we always start with S. We can then repeatedly choose any of the valid productions, with S being replaced each time by the string on the right hand side of the production we’ve chosen.

Regular Grammars

Symbols and Productions (A.K.A “rewriting rules”)

The same trivial grammar for all possible nucleotide sequences, written compactly with “|” separating the alternative productions:

W = {S = “Start”}
a = {A,G,C,T,e}

S → AS | CS | GS | TS | e
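The derivation process for this grammar can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `PRODUCTIONS` and `generate` are hypothetical, each production is stored as a (terminal, next non-terminal) pair, and the null production S → e is represented by an empty string with no following non-terminal. Productions are chosen uniformly at random, which the slides do not specify.

```python
import random

# The trivial nucleotide grammar: S -> AS | CS | GS | TS | e.
# Each production is (terminal, next non-terminal);
# ("", None) stands in for the null production S -> e.
PRODUCTIONS = {
    "S": [("A", "S"), ("C", "S"), ("G", "S"), ("T", "S"), ("", None)],
}


def generate(productions, start="S", rng=random):
    """Derive one string by repeatedly replacing the current non-terminal."""
    emitted = []
    current = start
    while current is not None:
        # Pick any valid production for the current non-terminal.
        terminal, current = rng.choice(productions[current])
        emitted.append(terminal)
    return "".join(emitted)


# A few sample derivations; the seeds only make the runs repeatable.
for seed in range(3):
    print(repr(generate(PRODUCTIONS, rng=random.Random(seed))))
```

Every string produced this way contains only A, C, G and T, and any nucleotide sequence (including the empty one) can in principle be derived.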

Protein motifs as regular grammars

“Classic” PROSITE motifs

S → rW1 | kW1
W1 → gW2
W2 → [afilmnqstvwy]W3
W3 → [agsci]W4
W4 → fW5 | yW5
W5 → lW6 | iW6 | vW6 | aW6
W6 → [acdefghiklmnpqrstvwy]W7
W7 → f | y | m

RNP-1 Motif

RU1A_HUMAN    SRSLKMRGQAFVIFKEVSSAT
SKLF_DROME    KLTGRPRGVAFVRYNKREEAQ
ROC_HUMAN     VGCSVHKGFAFVQYVNERNAR
ELAV_DROME    GNDTQTKGVGFIRFDKREEAT

[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]

Slide after Durbin et al., 1998

Does this remind you of anything we’ve seen before?
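One way to make the connection concrete is to translate the PROSITE pattern into a Python regular expression. The sketch below is an assumption-laden illustration, not a full PROSITE parser: the helper name `prosite_to_regex` is hypothetical, and it only handles the element types that appear in the RNP-1 pattern (single residues, `[...]` allowed sets, `{...}` excluded sets, and `x` for any residue). Repeat counts and anchors are not handled.

```python
import re

RNP1_PATTERN = "[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]"

# The four example sequences listed on the slide.
SEQUENCES = {
    "RU1A_HUMAN": "SRSLKMRGQAFVIFKEVSSAT",
    "SKLF_DROME": "KLTGRPRGVAFVRYNKREEAQ",
    "ROC_HUMAN": "VGCSVHKGFAFVQYVNERNAR",
    "ELAV_DROME": "GNDTQTKGVGFIRFDKREEAT",
}


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into regular expression syntax."""
    parts = []
    for element in pattern.split("-"):
        if element == "x":
            parts.append(".")  # any residue
        elif element.startswith("{") and element.endswith("}"):
            parts.append("[^" + element[1:-1] + "]")  # excluded residues
        else:
            parts.append(element)  # single residue or [allowed set]
    return "".join(parts)


RNP1_REGEX = re.compile(prosite_to_regex(RNP1_PATTERN))

for name, seq in SEQUENCES.items():
    match = RNP1_REGEX.search(seq)
    print(name, match.group(0) if match else None)
```

Each bracketed set in the pattern plays the same role as a group of alternative productions in the regular grammar.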

Automata

Formal grammars are generative. However, each Chomsky grammar can be parsed using a corresponding abstract computational machine, or automaton.

The automata for the two most general grammars are of great theoretical interest but are of less practical significance for us because of the time and space complexity of the algorithms: their decision problems may only be computationally feasible in special cases.

Grammar                      Parsing automaton
Regular grammar              Finite State automaton
Context-free grammar         Push-down automaton
Context-sensitive grammar    Linear bounded automaton
Unrestricted grammar         Turing machine

We will focus on the first two only!!

Trinucleotide Repeat Disorders

A family of diseases resulting from a trinucleotide expansion

Can we identify sequences with well-defined repeat characteristics?

Fragile X: associated with 200 to 4000 repeats of a CGG trinucleotide in the FMR-1 gene

Unaffected individuals typically have 5-40 copies, but individuals with intermediate numbers are considered to have a “premutation” with variable penetrance

CAG Repeats: at least 9 different “PolyQ” disorders have been identified so far. Most are autosomal dominant

Huntington disease: affected individuals have >35 copies of the CAG repeat in the HD (Huntington Disease) gene
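As a first, rough answer to the question of identifying repeat characteristics, the longest uninterrupted run of a trinucleotide can be counted with a regular expression. This is a minimal sketch: the function name `longest_repeat_run` and the toy sequence are hypothetical, and it assumes the repeat unit cannot overlap itself (true for CAG and CGG). It is an illustration of repeat counting only, not a diagnostic procedure.

```python
import re


def longest_repeat_run(seq, unit):
    """Return the largest number of consecutive copies of unit found in seq."""
    pattern = re.compile("(?:" + re.escape(unit) + ")+")
    # Each match is one uninterrupted run; convert its length to a copy number.
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(seq)]
    return max(runs, default=0)


# Toy sequence (made up for illustration): 40 CAG copies with flanking bases.
toy_seq = "TTGA" + "CAG" * 40 + "CCGCCA"
copies = longest_repeat_run(toy_seq, "CAG")
print(copies, copies > 35)
```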

A Finite State Automaton

The FMR triplet repeat considered as a sequence of states

The grammar generates, the automaton parses

[Figure: state diagram of the FMR triplet repeat FSA, with states S, 1 to 8, and e, and transitions labelled with the symbols g, c, a, and t]

The FMR triplet regular grammar:

S → gW1
W1 → cW2
W2 → gW3
W3 → cW4
W4 → gW5
W5 → gW6
W6 → cW7 | aW4 | cW4
W7 → tW8
W8 → g

A Finite State Automaton

The FMR triplet repeat considered as a sequence of states

FSAs can be either deterministic, or non-deterministic. Because our FMR repeat FSA offers multiple paths out of state 6 for the same symbol (a c can lead to either state 7 or state 4), this is a non-deterministic FSA.

An automaton with only one possible sequence of states (the “state path”) is always deterministic.

[Figure: the FMR triplet repeat FSA state diagram]

Note however that there are no probabilities associated with the state transitions. This FSA is therefore NOT a probabilistic or stochastic model.

Finite State Automata

Moore vs. Mealy machines

The FSA shown above is a so-called “Mealy machine”: Mealy machines “accept” or “emit” upon transition to a new state

Later we will see and use examples of “Moore machines”: Moore machines instead “accept on state”

[Figure: the FMR triplet repeat FSA state diagram]

Moore and Mealy machines are always interconvertible.

Think about ways to redraw this FSA as a Moore Machine

Finite State Automata

The FMR regular grammar as a Python data structure

This is just one possible embodiment!



This dict has keys that are states, and values that are lists of “acceptance conditions”. The acceptance conditions are in the format of a tuple with the symbol that would lead to acceptance, and the state that should be “transitioned to”.

states = {
    "Start": [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5")],
    "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4"), ("C", "W4")],
    "W7": [("T", "W8")],
    "W8": [("G", "End")],
}
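With the grammar in this form, the non-determinism can be found mechanically: a state is ambiguous when the same symbol appears in more than one of its acceptance conditions. The following is a small sketch; the helper name `ambiguous_transitions` is hypothetical, and the dict is repeated only so the example runs on its own.

```python
from collections import Counter

states = {
    "Start": [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5")],
    "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4"), ("C", "W4")],
    "W7": [("T", "W8")],
    "W8": [("G", "End")],
}


def ambiguous_transitions(states):
    """Map each state to the symbols that appear in more than one transition."""
    ambiguous = {}
    for state, transitions in states.items():
        counts = Counter(symbol for symbol, _ in transitions)
        repeated = sorted(symbol for symbol, n in counts.items() if n > 1)
        if repeated:
            ambiguous[state] = repeated
    return ambiguous


# An empty result would mean the FSA is deterministic.
print(ambiguous_transitions(states))
```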

Reducing an FSA to Python code

The deterministic case

This is fairly straightforward:

Initialize cur_state to “Start”
Initialize cur_position in test sequence to zero
Initialize result_str to “”
Iterate over positions in sequence:
    Is the symbol at cur_position a valid production?
        No? Failure. Return False
        Yes! Accept symbol
            Concatenate symbol at cur_position to result_str
            Set cur_state to new_state
            Is cur_state now “End”?
                Yes! Success! Return result_str
Exhausted test sequence? Failure. Return False
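The steps above might be written in Python roughly as follows. This is a sketch under assumptions: the function name `run_deterministic` is hypothetical, and `deterministic_states` is a simplified variant of the FMR grammar (the second c production from W6 is dropped) used here only because the deterministic procedure needs at most one transition per symbol.

```python
# Hypothetical deterministic variant of the FMR grammar: W6 -> cW7 | aW4 only.
deterministic_states = {
    "Start": [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5")],
    "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4")],
    "W7": [("T", "W8")],
    "W8": [("G", "End")],
}


def run_deterministic(states, seq):
    """Return the accepted prefix of seq, or False if the FSA rejects it."""
    cur_state = "Start"
    result_str = ""
    for symbol in seq:
        # At most one transition per symbol, so a plain lookup table suffices.
        transitions = dict(states.get(cur_state, []))
        if symbol not in transitions:
            return False  # not a valid production
        result_str += symbol
        cur_state = transitions[symbol]
        if cur_state == "End":
            return result_str
    return False  # exhausted the test sequence before reaching "End"


print(run_deterministic(deterministic_states, "GCGCGGAGGCTGTT"))
```

As in the outline, the function stops as soon as “End” is reached, so any symbols after the accepted prefix are ignored.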

Reducing an FSA to Python code

The non-deterministic case is less straightforward!

We can no longer just iterate over the test sequence!

For each symbol in the test sequence, we might have to consider multiple valid productions (think loop, yes?)

We therefore may need to explore “branches” corresponding to these alternatives before we find one that is “correct”

Although not necessarily the most efficient way, recursion is an easy way to explore these branches:

If a possible production is valid, assume that it is correct by accepting the symbol and new state

Increment the position in the test sequence

“Success” or “Failure” can easily be propagated back up through the recursion by testing the result of the recursive call and returning the accepted sequence if the call succeeded

If execution gets past the recursive call test, the branch has failed: decrement the position in the test sequence, and go to the next possible production

If there are no more productions to consider, we’ve failed. Return False
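A recursive version along these lines could look like the sketch below. The function name `run_recursive` and its `position` parameter are assumptions made for illustration. Passing `position + 1` into the recursive call takes the place of the explicit increment and decrement, since each call keeps its own position.

```python
states = {
    "Start": [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5")],
    "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4"), ("C", "W4")],
    "W7": [("T", "W8")],
    "W8": [("G", "End")],
}


def run_recursive(states, seq, cur_state="Start", position=0):
    """Return the accepted prefix of seq, or False if every branch fails."""
    if cur_state == "End":
        return seq[:position]  # success: everything consumed so far
    if position >= len(seq):
        return False  # exhausted the test sequence
    for symbol, new_state in states.get(cur_state, []):
        if seq[position] == symbol:
            # Assume this production is correct and explore the branch.
            result = run_recursive(states, seq, new_state, position + 1)
            if result is not False:
                return result  # propagate success back up
            # Otherwise the branch failed; try the next production.
    return False  # no more productions to consider


print(run_recursive(states, "GCGCGGCGGCTG"))
```

For this input the first c production out of W6 (to W7) fails, and the search falls back to the second one (to W4) before eventually succeeding.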

Python focus: classes

Defining a class

Like functions, minimally, all we need is a statement block of Python code that we have given a name!

class I_dont_do_much(object):    # capital letters OK
    # any code you like!!
    pass

…but it won’t do anything interesting until we have specified some data and some methods!

Python classes are essentially user-defined data types

Python focus: classes

The __init__ method

This method corresponds to the “constructor” in other OOP languages

class FSA(object):

    def __init__(self, states):
        self.states = states

Methods defined as functions

First argument always self

Variables declared outside of a method have the same value in all instances of that class!

Variables prepended with self become instance variables, and are visible throughout the namespace of a class instance
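The difference between the two kinds of variable can be seen in a short sketch. The `alphabet` attribute and the two tiny state dicts are hypothetical additions made only for illustration.

```python
class FSA(object):
    # Class variable: declared outside any method, shared by every instance.
    alphabet = "ACGT"

    def __init__(self, states):
        # Instance variable: each instance keeps its own value.
        self.states = states


first = FSA({"Start": [("C", "End")]})
second = FSA({"Start": [("G", "End")]})

print(first.alphabet == second.alphabet)  # same class variable
print(first.states == second.states)      # different instance variables
```

One caveat worth knowing: assigning to `first.alphabet` would create a new instance variable on `first` that hides the class variable, rather than changing the value seen by other instances.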

Python focus: classes

User-defined methods

class FSA(object):

    def __init__(self, states):
        self.states = states
        # initialize some other stuff

    def test(self, seq, cur_state="Start"):
        some_var = 0
        # do some things
        return some_var    # return whatever the method computed

User methods defined as functions

First argument always self

Variable some_var is visible only within the user defined test method!

These are the interface with your class

Python focus: classes

Using classes

my_FSA = FSA(my_state_dict)
result = my_FSA.test("AGCTGGGGTTTAATT")

Instantiate a class by invoking its name, and providing the arguments the __init__ method expects

We can make as many instances of a class as we need!

Invoke class methods just by using the instance identifier in conjunction with the method name, using attribute notation!
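Putting the pieces together, a complete class might look like the sketch below. Several things here are assumptions rather than the course's own solution: the `test` method tracks the set of states reachable after each symbol instead of recursing (one of several possible ways to handle non-determinism), it takes optional `start` and `end` state names, and `cag_state_dict` is a made-up second grammar included only to show that several instances can coexist.

```python
class FSA(object):

    def __init__(self, states):
        self.states = states

    def test(self, seq, start="Start", end="End"):
        """Return the accepted prefix of seq, or False if none is accepted."""
        current = {start}
        for position, symbol in enumerate(seq):
            # Follow every transition that accepts this symbol.
            current = {new_state
                       for state in current
                       for accepted, new_state in self.states.get(state, [])
                       if accepted == symbol}
            if not current:
                return False  # no state accepts the symbol
            if end in current:
                return seq[:position + 1]
        return False  # exhausted the test sequence


my_state_dict = {
    "Start": [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5")],
    "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4"), ("C", "W4")],
    "W7": [("T", "W8")],
    "W8": [("G", "End")],
}

# Hypothetical second grammar that accepts a single CAG triplet.
cag_state_dict = {
    "Start": [("C", "W1")],
    "W1": [("A", "W2")],
    "W2": [("G", "End")],
}

my_FSA = FSA(my_state_dict)
cag_FSA = FSA(cag_state_dict)

print(my_FSA.test("AGCTGGGGTTTAATT"))
print(my_FSA.test("GCGCGGCGGCTGAA"))
print(cag_FSA.test("CAGCAG"))
```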