Recurrent Neural Network Architectures for Predicting the Fates of Proteins

John Hawkins

Overview of Talk


- Introduction, hypotheses and approach
- Biological Sequences: the domain
- Recurrent Neural Networks: the models
- Case Study: Subcellular Localisation
- Project Plan
- Expected Outcomes

Introduction


- Applied machine learning project
- Construct prediction services
- Understand the subtleties of biological problems


Project Hypotheses


1) The general bias of RNNs: Recurrent Neural Networks have a natural affinity for problems in the domain of pattern recognition in biological sequences.

2) The specific bias of RNNs: as a pattern becomes more ambiguous, the particular choice of recurrent architecture becomes more critical.

3) Modelling with RNNs: the techniques for analysing RNN behaviour will prove efficacious in extracting models of biological processes.

Project Approach


- Based on a number of case studies from the ‘Fates of Proteins’
- Investigation of sequence features
- Benchmarking simulations
- Analysis of performance
- Final analysis:
  - Bounds of applicability
  - Subtleties of the problem-architecture map
  - Extracting insights from mature classifiers



Cell Biology


A quick and dirty tour:
- Membranes
- Organelles
- Transport
- Nucleus
- DNA > RNA > Protein
- Localisation
- Modification



Fates of Proteins


- Subcellular localisation, e.g.
  - Mitochondria
  - Peroxisome
  - Nucleus
  - Lysosome
- Post-translational modification, e.g.
  - Disulphide bond formation
  - Glycosylation


Biological Sequences


- Many important biological molecules are polymers.
- Thus representable as a sequence of discrete symbols.
- Sequence M = [m_1, m_2, …, m_n] where:
  - DNA: m_i ∈ { A, T, G, C }
  - RNA: m_i ∈ { A, U, G, C }
  - Protein: m_i ∈ { G, A, V, L, I, P, S, C, T, M, D, E, H, K, R, N, Q, F, Y, W }
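Since each residue is one of a small set of symbols, sequences map naturally onto network inputs via one-hot encoding. A minimal sketch of this standard sparse encoding (illustrative, not necessarily the exact encoding used in the project):

```python
import numpy as np

# The 20-letter protein alphabet listed above.
AMINO_ACIDS = "GAVLIPSCTMDEHKRNQFYW"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Map a protein sequence to an (n, 20) matrix with one one-hot row per residue."""
    encoding = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, residue in enumerate(sequence):
        encoding[pos, AA_INDEX[residue]] = 1.0
    return encoding

print(one_hot_encode("MKTAYIAK").shape)  # (8, 20)
```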

Information Content


- How much information is in a linear sequence?
- Two elements are crucial to function:
  - Physical/chemical properties
  - Molecular shape
- Each residue has well-known properties.
- Denaturation experiments (Anfinsen, 1973): the sequence defines an arrangement of chemical properties, which in turn defines folding.
- Caveats: chaperones and prions.


Biological Patterns


- Motifs: the general term for sequence patterns
- Numerous definitions and visualisations:
  - PROSITE patterns: regular expressions (example below)
  - PROSITE profiles: probability matrices
  - LOGOs
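As an illustration of the regular-expression form, the PROSITE pattern for an N-glycosylation site, N-{P}-[ST]-{P} (PS00001), translates directly into a standard regex. A small sketch of scanning a sequence with it:

```python
import re

# PROSITE N-glycosylation pattern N-{P}-[ST]-{P} (PS00001):
# N, then any residue except P, then S or T, then any residue except P.
NGLYC = r"N[^P][ST][^P]"

def find_motif(sequence, pattern=NGLYC):
    """Return (position, 4-mer) pairs for every, possibly overlapping, hit."""
    return [(m.start(), m.group(1))
            for m in re.finditer(f"(?=({pattern}))", sequence)]

print(find_motif("MANASANFSD"))  # [(2, 'NASA'), (6, 'NFSD')]
```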


Machine Learning


- Function approximation
- Bias is generally unavoidable (Mitchell, 1980)
- Three sources of bias:
  - Input encoding
  - Function structure (architecture)
  - Parameter adjustment algorithm (learning)



Neural Networks


- Graphical model consisting of layers of nodes connected by weights
- Feed-forward neural networks:
  - Fixed input window
  - Signal propagates in a single pass through the layers
- Recurrent neural networks:
  - Signal processed in parts
  - Recurrent connections maintain a memory state
  - Output generated after processing the last piece of the input signal

Simple Neural Networks


FFNN: O_h = θ(W_1 · I_1 + W_2 · I_2 + b)

RNN:  O_h = θ(W_1 · I_2 + W_2 · θ(W_1 · I_1 + b) + b)
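A scalar sketch of the two expressions above, taking θ as tanh and arbitrary illustrative weights; the RNN sees the same two inputs but one at a time, carrying a hidden state between them:

```python
import numpy as np

theta = np.tanh  # the squashing function θ

def ffnn(i1, i2, w1=0.5, w2=-0.3, b=0.1):
    """Feed-forward: both inputs presented at once through a fixed window."""
    return theta(w1 * i1 + w2 * i2 + b)

def rnn(i1, i2, w1=0.5, w2=-0.3, b=0.1):
    """Recurrent: i1 is processed first; its hidden state is fed back
    through the recurrent weight w2 when i2 arrives."""
    h1 = theta(w1 * i1 + b)            # memory state after the first element
    return theta(w1 * i2 + w2 * h1 + b)

print(ffnn(1.0, 0.0), rnn(1.0, 0.0))
```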

Why use RNNs in Bioinf


- With small weight values, the state machines implemented by RNNs resemble Markovian models.
- (Tiňo, Čerňanský & Beňušková 2002; Hammer & Tiňo 2003; Tiňo & Hammer 2003)


Bias Simulations


- Cluster the hidden-node activations after sequence presentation (as sketched below).
- RNNs inherently group sequences containing motifs. (Bodén & Hawkins 2005)
- Furthermore, when deletions and small amounts of motif shift are added, the RNNs still maintain the grouping. (Hawkins & Bodén 2005)
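A hypothetical sketch of that simulation: collect each network's final hidden-state vector per sequence and cluster them. Here rnn_forward stands in for whatever forward pass is used; it is not a function from the cited papers:

```python
import numpy as np
from sklearn.cluster import KMeans

def final_hidden_states(rnn_forward, encoded_sequences):
    """Run the network over each sequence; keep the last hidden activation vector."""
    return np.stack([rnn_forward(seq)[-1] for seq in encoded_sequences])

def cluster_sequences(rnn_forward, encoded_sequences, n_clusters=2):
    """Group sequences by where they land in hidden-state space. If the
    architectural bias holds, motif-bearing sequences share a cluster
    even before any training."""
    states = final_hidden_states(rnn_forward, encoded_sequences)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(states)
```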

RNNs in Bioinformatics


- Protein secondary structure (Baldi, Brunak, Frasconi, Soda & Pollastri 1999)
- Similarity to grammatical inference
- Bi-directional RNN

RNNs in Bioinf contd


- Continuation of that research (Pollastri & McLysaght, 2004):
  - PSI-BLAST profiles for substitution information
  - Ensembles for protein tertiary structure
  - Delayed-time recurrent connections
- RNNs have access to structural information.
- Deliberate architectural variations can prove effective in tuning a machine.

Architectural Variations

Case Study: Subcellular Localisation

- Goal: to explore the applicability of RNNs to predicting subcellular localisation.
- Method: benchmark several RNNs against a FFNN using a pre-existing training set.


Method


- Training set from TargetP, detecting:
  - Signal peptides (endoplasmic reticulum)
  - Mitochondrial targeting peptides
  - Chloroplast targeting peptides
  - Other…
- N-terminal peptides of varying length.
- Trained to recognise whether each residue is part of the targeting peptide (a sketch follows below).
Results

Case Study: Conclusions

- RNNs demonstrate clear applicability.
- For well-defined patterns, two different RNNs seem to perform equivalently.
- On the ambiguous pattern, the architectures distinguish themselves, prompting the thesis that when patterns are ambiguous, different architectures are sensitive to different features within the sequence.

Project Plan


- Two further case studies:
  - Peroxisomal localisation
  - Nuclear localisation:
    - Nuclear import
    - Nuclear export
    - DNA binding and regulation
- Analysis:
  - Machine-problem mapping
  - Knowledge extraction

Project Methods


- A standard armada of RNNs to be deployed, plus custom models designed for problem specifics.
- Benchmarked against several standard machine learners:
  - Naïve Bayes
  - FFNNs and SVMs
- Evaluated in terms of overall performance, i.e. MCC, sensitivity and specificity (helper sketch below).
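These are the standard definitions; a small helper computing all three from binary confusion-matrix counts:

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Matthews correlation coefficient, sensitivity and specificity."""
    sens = tp / (tp + fn)                     # true-positive rate
    spec = tn / (tn + fp)                     # true-negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return mcc, sens, spec

print(confusion_metrics(tp=40, fp=10, tn=45, fn=5))
```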

Peroxisomal Localisation


- Predominantly controlled by a C-terminal sequence called the PTS1 signal.
- Roughly 12 residues long.
- Known dependencies between locations.

Nuclear Localisation


- Occurs through the Nuclear Pore Complex
- Using importins and exportins

Nuclear Import


- Large number of localisation signals.
- Usually within the mature protein and not removed after arrival.
- Patterns either too loose or too specific to provide good generalisation.
- Ideas:
  - Potential long-range dependencies
  - Boosting algorithm to generate an ensemble

Nuclear Export


- A number of diverse mechanisms.
- Nuclear Export Signal:
  - Hydrophobic residues + variable spacers
  - Within or C-terminal to an α-helix
  - Final leucine (or substitute) exposed for interaction
- Recognised on proteins for export.
- Present on adaptor molecules for RNA export.

Analysis


- FSA extraction:
  - Take a trained RNN and cluster the state-node activations.
  - Each cluster becomes a machine state.
  - Analyse the behaviour of the machine as sequences are presented to determine transition rules (see the sketch below).
- Potential benefits:
  - Separation of sub-automata indicates distinct mechanisms.
  - Rough estimate of computational complexity.
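A sketch of the extraction loop under those assumptions: quantise the recorded hidden-state trajectories with k-means, then tally which cluster the machine moves to on each input symbol:

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans

def extract_fsa(trajectories, sequences, n_states=8):
    """trajectories: one (T_i, H) array of hidden activations per sequence;
    sequences: the corresponding input strings. Returns transition counts
    keyed by (state, symbol) -> {next state: count}."""
    km = KMeans(n_clusters=n_states, n_init=10).fit(np.concatenate(trajectories))
    transitions = defaultdict(lambda: defaultdict(int))
    for traj, seq in zip(trajectories, sequences):
        states = km.predict(traj)  # one machine state per time step
        for t in range(1, len(states)):
            transitions[(states[t - 1], seq[t])][states[t]] += 1
    return transitions
```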

Analysis: Dynamical

- Treat each hidden-node activation as a dimension in a state space.
- Either apply to tasks with small networks (e.g. the peroxisomal case) or use PCA to reduce dimensionality.
- Visualise state-space trajectories and examine them for attractors and phase shifts (sketched below).
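A minimal sketch of that dynamical view, assuming the hidden activations were recorded at every time step of one sequence presentation:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_trajectory(hidden_states):
    """Project a (T, H) hidden-activation trajectory onto its first two
    principal components and draw the path through state space."""
    coords = PCA(n_components=2).fit_transform(hidden_states)
    plt.plot(coords[:, 0], coords[:, 1], marker="o")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title("Hidden-state trajectory")
    plt.show()
```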


Expected Outcomes


- Construction of prediction services: gathering empirical data for hypothesis 1.
- Investigation of alternative architectures: gathering empirical data for hypothesis 2.
- Post-training machine analysis: gathering empirical data for hypothesis 3.
- Use of the final predictor suite for bioinformatics, e.g. analysis of a network of nuclear proteins.


The End… ?