Neural Networks for Biological Sequence Classification





A. GIULIANI, F. RIGANTI FULGINEI, R. RUSSO, A. SALVINI

Dipartimento di Elettronica Applicata

Università di Roma TRE

Via della Vasca Navale 84


00146 Roma

ITALY





Abstract: A neural network (NN) approach to protein sequence classification is presented. The aim is to investigate the capability of simple feed-forward NNs with fast learning times (Fast Learning Neural Networks, FLNNs) to classify the protein secondary structure from knowledge of the amino acid sequence. A comparison between the proposed FLNN results and other NN approaches is shown.


Key-Words: Neural Networks, Amino Acid Sequence Processing, Protein Folding


1 Introduction

Approaches based on Neural Networks (NNs) dedicated to the identification of protein secondary structure have seen growing application in recent years [1]-[3]. In this paper the authors investigate the capability of simple Back-propagation Feed-Forward NNs (Fast Learning NNs: FLNNs) [3] to be used successfully in this task. In [3] this possibility was already discussed and interesting results were presented. However, even though the method proposed in [3] allowed an appreciable reduction in learning time, it did not achieve the same accuracy as more complex NN architectures, such as the Bidirectional Recurrent Neural Networks (BRNNs) proposed in [2]. On the other hand, the sophisticated BRNN architecture does not allow short learning times. Thus, in this paper the FLNNs are used as sub-networks of a more complex NN architecture with the aim of improving accuracy. Each sub-net is trained separately, and this approach preserves the fast learning.


2 Protein Structure and Folding

A protein is a macromolecular polymer of twenty different amino acids linked together, in various orders, by peptide bonds (a polypeptide). The protein structure can be described at four subsequent levels. The amino acid polypeptide sequence is called the "primary structure", whereas the "secondary structure" is the terminology associated with the spatial configuration of the amino acid chain. The further folding of the structure is called the "tertiary structure", while complexes of two or more polypeptide chains, held together by non-covalent forces but with a precise 3D configuration, are called the "quaternary structure" (see Fig. 1).

The complete identification of all protein structures is a quite complex task. It needs the use of different techniques, each dedicated to a particular structure level. In this paper the secondary structure identification level is investigated. The task is studied by means of NNs.

The secondary protein structure is characterized by two main variants, called "alpha helix" and "beta-sheet".



Fig.1: Protein structure schemes


Moreover, a further structure classification has to be introduced. In particular, eight sub-classes can be observed. Usually, the structure classification is performed by analyzing these eight sub-classes, and this is also done in the present approach. In Fig. 2, the eight sub-classes and their corresponding symbols are listed.



H     Alpha Helix
G     Helix
I     Helix
E     Extended Strand
B     Isolated Bridge
T     Hydrogen Bonded Turn
S     Bend
"."   Random

Fig. 2: Symbols and Classes of Proteins


3. Protein database

The protein sequences used both for training and validating the NNs have been downloaded from the "Protein Data Bank" (PDB) [4].

The PDB archive consists of text files referring to 1400 proteins. Each protein can have a different length: from 15 to 500 amino acids. A data file example is shown in Fig. 3.



Fig. 3: Example of PDB text data files containing
protein information.


As can be seen, the protein primary sequence (henceforth simply called the sequence) is made up of letters, where each letter corresponds to one type of amino acid. Then we have to manage two categories of symbols: Class Symbols (Fig. 2) and Amino Acid Symbols (Fig. 4). Obviously, in the PDB other protein characteristics are indicated; for our purposes only the secondary structure symbol is considered.


Amino Acid Name    Code
Alanine            A
Arginine           R
Asparagine         N
Aspartic Acid      D
Cysteine           C
Phenylalanine      F
Glycine            G
Glutamic Acid      E
Glutamine          Q
Isoleucine         I
Histidine          H
Leucine            L
Lysine             K
Methionine         M
Proline            P
Serine             S
Tyrosine           Y
Threonine          T
Tryptophan         W
Valine             V

Fig. 4: Amino Acid codes


4. NN Input and Output patterns

The PDB text files have been translated into a numerical form to make the NN input-output data file. In particular, the text-to-number translations have been performed by a binary orthogonal code (i.e. a 20-bit code has been associated with each of the letters listed in Fig. 4). Similarly, the same orthogonal coding has been adopted for the letter-to-number translation of the class symbols (Fig. 2).

The binary string associated with the protein structure under analysis is the input of the implemented NNs. Since the string length is not constant across all strings in the database, and the generic string length is quite large, it is not convenient to present the entire amino acid sequence to the NNs as input. A strategy to reduce the string length, and to homogenize the length over all strings, has therefore been implemented.
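The 20-bit orthogonal (one-hot) coding described above can be sketched as follows; the alphabet order here simply follows the code column of Fig. 4 and is an assumption, not taken from the authors' implementation:

```python
# One-letter amino acid codes, in the order they appear in Fig. 4 (assumed order).
AMINO_ACIDS = "ARNDCFGEQIHLKMPSYTWV"

def one_hot(residue: str) -> list[int]:
    """Return the 20-bit orthogonal code for one amino acid letter:
    a vector of zeros with a single 1 at that letter's position."""
    code = [0] * len(AMINO_ACIDS)
    code[AMINO_ACIDS.index(residue)] = 1
    return code
```

The same scheme, with an 8-bit code, would apply to the class symbols of Fig. 2.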

Thus, only the protein sequence and its secondary class have been extracted from the PDB and translated into binary codes. The primary structure has been processed by a window approach.

The reading-window over the protein structure has a length imposed equal to 9 (or 19) residues. The amino acid which occupies the central position of the window is taken as the NN input (see Fig. 5). The NN output uses the secondary structure code.



Fig.5: Secondary structure of central amino acid.
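The reading-window scan can be sketched as below; the handling of the terminal positions, where no full window fits, is an assumption (they are simply skipped here), since the paper does not specify it:

```python
def windows(sequence: str, width: int = 9):
    """Slide a fixed-width reading window over the sequence; each window's
    central residue is the one whose secondary-structure class is predicted.
    Positions too close to the ends to fill a whole window are skipped."""
    half = width // 2
    for i in range(half, len(sequence) - half):
        yield sequence[i - half : i + half + 1]
```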


5. Neural Strategy and Architecture

As previously said, the strategy is to train simple Back-propagation NNs as sub-nets of the full NN architecture. Thus, the binary structure code is used as the input of 28 sub-nets. The number 28 follows from the strategy adopted (see Fig. 7): each sub-net has an output of three neurons, one corresponding to one secondary class, the second corresponding to a second, different class, while the third indicates that the input does not correspond to either of the two classes indicated by the first two output neurons. With eight classes there is one sub-net per unordered pair of classes, i.e. C(8,2) = 28 sub-nets. For example, let us assume that the sequence of Fig. 5 is used to interrogate the sub-net whose output neurons are trained to predict the classes "E", "B" and "Not E or B". In this case the neuron which manages "E" should respond with "0". The same should be done by the second neuron, which manages "B", while the third output should switch to "1". Then the "EB" sub-net output should have the code "001". It is clear that all the sub-nets managing "H" should have their "H" output neuron activated to the value "1".
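The enumeration of the 28 pairwise sub-nets and their three-neuron target encoding can be sketched as follows (a minimal illustration of the scheme above, not the authors' code):

```python
from itertools import combinations

# The eight secondary-structure sub-classes of Fig. 2.
CLASSES = ["H", "G", "I", "E", "B", "T", "S", "."]

# One sub-net per unordered pair of classes: C(8, 2) = 28.
SUBNETS = list(combinations(CLASSES, 2))

def target(pair, true_class):
    """Three-neuron target for one sub-net: (is first class, is second
    class, is neither). E.g. the "EB" sub-net answers (0, 0, 1) when the
    true class is "H"."""
    a, b = pair
    return (int(true_class == a),
            int(true_class == b),
            int(true_class not in pair))
```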

Each sub-net has been implemented as a three-layer feed-forward NN (see Fig. 6). The scheme of the 28 implemented sub-nets is shown in Fig. 7.



Fig. 6: Sub-net Architecture

Fig. 7: List of the 28 implemented sub-nets.



Fig. 8: Flow-chart of the implemented algorithm


Among all the PDB sequences, 90% have been used for NN training and 10% for testing. The training stop criterion is: no improvement in the error value for 20,000 epochs of training.

The final algorithm flow-chart is shown in Fig. 8. Each of the 28 sub-nets produces an answer on the class of the input sequence. All the answers must be post-processed to take into account the possibility of non-coherent responses (i.e. some sub-nets assign the sequence to a class which is excluded by other sub-nets). The idea is to post-process the final output of all the sub-nets by means of a further back-propagation NN. This last NN has an input equal to the output of the 28 sub-nets. Its output is the actual class. Thus, this last NN has been trained on the results of the first NNs. In Fig. 9, the scheme of the procedure implemented to build the final NN training pattern is shown, referring to a case of prediction of class "H". The final NN is then placed in cascade with the parallel of the first 28, and produces the final results.
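The assembly of the post-processing NN's input vector can be sketched as below, assuming (as the text suggests) that the 28 three-neuron outputs are simply concatenated into a single 28 × 3 = 84-value vector; the exact layout used by the authors is not specified:

```python
def cascade_input(subnet_outputs):
    """Concatenate the 28 three-neuron sub-net outputs into the single
    84-value input vector of the final post-processing NN."""
    assert len(subnet_outputs) == 28, "one triple per sub-net expected"
    return [value for triple in subnet_outputs for value in triple]
```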




Fig. 9: Building of the last NN training pattern from the 28 sub-nets' global output.



6. Validation

The performance of the NN built as the parallel of the 28 sub-nets is estimated by means of the parameter called Q8. This value is equal to the number of correctly predicted classes divided by the total number of classes. The proposed FLNN results are shown in Fig. 10 in comparison with the results obtained in [2] (BRNN) and in [3] (parallel architecture):




Fig. 10: Comparison of different approach performances. The proposed approach is FLNNs, the approach in [2] is indicated by BRNN, and the approach proposed in [3] is indicated by Parallel architecture.
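The Q8 measure defined above amounts to a per-residue accuracy over the eight-state assignment, and can be computed as in this small sketch:

```python
def q8(predicted, actual):
    """Q8: fraction of residues whose eight-state secondary-structure
    class is predicted correctly (correct predictions / total residues)."""
    assert len(predicted) == len(actual)
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

For example, a prediction matching 3 residues out of 4 gives Q8 = 0.75.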



Conclusions

A neural network (NN) approach for protein sequence classification has been presented. From the obtained results it appears that fast-learning networks (Fast Learning Neural Networks, FLNNs) are an interesting method to classify the protein secondary structure from knowledge of the amino acid sequence. From the comparison between the proposed FLNN results and other NN approaches, it seems possible to say that the resulting accuracy is quite good, even though the implemented NNs are simple and the learning time is therefore not excessive.



Bibliography

[1] A. Hatzigeorgiou, M. Reczko, N. Mache (1996): "Functional Site Prediction on the DNA Sequence by Artificial Neural Networks", IEEE International Joint Symposia on Intelligence and Systems (IJSIS '96), Rockville (MD), pp. 12-17.

[2] G. Pollastri, D. Przybylski, B. Rost, P. Baldi (2002): "Improving the Prediction of Protein Secondary Structure in Three and Eight Classes Using Recurrent Neural Networks and Profiles", Proteins: Structure, Function and Genetics, No. 47, pp. 228-235.

[3] T. Masetti, A. Salvini, M. Carli, A. Neri (2003): "A fast learning neural network approach to protein secondary structure identification", Proceedings of the 3rd International Symposium on Image and Signal Processing and Analysis (ISPA03), Rome, Italy, September 18-20, 2003, Vol. 2, pp. 696-699.

[4] Research Collaboratory for Structural Bioinformatics, Protein Data Bank, www.rcsb.org/pdb/