Online Supporting Information B: Generation of the Complexity Measure Factors


A protein sequence is formed from the 20 native amino acids, whose single-letter codes are A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y. It is very difficult to find the characteristic pattern of a sequence, particularly when the sequence is very long. To cope with this situation, we resort to images derived from the amino acid sequence by means of the space-time evolution of a cellular automaton [1], as briefly described below.


Suppose a protein P consists of N amino acids; i.e.,

$$P = R_1 R_2 R_3 \cdots R_N \tag{1}$$

where $R_1$ represents the first residue of the protein, $R_2$ the second residue, and so forth. To transform a protein sequence from a character code to a numerical one, we adopted the code-converting relation given in Table 1, which better reflects the chemical and physical properties of an amino acid, as well as its structure and degeneracy, as elaborated in Xiao et al. [2]. If each of the constituent amino acids in the protein P is coded in binary according to Table 1, the protein sequence is transformed into a series of 5N digital elements, each of which is either 0 or 1. For example, the sequence “PLQHRS…” is accordingly transformed to “000010001100100001010011001001…”. Each of these elements can be treated as a pixel, with “0” for “white” and “1” for “black”; then, by following the space-time evolution procedure described in Xiao et al. [2], the protein P corresponds to a cellular automaton image.
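
The conversion from residues to binary digits can be illustrated with a minimal Python sketch. The dictionary reproduces Table 1; the names `BINARY_CODE` and `encode_protein` are illustrative choices and do not come from the original work.

```python
# Table 1 of this document: the five-digit binary code of each amino acid.
BINARY_CODE = {
    "P": "00001", "L": "00011", "Q": "00100", "H": "00101", "R": "00110",
    "S": "01001", "F": "01011", "Y": "01100", "W": "01110", "C": "01111",
    "T": "10000", "I": "10010", "M": "10011", "K": "10100", "N": "10101",
    "A": "11001", "V": "11010", "D": "11100", "E": "11101", "G": "11110",
}


def encode_protein(sequence: str) -> str:
    """Concatenate the Table 1 codes of all residues (5N digits for N residues)."""
    return "".join(BINARY_CODE[residue] for residue in sequence.upper())


print(encode_protein("PLQHRS"))  # 000010001100100001010011001001
```

The printed string is the example given in the text for the fragment “PLQHRS”. A residue letter outside the 20 native amino acids raises a `KeyError` in this sketch; how such residues should be handled is not specified in the source.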


Cellular automata [1] are discrete dynamical systems whose behavior is completely specified in terms of a local relation. A cellular automaton can be thought of as a stylized universe consisting of a regular grid of cells, each of which is in one of a finite number of possible states and is updated synchronously in discrete time steps according to a local, identical interaction rule [1]. The concept of cellular automata has attracted a great deal of interest in recent years because many extremely complex patterns can be evolved by just repeatedly applying some very simple rules. This is particularly useful for emulating complicated physical, social, and biological systems.


In this study, the practical approach used to generate the cellular automaton image for a given protein sequence can be described as follows.


According to Table 1, the residue chain of Eq.1 is initially converted to a sequence of 5N digits (or grids); i.e.,

$$g_1(t)\, g_2(t) \cdots g_N(t) \cdots g_{5N}(t), \quad (t = 0) \tag{2}$$

where $g_i(t) = 0$ or $1$ $(i = 1, 2, \ldots, 5N)$, as defined in [3]. Suppose the time for each updated step is consecutively expressed by $t = 0, 1, 2, \ldots$; we have

$$
\begin{array}{c}
g_1(0)\, g_2(0) \cdots g_N(0) \cdots g_{5N}(0) \\
g_1(1)\, g_2(1) \cdots g_N(1) \cdots g_{5N}(1) \\
g_1(2)\, g_2(2) \cdots g_N(2) \cdots g_{5N}(2) \\
\vdots \\
g_1(t)\, g_2(t) \cdots g_N(t) \cdots g_{5N}(t)
\end{array}
\tag{3}
$$

where

$$
g_i(t+1) =
\begin{cases}
0, & \text{if } g_{i-1}(t) = 0,\ g_i(t) = 0,\ g_{i+1}(t) = 0 \\
0, & \text{if } g_{i-1}(t) = 0,\ g_i(t) = 0,\ g_{i+1}(t) = 1 \\
1, & \text{if } g_{i-1}(t) = 0,\ g_i(t) = 1,\ g_{i+1}(t) = 0 \\
0, & \text{if } g_{i-1}(t) = 0,\ g_i(t) = 1,\ g_{i+1}(t) = 1 \\
1, & \text{if } g_{i-1}(t) = 1,\ g_i(t) = 0,\ g_{i+1}(t) = 0 \\
0, & \text{if } g_{i-1}(t) = 1,\ g_i(t) = 0,\ g_{i+1}(t) = 1 \\
1, & \text{if } g_{i-1}(t) = 1,\ g_i(t) = 1,\ g_{i+1}(t) = 0 \\
0, & \text{if } g_{i-1}(t) = 1,\ g_i(t) = 1,\ g_{i+1}(t) = 1
\end{cases}
\quad (t = 0, 1, \ldots)
\tag{4}
$$

with the spatially periodic boundary conditions; i.e.,

$$g_0(t) = g_{5N}(t) \quad \text{and} \quad g_{5N+1}(t) = g_1(t) \tag{5}$$
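
One way to picture the update rule together with its periodic boundary is the short Python sketch below. The lookup table transcribes the eight cases of Eq.4 in the order (left neighbor, cell, right neighbor); the names `RULE_EQ4`, `step`, and `evolve` are illustrative assumptions, not identifiers from the original study.

```python
from typing import List

# Eq.4 as a lookup table: (g_{i-1}(t), g_i(t), g_{i+1}(t)) -> g_i(t+1).
RULE_EQ4 = {
    (0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 1, (0, 1, 1): 0,
    (1, 0, 0): 1, (1, 0, 1): 0, (1, 1, 0): 1, (1, 1, 1): 0,
}


def step(row: List[int]) -> List[int]:
    """Apply Eq.4 once; the ends wrap around as required by Eq.5."""
    n = len(row)
    # row[i - 1] with i == 0 is the last cell, and (i + 1) % n wraps to the first.
    return [RULE_EQ4[(row[i - 1], row[i], row[(i + 1) % n])] for i in range(n)]


def evolve(bits: str, steps: int) -> List[List[int]]:
    """Return the rows g(0), g(1), ..., g(steps) laid out as in Eq.3."""
    rows = [[int(b) for b in bits]]
    for _ in range(steps):
        rows.append(step(rows[-1]))
    return rows


for row in evolve("01000", 2):
    print("".join(map(str, row)))
```

For the five-digit toy string `01000`, the three printed rows are `01000`, `01100`, and `00110`. In practice the input would be the 5N-digit string obtained from Table 1.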


The rule of Eq.4 is actually the 84th space-time evolution rule for cellular automaton models [1]. In the cellular automaton image for a protein sequence, the constituent residues are coupled with each other as an entity. While the image is being produced, the state of the individual cell corresponding to a given amino acid residue is linked to the residues both preceding and following it. Because of this, the cellular automaton image can help reveal some implicit sequence features that are usually difficult to visualize. It was found that, of the 256 evolving rules [1], the 84th space-time evolution rule of Eq.4 is the best in this regard. In other words, by means of Eq.4, we can convert the one-dimensional protein sequence to a two-dimensional matrix or graph, from which some important features can be effectively extracted for incorporation into $P_{\mathrm{PseAA}}^{\mathrm{SqCo}}$ of Eq.5.
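
The rule number can be checked directly from the eight cases of Eq.4. The sketch below assumes Wolfram's usual numbering for elementary cellular automata, in which the neighborhood (left, center, right) read as a three-digit binary number selects one bit of the rule number; the names `EQ4_OUTPUT` and `wolfram_number` are illustrative.

```python
# Outputs of Eq.4, keyed by the neighborhood (g_{i-1}, g_i, g_{i+1}).
EQ4_OUTPUT = {
    (0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 1, (0, 1, 1): 0,
    (1, 0, 0): 1, (1, 0, 1): 0, (1, 1, 0): 1, (1, 1, 1): 0,
}


def wolfram_number(table: dict) -> int:
    """Neighborhood (l, c, r) contributes its output at bit position 4l + 2c + r."""
    return sum(out << (4 * l + 2 * c + r) for (l, c, r), out in table.items())


print(wolfram_number(EQ4_OUTPUT))  # 84
```

Only the neighborhoods 010, 100, and 110 produce a 1, which gives 4 + 16 + 64 = 84. Equivalently, a cell becomes black exactly when its right neighbor is white and either the cell itself or its left neighbor is black.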

In this study, the i-th grid at time t is filled with white if $g_i(t) = 0$ and with black if $g_i(t) = 1$. Accordingly, each g-string in Eq.3 corresponds to a narrow ribbon of mixed white and black cells. Scanning these ribbons successively onto a screen or sheet generates a 2D (2-dimensional) black-and-white image. The image thus evolved is called the cellular automaton image.
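
As a rough illustration of how the ribbons stack into an image, the sketch below draws each g-string as one line of text, using `.` for white and `#` for black. The function name `render` and the two display characters are arbitrary choices made here; the three demonstration rows are g(0), g(1), and g(2) obtained by applying Eq.4 with the boundary of Eq.5 to the toy string 01000.

```python
from typing import List


def render(rows: List[List[int]], white: str = ".", black: str = "#") -> str:
    """Draw each g-string as one text line: 0 -> white, 1 -> black."""
    return "\n".join(
        "".join(black if cell else white for cell in row) for row in rows
    )


# Rows g(0), g(1), g(2) for the five-digit toy string 01000.
demo = [[0, 1, 0, 0, 0], [0, 1, 1, 0, 0], [0, 0, 1, 1, 0]]
print(render(demo))
```

Each printed line is one ribbon, so the full output is a small text version of the black-and-white image; a real protein would give rows of 5N cells.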


A protein sequence is actually a symbolic sequence, for which the complexity measure factor can be used to reflect its sequence feature or pattern [4]. Among the known measures of complexity, the Lempel-Ziv (LZ) complexity reflects the order that is retained in the sequence, and hence was adopted in this study. Below, let us first introduce some basic definitions concerning LZ complexity.

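$$
\begin{array}{ll}
A & \text{Alphabet of symbols (for a binary sequence we have two symbols, namely 0 and 1)} \\
S & \text{Finite-length sequence formed by } A\text{, e.g., } S = a_1 a_2 a_3 \cdots a_N \\
v(S) & \text{Vocabulary of sequence } S\text{; it is the set of all substrings of } S \\
S\pi & \text{Sequence } S \text{ with its last element removed (the number of elements of } S \text{ minus one);} \\
 & \text{i.e., if } A = \{0, 1\} \text{ and } S = 010 \text{, then } v(S) = \{0, 1, 01, 10, 010\} \text{ and } v(S\pi) = \{0, 1, 01\}
\end{array}
\tag{6}
$$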
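
The vocabulary example can be reproduced with a few lines of Python. This is only a sketch of the definition: `vocabulary` is a hypothetical helper name, and removing the last symbol of the sequence is written with the slice `s[:-1]`.

```python
def vocabulary(s: str) -> set:
    """v(S): the set of all non-empty contiguous substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def by_length(words: set) -> list:
    """Sort substrings by length, then alphabetically, for display."""
    return sorted(words, key=lambda w: (len(w), w))


s = "010"
print(by_length(vocabulary(s)))       # ['0', '1', '01', '10', '010']
print(by_length(vocabulary(s[:-1])))  # ['0', '1', '01']
```

The two printed lists match the vocabularies stated above for the sequence 010 and for the same sequence with its last symbol removed.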


The LZ complexity of a sequence can be measured by the minimal number of steps required for its synthesis in a certain process. At each step only two operations are allowed in the process: either generating an additional symbol, which ensures the uniqueness of each component $S[i_{k-1}+1 : i_k]$, or copying the longest fragment from the part of the sequence already synthesized.

A substring of S is expressed by

$$S[i:j] = a_i a_{i+1} a_{i+2} \cdots a_j \quad (1 \le i < j \le N) \tag{7}$$

The complexity measure factor, $C_{LS}(S)$, of a non-empty sequence S synthesized according to the following procedure is defined by the minimal number of steps, i.e., the number of components m in

$$H(S) = S[1:i_1]\, S[i_1+1:i_2] \cdots S[i_{k-1}+1:i_k] \cdots S[i_{m-1}+1:N] \tag{8}$$


Let us assume that $S = a_1 a_2 a_3 \cdots a_N$ has been reconstructed by the program up to the digit $a_r$, and that $a_r$ has been newly inserted. The string up to $a_r$ is denoted by $S[1:r]\cdot$, where the dot indicates that $a_r$ is newly inserted, in order to check whether the rest of the string, $S[r+1:N]$, can be reconstructed by simple copying. First, suppose $q = a_{r+1}$ and see whether q is reproducible from $S[1:r]\,q\pi$ (that is, from the string $S[1:r]\,q$ with its last symbol removed). If the answer is “no”, q could not be obtained by the copying operation, so we insert $q = a_{r+1}$ into the sequence, followed by a dot. If the answer is “yes”, then no new symbol is needed and we proceed with $q = a_{r+1} a_{r+2}$ and repeat the same procedure. The LZ complexity is the number of dots (plus one if the string is not terminated by a dot). For example, for the string S = 0001101001000101, the LZ schema of synthesis generates the following components $H(S)$ and the corresponding complexity $C_{LS}(S)$:

$$H(S) = 0 \cdot 001 \cdot 10 \cdot 100 \cdot 1000 \cdot 101, \qquad C_{LS}(S) = 6 \tag{9}$$

implying that the complexity measure factor for the string S = 0001101001000101 is 6.
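
The synthesis procedure can be sketched in Python as follows. This is a minimal reading of the copy-or-insert scheme described above, not the authors' implementation; the function names `lz_components` and `lz_complexity` are illustrative. The copy test checks whether the candidate fragment q occurs in the string built so far plus q with its last symbol removed.

```python
from typing import List


def lz_components(s: str) -> List[str]:
    """Split s into the components H(S) produced by the synthesis."""
    components = []
    start = 0  # index of the first symbol not yet covered by a component
    n = len(s)
    while start < n:
        length = 1
        # Grow q = s[start:start + length] while it can still be copied, i.e.
        # while q occurs in s[:start] + q with the last symbol of q removed.
        while (start + length <= n
               and s[start:start + length] in s[:start + length - 1]):
            length += 1
        # Either q could not be copied (a new symbol is inserted and a dot
        # placed), or the string ended during a copy (the "plus one" case).
        end = min(start + length, n)
        components.append(s[start:end])
        start = end
    return components


def lz_complexity(s: str) -> int:
    """Number of components in H(S)."""
    return len(lz_components(s))


example = "0001101001000101"
print(lz_components(example))  # ['0', '001', '10', '100', '1000', '101']
print(lz_complexity(example))  # 6
```

The output reproduces the components and the value 6 given in Eq.9. The last component, 101, is a pure copy that ends with the string, which is the “plus one” case in the counting rule.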


We can derive M complexity factors if the image has M rows. These complexity factors can all be used to serve as pseudo amino acid components. However, it was observed that the highest success rates were obtained when the first 10 complexity factors were used.
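
Putting the pieces together, the sketch below computes one complexity factor per image row. It rests on two assumptions that the text does not spell out: that the factor of a row is the LZ complexity of that row's binary string, and that the “first 10” rows are counted from the initial row at t = 0. The names `rule_step`, `lz_complexity`, and `complexity_factors` are illustrative.

```python
from typing import List


def rule_step(row: List[int]) -> List[int]:
    """One update of Eq.4 with the periodic boundary of Eq.5.

    Eq.4 yields 1 exactly when the right neighbor is 0 and the left
    neighbor or the cell itself is 1.
    """
    n = len(row)
    return [(row[i - 1] | row[i]) & (1 - row[(i + 1) % n]) for i in range(n)]


def lz_complexity(s: str) -> int:
    """Number of components in the LZ synthesis of s."""
    count, start, n = 0, 0, len(s)
    while start < n:
        length = 1
        while (start + length <= n
               and s[start:start + length] in s[:start + length - 1]):
            length += 1
        start = min(start + length, n)
        count += 1
    return count


def complexity_factors(bits: str, rows: int = 10) -> List[int]:
    """LZ complexity of each of the first `rows` image rows (assumed reading)."""
    row = [int(b) for b in bits]
    factors = []
    for _ in range(rows):
        factors.append(lz_complexity("".join(map(str, row))))
        row = rule_step(row)
    return factors


# Binary string of the fragment "PLQHRS" according to Table 1.
print(complexity_factors("000010001100100001010011001001"))
```

The result is a list of 10 integers, one per row, which under the stated assumptions would play the role of the 10 complexity factors used as pseudo amino acid components.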


Table 1. Binary coding for the 20 different amino acids

Type        Code
Character   P      L      Q      H      R      S      F      Y      W      C
Decimal     1      3      4      5      6      9      11     12     14     15
Binary      00001  00011  00100  00101  00110  01001  01011  01100  01110  01111

Character   T      I      M      K      N      A      V      D      E      G
Decimal     16     18     19     20     21     25     26     28     29     30
Binary      10000  10010  10011  10100  10101  11001  11010  11100  11101  11110



References

1. Wolfram S (1984) Cellular automata as models of complexity. Nature 311: 419-424.

2. Xiao X, Shao S, Ding Y, Huang Z, Huang Y, Chou KC (2005) Using complexity measure factor to predict protein subcellular location. Amino Acids 28: 57-61. doi:10.1007/s00726-004-0148-7

3. Chou KC, Shen HB (2009) Review: recent advances in developing web-servers for predicting protein attributes. Natural Science 2: 63-92 (openly accessible at http://www.scirp.org/journal/NS/). doi:10.4236/ns.2009.12011

4. Xiao X, Shao SH, Huang ZD, Chou KC (2006) Using pseudo amino acid composition to predict protein structural classes: approached with complexity measure factor. J Comput Chem 27: 478-482. doi:10.1002/jcc.20354