The protein words

Supplementary material for the paper "Application of latent semantic analysis to protein remote homology detection". This material mainly describes the three basic building blocks of protein sequences.


The protein words

Many biological problems being addressed are analogous to the problem of processing an unknown natural language when the only available knowledge is the set of its most basic units. For natural languages, these most basic units can be either alphabet letters (e.g., Hebrew, Greek, English, and so forth) or syllabic signs (e.g., Chinese, Japanese, and so forth). In protein sequences, the most basic units are the 20 natural amino acids.

Unlike natural languages, the exact words of proteins are not clearly defined. In this study, several basic building blocks of protein sequences are investigated for protein remote homology detection.

1 N-grams

In the absence of any language information, one can regard a word as a group of consecutive basic units that usually occur together. Leslie et al. (Leslie, et al., 2002) used the k-spectrum kernel for protein remote homology detection and obtained satisfactory results. The k-spectrum is the set of all possible subsequences of amino acids of a fixed length k. The k-spectrums are also referred to as N-grams in natural language processing. The N-gram model has been extensively used in natural language processing for tasks such as word segmentation, part-of-speech tagging, speech recognition and word-sense disambiguation (Jurafsky and Martin, 2000). Cheng et al. (Cheng, et al., 2005) also used N-grams combined with simple text classification methods for protein classification and achieved state-of-the-art performance on GPCR families.

In this study, the N-grams of amino acids are used as one kind of protein word for protein remote homology detection. The total number of distinct words in protein sequences grows exponentially with N, so the value of N is taken as 3, which is sufficient to detect remote homology between protein sequences.
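
As an illustration of this word type, the following is a minimal sketch of extracting overlapping 3-grams from a protein sequence; the example sequence is hypothetical and not taken from the experimental data.

from collections import Counter

def extract_ngrams(sequence, n=3):
    """Return a Counter mapping each overlapping N-gram to its number of occurrences."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # hypothetical protein sequence
trigrams = extract_ngrams(seq, n=3)
print(len(trigrams), trigrams.most_common(3))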

2 Patterns

The N-grams are combinations of the basic amino acids, which cannot accurately describe the composition of proteins with low sequence identities. Especially in the twilight zone, N-grams cannot recognize the homology of proteins, so it is necessary to use a more general kind of word. In evolution, amino acid substitution is a common phenomenon. Thus the protein words can be composed of amino acids and the wildcard character ('.'), which can stand for any of the twenty amino acids. Here are some examples: E.L.K, NGF, KI...L, Q..Y.A..L. The protein words used here are the same as the patterns in biological sequence analysis. Let Σ be an alphabet; a pattern (Pisanti, et al., 2002) is a string in Σ ∪ Σ(Σ ∪ {'.'})*Σ, that is, a string over the alphabet Σ ∪ {'.'} that starts and ends with a solid character (not a wildcard). A pattern P is also called an <L, W> pattern if any substring of P with length W or more contains at least L solid characters.
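
The following is a minimal sketch of how such wildcard patterns can be matched against sequences and how the <L, W> property can be checked; the pattern and sequence in the usage example are hypothetical, and '.' is assumed to be the only wildcard, as above.

import re

def pattern_occurs(pattern, sequence):
    """True if the wildcard pattern occurs somewhere in the sequence.
    In regular expressions '.' already means any single character,
    which coincides with the wildcard semantics used here."""
    return re.search(pattern, sequence) is not None

def is_lw_pattern(pattern, l, w):
    """Check the <L, W> property defined above: every substring of length W
    (and hence every longer substring) contains at least L solid characters."""
    windows = [pattern[i:i + w] for i in range(len(pattern) - w + 1)]
    return all(sum(c != '.' for c in win) >= l for win in windows)

print(pattern_occurs("Q..Y.A..L", "MKQTEYRAVPLGHK"))   # True: QTEYRAVPL fits the pattern
print(is_lw_pattern("KI...L", l=3, w=6))               # True: the only length-6 window has 3 solid characters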

Using patterns as protein secondary structure words, Dong et al. (Dong, et al., 2005) predicted the secondary structures of proteins with satisfactory results.

The TEIRESIAS algorithm (Rigoutsos and Floratos, 1998) is used to extract patterns from protein sequences. TEIRESIAS is a deterministic algorithm that allows one to carry out pattern discovery in biological sequences and has been used to annotate proteins (Rigoutsos, et al., 2002). The algorithm carries out this task without the need for a data model and generates all patterns appearing K or more times in the input, with the additional guarantee of pattern maximality in both composition and length.

In this study, the TEIRESIAS algorithm is executed on all the training sets of protein sequences and a total of 71009 patterns are extracted. The produced patterns contain a great deal of redundant information, and many machine learning methods cannot perform well in such a high-dimensional feature space. It is therefore highly desirable to reduce the native feature space by removing non-informative or redundant patterns.
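
Before any selection, each sequence has to be represented over the extracted patterns. The following is a minimal sketch of building a binary pattern-occurrence matrix; the sequences and the small pattern list are hypothetical, and the pattern discovery step itself (TEIRESIAS) is not reproduced here.

import re
import numpy as np

def pattern_matrix(sequences, patterns):
    """Binary matrix with one row per sequence and one column per pattern;
    an entry is 1 if the wildcard pattern occurs anywhere in the sequence."""
    mat = np.zeros((len(sequences), len(patterns)), dtype=int)
    for i, seq in enumerate(sequences):
        for j, pat in enumerate(patterns):
            if re.search(pat, seq):
                mat[i, j] = 1
    return mat

seqs = ["MKQTEYRAVPLGHK", "MNGFKLLSA", "QWERTYAAAL"]   # hypothetical sequences
pats = ["E.L.K", "NGF", "KI...L", "Q..Y.A..L"]         # the example patterns quoted above
print(pattern_matrix(seqs, pats))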

A large number of feature selection methods have been developed for this task, including document frequency, information gain, mutual information, chi-square and term strength. The chi-square algorithm is selected in this study because it is one of the most effective feature selection methods in document classification (Yang and Pedersen, 1997). After chi-square selection, 8000 patterns are retained as the characteristic words.
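
A minimal sketch of chi-square feature selection on a pattern-occurrence matrix follows. scikit-learn's chi2 and SelectKBest are used purely for illustration (the original study does not state which implementation was used), and the toy data and the value k=50 are hypothetical stand-ins for the real 71009-to-8000 reduction.

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 500))   # 200 sequences x 500 candidate patterns (toy data)
y = rng.integers(0, 2, size=200)          # toy labels, e.g. inside/outside a superfamily

selector = SelectKBest(chi2, k=50)        # keep the 50 highest-scoring patterns
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)                    # (200, 50)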

3 Motifs

By focusing on limited, highly conserved regions of proteins, motifs can often reveal important clues to a protein's role even if it is not globally similar to any known protein (Nevill-Manning, et al., 1998). The motifs for most catalytic sites and binding sites are conserved over much wider taxonomic distances and evolutionary time than are the sequences of the proteins themselves. Thus, motifs often represent functionally important regions such as catalytic sites, binding sites, protein-protein interaction sites, and structural motifs.

Several computer algorithms exist for automatically constructing a characteristic set of sequence motifs from a set of biological sequences. In this research, the MEME/MAST system version 3.0 (Bailey and Elkan, 1994; Bailey and Gribskov, 1998) is used to discover motifs and search databases. MEME represents motifs as position-dependent letter-probability matrices that describe the probability of each possible letter at each position in the pattern. MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences and description for each motif. Since motifs only exist in related protein sequences, the training sequences of the same superfamily are used to generate motifs. In total, 3231 motifs are extracted.
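
To make the matrix representation concrete, the following is a minimal sketch of scoring a sequence against a position-dependent letter-probability matrix with a simple log-odds score; the tiny three-column matrix, the uniform background and the example sequence are all hypothetical, and MEME/MAST's own scoring and p-value machinery are not reproduced here.

import math

BACKGROUND = 1.0 / 20   # uniform amino acid background, an assumption for this sketch

# One dict per motif column, mapping residue -> probability; unlisted residues
# receive a small pseudo-probability. This toy matrix is hypothetical.
motif = [
    {"A": 0.7, "G": 0.2},
    {"K": 0.5, "R": 0.4},
    {"L": 0.6, "I": 0.3},
]

def column_prob(column, residue, pseudo=0.01):
    return column.get(residue, pseudo)

def log_odds_score(window, motif):
    """Sum over positions of log(motif probability / background probability)."""
    return sum(math.log(column_prob(col, res) / BACKGROUND)
               for col, res in zip(motif, window))

def best_window(sequence, motif):
    """Best-scoring window of motif length anywhere in the sequence."""
    w = len(motif)
    return max((log_odds_score(sequence[i:i + w], motif), sequence[i:i + w])
               for i in range(len(sequence) - w + 1))

print(best_window("MAKLAGRLPQ", motif))   # hypothetical sequence; the best window is "AKL"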



REFERENCES

Bailey, T. L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, Menlo Park, California

Bailey, T. L. and Gribskov, M. (1998) Combining evidence using p-values: application to sequence homology searches. Bioinformatics 14, 48-54

Cheng, B. Y., Carbonell, J. G. and Klein-Seetharaman, J. (2005) Protein classification based on text document classification techniques. Proteins 58, 955-970

Jurafsky, D. and Martin, J. H. (2000) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall

Dong, Q. W., Wang, X. L., Lin, L., Guan, Y. and Zhao, J. (2005) A Seqlet-based Maximum Entropy Markov Approach for Protein Secondary Structure Prediction. Science in China Ser. C Life Sciences 48, 87-96

Leslie, C., Eskin, E. and Noble, W. S. (2002) The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, pp. 564-575

Nevill-Manning, C. G., Wu, T. D. and Brutlag, D. L. (1998) Highly specific protein sequence motifs for genome analysis. Proc Natl Acad Sci U S A 95, 5865-5871

Pisanti, N., Crochemore, M., Grossi, R. and Sagot, M. F. (2002) A Basis for Repeated Motifs in Pattern Discovery and Text Mining. IGM 2002-10, July

Rigoutsos, I. and Floratos, A. (1998) Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics 14, 55-67

Rigoutsos, I., Huynh, T., Floratos, A., Parida, L. and Platt, D. (2002) Dictionary-driven protein annotation. Nucleic Acids Research 30, 3901-3916

Yang, Y. and Pedersen, J. O. (1997) A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning, pp. 412-420, San Francisco, USA