BIOINFORMATICS
Vol. 20 Suppl. 1 2004, pages i6–i14
DOI: 10.1093/bioinformatics/bth916

Application of a new probabilistic model for recognizing complex patterns in glycans

Kiyoko F. Aoki, Nobuhisa Ueda, Atsuko Yamaguchi, Minoru Kanehisa, Tatsuya Akutsu and Hiroshi Mamitsuka

Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, 611-0011 Kyoto, Japan

Received on January 15, 2004; accepted on March 1, 2004
ABSTRACT
Motivation: The study of carbohydrate sugar chains, or glycans, has been one of slow progress mainly due to the difficulty in establishing standard methods for analyzing their structures and biosynthesis. Glycans are generally tree structures that are more complex than linear DNA or protein sequences, and evidence shows that patterns in glycans may be present that spread across siblings and into further regions that are not limited by the edges in the actual tree structure itself. Current models were not able to capture such patterns.
Results: We have applied a new probabilistic model, called probabilistic sibling-dependent tree Markov model (PSTMM), which is able to inherently capture such complex patterns of glycans. Not only is the ability to capture such patterns important in itself, but this also implies that PSTMM is capable of performing multiple tree structure alignments efficiently. We show through experimentation on actual glycan data that this new model is extremely useful for gaining insight into the hidden, complex patterns of glycans, which are so crucial for the development and functioning of higher level organisms. Furthermore, we also show that this model can be additionally utilized as an innovative approach to multiple tree alignment, which has not been applied to glycan chains before. This extension of the usage of PSTMM may be a major step forward not only for the structural analysis of glycans; it may consequently also prove useful for discovering clues to their function.
Contact: mami@kuicr.kyoto-u.ac.jp
1 INTRODUCTION
Glycobiology is the study of the structure, biosynthesis and biology of carbohydrate sugar chains, or glycans, the majority of which are located on the outer surface of cellular macromolecules, and which assist in crucial activities for the development and function of complex, multicellular organisms. The understanding of glycans, however, is far from complete, especially in obtaining a good grasp of their structures, let alone understanding their function completely. The basic unit of glycans is the monosaccharide, analogous to amino acids for proteins, or nucleotides for DNA. But because monosaccharides contain on average 6–8 hydroxyl groups to which other monosaccharides can bind, glycans can become complex, branched, tree structures, in contrast to the linear structures of proteins or DNA. Compounded with such basic structural complexities, the biosynthesis of glycans also confounds biologists in that it is not just a direct process of adding monosaccharides to an existing chain, in contrast to how tRNAs add amino acids to proteins (Varki et al., 1999).

To whom correspondence should be addressed.
As a first step into the understanding of glycans, we can examine their existing tree structures, which are used for recognition by various agents such as pathogens as well as by proteins that enable the development and functioning of the organism. For example, it has been shown in the literature that lectins recognize glycans via certain monosaccharide configurations (patterns) on the outermost portion of their tree structures; sialic acids as ligands have been shown to be recognized by proteins of animal, plant and microbial origin, or more specifically, sialic acid binding lectins. Furthermore, it seems that recognition can be affected by specific structural variations and modifications of certain monosaccharides, their linkage to the underlying sugar chain, and the structure of these chains (Varki, 1997). Not only would an understanding of structural patterns in glycans be used to further support studies in sugar recognition, but such work would be helpful in unraveling their biological functions (Bertozzi and Kiessling, 2001; Drickamer, 1988). Thus our work to find patterns in known glycan structures aims not only to reveal possible motifs in glycans, but also to lead to conjectures about their functions through such approaches as multiple tree alignment.
There are many areas of research on tree structures in bioinformatics, such as phylogenetic tree estimation (Csürös, 2002; Sjölander, 1998), RNA secondary structure analysis [including similarity analysis (Höchsmann et al., 2003; Jannson and Lingas, 2001), alignment (Aoki et al., 2003; Sakakibara, 2003) and prediction (Knudsen and Hein, 1999)], and orthology analysis (Arvestad et al., 2003) that often use models in theoretical computer science (Jiang et al., 1995) or probabilistic models such as (hidden) Markov models (Durbin et al., 1998; Krogh et al., 1994), Bayesian networks (Friedman, 1998) and stochastic context-free grammars (SCFGs) (Sakakibara et al., 1994). However, these models only concern themselves with the direct connections between the nodes in the trees (i.e. the edges). This is, of course, reasonable when presented with a tree structure, as it is assumed that there is no relationship between nodes that are not connected directly by an edge. Unfortunately, in the case of glycans, this is a major drawback in that it is necessary to capture such dependencies that are not bounded simply by the edges of the tree structure. Such dependencies are seemingly inherent in glycan structures, not to mention many other biological structures. Such implied patterns hidden behind a tree structure have not been explored before.

Bioinformatics 20(Suppl. 1) © Oxford University Press 2004; all rights reserved.
Therefore, we approach the problem of capturing such complex patterns that are inherently hidden across the breadth of trees using a new probabilistic model called probabilistic sibling-dependent tree Markov model (PSTMM) (Ueda et al., 2004), which incorporates a hidden dependency pattern that is able to capture relationships that may exist across siblings or even further. PSTMM is a significant extension of HMMs (Durbin et al., 1998) for trees that integrates sibling relationships, so that just as longer range dependencies across a sequence can be captured by linear HMMs, more complex relationships across and over siblings are able to be captured by this new model. Furthermore, because of the probabilistic properties of PSTMM, not only are the patterns inherent in glycans captured, but standard methods for recognizing the most common patterns can enable us to specify exactly which patterns are most prevalent and assist us in predicting plausible patterns for recognition in new structures. We note that Ueda et al. (2004) only present the PSTMM model and its learning algorithm. However, we show that PSTMM is also capable of performing multiple tree alignments as an intermediate step in this process of complex pattern recognition. PSTMM can approach multiple tree alignments from the perspective of using a probabilistic model for aligning tree structures, much as HMMs are used for multiple sequence alignment. This is a significant extension to the PSTMM model in that multiple tree alignments have not been previously applied to glycan structures in analyzing their functionality, and thus we were able to expand the range of capabilities of PSTMM.
The only other comparable probabilistic models are the probabilistic tree Markov model (PTMM) (Diligenti et al., 2003), which, as we mentioned earlier for other tree models, only considers dependencies along the edges of the tree, and hierarchical HMMs (Fine et al., 1998), which appear structurally similar but focus on sequence analysis. Although PTMMs could also be used for multiple tree alignment, we will show through our experimental results that the more complex patterns captured by PSTMM allow it to find better multiple tree alignments.
Table 1. Common monosaccharides and their abbreviations

Sugar                    Abbr.
Glucose                  Glc
Galactose                Gal
Mannose                  Man
Sialic acid              NeuNAc
N-acetylglucosamine      GlcNAc
N-acetylgalactosamine    GalNAc
Fucose                   Fuc
Xylose                   Xyl
Glucuronic acid          GlcA
Iduronic acid            IdA
We have rigorously verified the performance of PSTMM by experimenting on actual glycan data and compared it with other models that accounted for only parent–child dependencies. We show that PSTMM statistically outperformed these other models by a significant margin over the same datasets and that the computational complexity is within the practical limits as demonstrated by other models in similar application domains. Specifically, it is the same as that for both learning and parsing SCFGs; both are $O(n^3)$, where in SCFGs $n$ would be the sequence length and in PSTMM $n$ would be the number of nodes in the input trees. Thus we can show that while PSTMM is capable of capturing richer patterns of information, the cost for this gain in performance is negligible.
After training on the glycan data, we calculated the most likely state transition path, from which we were able to perform multiple tree alignments very easily, and we were able to find interesting patterns in the data. Indeed, these promising results reveal an exciting new path for glycobiology research.
2 BACKGROUND
2.1 Glycan structures
The structures of glycans are complex in that they are not simple like sequences; they are branched tree structures with one of two types of linkages (e.g. see Fig. 7). The basic component is the monosaccharide unit, or sugar, of which a handful are most common in higher animal oligosaccharides (Table 1). Each sugar is linked to one or more other sugars by various types of linkages (i.e. α or β) and between different hydroxyl groups on each sugar.
There are several classes of glycans that are known, based on certain basic patterns mainly in the core structure (a subportion of the tree starting from the root). Two major classes are N- and O-glycans. Table 2 lists the classes as available in the KEGG GLYCAN database (Kanehisa et al., 2004).
Table 2. Glycan classes from KEGG GLYCAN, sorted by total number of structures. (A) is the total number of glycans, (B) the total number of sibling pairs and (C) the average number of nodes within each glycan

Class                (A)     (B)     (C)
N-Glycan             2068    1865    11.1
Glycoside            931     380     4.1
Sphingolipid         914     421     8.1
O-Glycan             746     515     6.6
Glycosaminoglycan    596     431     8.3
LPS                  447     213     7.3
Polysaccharide       344     153     8.6
Neoglycoconjugate    111     57      6.4
GPI                  75      47      7.3
Oligosaccharide      34      2       4.5
Glycerolipid         30      2       3.5

2.2 Terminology and notation
A tree is an acyclic connected graph, and we refer to a vertex of a tree as a node. A rooted tree is a tree having a starting node called the root, from which the rest of the tree extends. Any node on a unique path from the root to a node is called an ancestor of the node, and if node x is an ancestor of node y, y is a descendant of x. The nodes $x_i$ that are only one edge away from a node y are called the children of y, and if node x is a child of node y, y is a parent of x. Nodes x and y are siblings if they have the same parent, and a node with no children is a leaf. A subtree of tree T is a tree whose nodes and edges are subsets of those of T, an ordered tree is a rooted tree in which the children of each node are ordered, and a labeled tree is a tree in which a label is attached to each node. All trees in this paper are ordered, labeled and rooted trees.
We use the following notation in this paper. Let $T = \{T_1, \ldots, T_{|T|}\}$ be a set of labeled ordered trees, where $T_u = (V_u, E_u)$, $V_u$ $(= \{x^u_1, \ldots, x^u_{|V_u|}\})$ is a set of nodes and $E_u$ is a set of edges. $x^u_1$ is the root of tree $T_u$, $|V| = \max_u |V_u|$, $t_u(i)$ is a subtree of $T_u$ having $x^u_i$ as the root of $t_u(i)$, $L_u \subseteq \{1, \ldots, |L_u|\}$ is a set of indices of leaves in $t_u$, and $C_u(p) \subseteq \{1, \ldots, |C_u(p)|\}$ is a set of indices of children of $x^u_p$ in $T_u$. Let $|C| = \max_{u,p} |C_u(p)|$. If we let $x^u_{\mathrm{eldest}}(p)$ and $x^u_{\mathrm{youngest}}(p)$ be the eldest and youngest child of node p, respectively, then $Y_u(p) = C_u(p) - x^u_{\mathrm{eldest}}(p)$, i.e. the children of p other than the eldest. Each node $x^u_j$ has label $o^u_j \in \Sigma$, where $\Sigma = \{\sigma_1, \ldots, \sigma_{|\Sigma|}\}$ is the set of labels (i.e. the alphabet) applied to the nodes. For simplicity, we will often use j for node $x^u_j$ if understood from the context, and for node j, we will use i, k and p to refer to the immediately elder sibling, the immediately younger sibling and the parent, respectively.
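The terminology above maps naturally onto a small data structure. The following is a minimal Python sketch, not taken from the paper: the class name `Node`, its methods and the helper `level_order` are hypothetical, and the monosaccharide labels in the example are purely illustrative. It shows how the parent p, the immediately elder sibling i and the immediately younger sibling k of a node j can be looked up in a rooted, ordered, labeled tree.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """One node of a rooted, ordered, labeled tree (hypothetical helper class)."""
    label: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)  # ordered, eldest first

    def add_child(self, label: str) -> "Node":
        """Append a new youngest child with the given label and return it."""
        child = Node(label, parent=self)
        self.children.append(child)
        return child

    def _position(self) -> int:
        # Index of this node among its parent's children (identity comparison).
        return next(i for i, c in enumerate(self.parent.children) if c is self)

    def elder_sibling(self) -> Optional["Node"]:
        """Immediately elder sibling (node i in the text), or None for the eldest."""
        if self.parent is None:
            return None
        pos = self._position()
        return self.parent.children[pos - 1] if pos > 0 else None

    def younger_sibling(self) -> Optional["Node"]:
        """Immediately younger sibling (node k in the text), or None for the youngest."""
        if self.parent is None:
            return None
        pos = self._position()
        siblings = self.parent.children
        return siblings[pos + 1] if pos + 1 < len(siblings) else None


def level_order(root: Node) -> List[Node]:
    """Nodes from the root to the leaves, siblings kept in their given order."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(node.children)
    return order


# Illustrative fragment only: a root with two ordered children and one grandchild.
root = Node("Man")
first = root.add_child("GlcNAc")
second = root.add_child("GlcNAc")
grandchild = first.add_child("Gal")
```

Identity comparison (`is`) is used deliberately, since two sibling nodes may carry the same label.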
3 PROBABILISTIC SIBLING-DEPENDENT TREE MARKOV MODEL
The PSTMM by Ueda et al. (2004) incorporates new dependencies between the siblings of the tree in addition to parent–child dependencies. Figure 2 illustrates these dependencies embedded in the tree fragment of Figure 1, where the state of a node depends on the states of its parent and immediately elder sibling, if one exists.

Fig. 1. For node $x_j$ in a labeled ordered tree, the immediately elder and younger siblings are nodes $x_i$ and $x_k$, respectively, and the parent node is node $x_p$.

Fig. 2. Dependencies in PSTMM for Figure 1. A white node is a state, and a shaded node is a label.
3.1 Parameters and probabilistic structure
PSTMM has three probability parameters, π, a and b. The initial state probability $\pi[s_l]$ $(= P(z^u_1 = s_l; \theta))$ is the probability that the state ($z^u_1$) of root node $x^u_1$ is $s_l$, the state transition probability $a[\{s_q, s_l\}, s_m]$ $(= P(z^u_j = s_m \mid z^u_p = s_q, z^u_i = s_l; \theta))$ is the conditional probability that the state of a node $x^u_j$ is $s_m$ given that the states of its parent ($x^u_p$) and immediately elder sibling ($x^u_i$) are $s_q$ and $s_l$, respectively, and the label output probability $b[s_l, \sigma_h]$ $(= P(o^u_j = \sigma_h \mid z^u_j = s_l; \theta))$ is the conditional probability that the output label of node $x^u_j$ is $\sigma_h$ given that the state of $x^u_j$ is $s_l$. For simplicity, we hereafter use $\pi[l]$, $a[\{q, l\}, m]$ and $b[l, \sigma_h]$, instead of $\pi[s_l]$, $a[\{s_q, s_l\}, s_m]$ and $b[s_l, \sigma_h]$, respectively. Note that $\sum_l \pi[l] = 1$, $\sum_m a[\{q, l\}, m] = 1$ and $\sum_h b[l, \sigma_h] = 1$.
To describe the probabilistic structure of PSTMM, upward, forward and backward probabilities are defined. The upward probability $U_u(s_q, x^u_p)$ is the probability that all labels of subtree $t_u(p)$ are generated and that the state of node p is $s_q$. The forward probability $F_u(s_q, s_l, x^u_j)$ is the probability that for node j, all labels of the subtrees of each of the elder siblings are generated, the state of node j is $s_l$, and the state of parent p is $s_q$. The backward probability $B_u(s_q, s_m, x^u_j)$ is the probability that for node j, all labels of the subtrees of each of the younger siblings and node j are generated, $s_m$ is the state of j, and $s_q$ is the state of its parent.^1

$$F_u(q,l,j) = \begin{cases} a[\{q,-\},l] & \text{if } x^u_j = x^u_{\mathrm{eldest}}(p), \\ \sum_m F_u(q,m,i)\, U_u(m,i)\, a[\{q,m\},l] & \text{otherwise.} \end{cases}$$

$$B_u(q,m,j) = \begin{cases} U_u(m,j) & \text{if } x^u_j = x^u_{\mathrm{youngest}}(p), \\ U_u(m,j) \sum_l a[\{q,m\},l]\, B_u(q,l,k) & \text{otherwise.} \end{cases}$$

$$U_u(q,p) = \begin{cases} b[q,o^u_p] & \text{if } C_u(p) = \emptyset, \\ b[q,o^u_p] \sum_m F_u(q,m,j)\, B_u(q,m,j) \quad (\text{s.t. } j \in C_u(p)) & \text{otherwise.} \end{cases}$$

Here $x^u_{\mathrm{eldest}}(p)$ and $x^u_{\mathrm{youngest}}(p)$ denote the eldest and youngest child of node p, and i and k denote the immediately elder and younger siblings of j.

Fig. 3. Updating (left) $F_u(q,l,j)$, (center) $B_u(q,m,j)$ and (right) $U_u(q,p)$. The cyan node is j [or p for $U_u(q,p)$]. The blue areas are used for updating.

Figure 3 illustrates this procedure for updating each of these three probabilities. The likelihood for a given tree is obtained by using $U_u(l,1)$ (U at the root of the tree), and the likelihood for a given set of trees is computed as a product of the likelihood for each tree in the set:

$$L(T;\theta) = \prod_u \sum_l \pi[l]\, U_u(l,1).$$
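The recursions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a tree is encoded (hypothetically) as a list `labels` of integer label indices and a list `children` of ordered child-index lists with node 0 as the root; `a0[q][l]` stands for $a[\{q,-\},l]$, `a[q][m][l]` for $a[\{q,m\},l]$ and `b[q][h]` for $b[q,\sigma_h]$; the parameter values are made up. Only the backward pass over siblings is needed for the likelihood, because at the eldest child the forward probability reduces to $a[\{q,-\},m]$.

```python
from itertools import product


def upward(labels, children, p, a0, a, b):
    """Return the list U(q, p) over all states q for node p."""
    n_states = len(a0)
    kids = children[p]
    if not kids:  # leaf: U(q, p) = b[q, o_p]
        return [b[q][labels[p]] for q in range(n_states)]
    U_kids = [upward(labels, children, j, a0, a, b) for j in kids]
    U_p = []
    for q in range(n_states):
        # Backward pass over the siblings, from the youngest to the eldest.
        B = list(U_kids[-1])  # B(q, m, youngest) = U(m, youngest)
        for U_j in reversed(U_kids[:-1]):
            B = [U_j[m] * sum(a[q][m][l] * B[l] for l in range(n_states))
                 for m in range(n_states)]
        # At the eldest child, F(q, m, eldest) = a[{q, -}, m].
        U_p.append(b[q][labels[p]] * sum(a0[q][m] * B[m] for m in range(n_states)))
    return U_p


def likelihood(labels, children, pi, a0, a, b):
    """Likelihood of one tree: sum over l of pi[l] * U(l, root)."""
    U_root = upward(labels, children, 0, a0, a, b)
    return sum(pi[l] * U_root[l] for l in range(len(pi)))


# Made-up parameters for two states and two labels (every row sums to 1).
pi = [0.6, 0.4]
a0 = [[0.7, 0.3], [0.2, 0.8]]
a = [[[0.5, 0.5], [0.1, 0.9]], [[0.3, 0.7], [0.6, 0.4]]]
b = [[0.9, 0.1], [0.2, 0.8]]

# A root with two ordered children; summing over all labelings gives a sanity check.
children = [[1, 2], [], []]
total = sum(likelihood(list(lab), children, pi, a0, a, b)
            for lab in product(range(2), repeat=3))
single = likelihood([0], [[]], pi, a0, a, b)
```

With normalized parameters, the likelihoods of all possible labelings of a fixed tree shape should sum to one, which is what `total` checks; `single` is the one-node case $\sum_l \pi[l]\, b[l, \sigma]$.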
3.2 Estimating the parameters
The maximum likelihood is calculated using the EM algorithm to estimate the probability parameters of PSTMM. To describe the EM procedure for PSTMM, in addition to F, B and U, a downward probability $D_u(s_l, x^u_j)$ is defined, which is the probability that all labels of a tree except for those of subtree $t_u(j)$ are generated and that the state of node $x^u_j$ is $s_l$. The downward probability^2 at a node can be computed using the downward probability at its parent and the forward and backward probabilities at its siblings (Fig. 4)^3:

$$D_u(l,j) = \begin{cases} \pi[l] & \text{if } j \text{ is the root}, \\ \sum_q D_u(q,p)\, b[q,o^u_p]\, F_u(q,l,j) & \text{else if } j = x^u_{\mathrm{youngest}}(p), \\ \sum_q D_u(q,p)\, b[q,o^u_p]\, F_u(q,l,j) \sum_m a[\{q,l\},m]\, B_u(q,m,k) & \text{otherwise.} \end{cases}$$

^1 We hereafter use $F_u(q,l,j)$, $B_u(q,m,j)$ and $U_u(q,p)$ for $F_u(s_q, s_l, x^u_j)$, $B_u(s_q, s_m, x^u_j)$ and $U_u(s_q, x^u_p)$, respectively.
^2 We hereafter use $D_u(l,j)$, instead of $D_u(s_l, x^u_j)$.
Figure 5 is the pseudocode for calculating the four types of probabilities, F, B, U and D, based on a bottom–up and top–down dynamic programming method. A level-order numbering of the nodes of the given tree from the root to the leaves^4 is first performed. U, B and F are then calculated in reverse order, from the leaves to the root, followed by the calculation of D, in order, from the root to the leaves.
Using these four probabilities, expectation values are computed in order to update our probability parameters. We illustrate with $\gamma_u(\{s_q, s_m\}, s_l)$, which is the expectation value that the state of a node is $s_l$ and that the states of its parent and immediately elder sibling are $s_q$ and $s_m$, respectively. This expectation value^5 can be calculated using the following EM algorithm, which is repeated until a certain convergence criterion is satisfied. In the E-step, defining, for a node j with parent p and immediately elder sibling i, $H_u(q,m,l,j) = F_u(q,m,i)\, U_u(m,i)\, a[\{q,m\},l]\, B_u(q,l,j)$, γ is calculated as follows^6:

$$\gamma_u(\{q,m\},l) = \frac{\sum_{p:\, C_u(p) \neq \emptyset} D_u(q,p)\, b[q,o^u_p] \sum_{j \in Y_u(p)} H_u(q,m,l,j)}{L(T_u)}.$$

In the M-step, using this γ, we update $\hat{a}$ as follows [see Ueda et al. (2004) for the detailed calculation and update procedure of all the probability parameters]:

$$\hat{a}[\{s_q, s_m\}, s_l] = \frac{\sum_u \gamma_u(\{s_q, s_m\}, s_l)}{\sum_u \sum_{l'} \gamma_u(\{s_q, s_m\}, s_{l'})}.$$

Fig. 4. The cyan node is j, and the blue and red areas are used to update $D_u(l,j)$.

^3 Note that for any node i, the likelihood for a tree can be calculated using the upward and downward probabilities at i as $L(T_u;\theta) = \sum_l U_u(l,i)\, D_u(l,i)$.
^4 We assume that the ordering used here is set according to the ordering of the siblings.
^5 We hereafter use $\gamma_u(\{q,m\},l)$ for $\gamma_u(\{s_q, s_m\}, s_l)$ for simplicity.
^6 For completeness, we note that the summations over the children of p, $C_u(p)$, are for when $C_u(p) \neq \emptyset$.
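The M-step update for the transition probabilities is a normalization of the expected counts over the target state. A minimal Python sketch follows, assuming the E-step results are already available as nested lists `gammas[u][q][m][l]` holding $\gamma_u(\{q,m\},l)$; this layout, the function name `m_step_transition` and the uniform fallback for state pairs with zero expected count are assumptions made here for illustration and are not specified in the text. The eldest-child case $a[\{q,-\},l]$ would be handled analogously and is omitted.

```python
def m_step_transition(gammas):
    """Re-estimate a[{q, m}, l] from per-tree expectation values.

    gammas[u][q][m][l] holds gamma_u({q, m}, l) for tree u (assumed layout).
    """
    n_states = len(gammas[0])
    a_hat = []
    for q in range(n_states):
        row_q = []
        for m in range(n_states):
            # Numerator: expected counts summed over all trees, per target state l.
            totals = [sum(g[q][m][l] for g in gammas) for l in range(n_states)]
            norm = sum(totals)  # denominator: the same sum taken over all l'
            if norm > 0:
                row_q.append([t / norm for t in totals])
            else:
                # Assumption: fall back to a uniform row when (q, m) was never observed.
                row_q.append([1.0 / n_states] * n_states)
        a_hat.append(row_q)
    return a_hat


# Made-up expectation values for two trees and two states.
g1 = [[[1.0, 3.0], [0.0, 0.0]], [[2.0, 2.0], [1.0, 0.0]]]
g2 = [[[1.0, 1.0], [0.0, 0.0]], [[0.0, 4.0], [0.0, 1.0]]]
a_hat = m_step_transition([g1, g2])
```

Each row `a_hat[q][m]` sums to one by construction, matching the constraint $\sum_l a[\{q,m\},l] = 1$.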
3.3 Retrieving the most likely state transition
Once the likelihood is maximized, we can find the most likely state path that actually maximized the probabilities. This is done by taking the calculations of B and U and modifying them to calculate $\phi_B(q,m,j)$ and $\phi_U(q,p)$. $\phi_B(q,m,j)$ represents the maximum probability that for a state transition from node j, the state of j is $s_m$, the state of its parent is $s_q$, and all the labels of the subtrees for each of the younger siblings and node j are generated. Accordingly, $\phi_U(q,p)$ is the maximum probability that for a state transition from node p, all labels of subtree $t_u(p)$ are generated and the state of node p is $s_q$. The computation of these two variables is performed in the following manner:

$$\phi_U(q,p) = \begin{cases} b[q,o^u_p] & \text{if } C_u(p) = \emptyset, \\ \max_m b[q,o^u_p]\, a[\{q,-\},m]\, \phi_B(q,m,i) \quad (\text{s.t. } x^u_i = x^u_{\mathrm{eldest}}(p)) & \text{otherwise.} \end{cases}$$

$$\phi_B(q,m,j) = \begin{cases} \phi_U(m,j) & \text{if } x^u_j = x^u_{\mathrm{youngest}}(p), \\ \max_l \phi_U(m,j)\, a[\{q,m\},l]\, \phi_B(q,l,k) & \text{otherwise.} \end{cases}$$

To retrieve the actual states that produced these values, we use

$$\tau_U(q,p) = \begin{cases} 0 & \text{if } C_u(p) = \emptyset, \\ \arg\max_m b[q,o^u_p]\, a[\{q,-\},m]\, \phi_B(q,m,i) \quad (\text{s.t. } x^u_i = x^u_{\mathrm{eldest}}(p)) & \text{otherwise.} \end{cases}$$

$$\tau_B(q,m,j) = \begin{cases} 0 & \text{if } x^u_j = x^u_{\mathrm{youngest}}(p), \\ \arg\max_l \phi_U(m,j)\, a[\{q,m\},l]\, \phi_B(q,l,k) & \text{otherwise.} \end{cases}$$

Here $x^u_{\mathrm{eldest}}(p)$ and $x^u_{\mathrm{youngest}}(p)$ denote the eldest and youngest child of node p. With these formulas, we can then calculate $P^*$ and $q^*_j$. $P^*$ $[= \max_l \pi[l]\, \phi_U(l,1)]$ is the probability that all labels are output along the best state transition, and $q^*_j$ is the best state transition from j, with $q^*_1 = \arg\max_l \pi[l]\, \phi_U(l,1)$ at the root. Thus, starting from the root, we can trace through the tree to retrieve the maximized state transitions for $x_j$ $(j = 2, \ldots, |V_u|)$: $q^*_j = \tau_U(q^*_p, p)$ for $x_j = x^u_{\mathrm{eldest}}(p)$, and $q^*_j = \tau_B(q^*_p, q^*_i, i)$ otherwise. The resulting set of states $\{q^*_1, \ldots, q^*_{|V_u|}\}$ gives us the most likely state transition path for the given tree $T_u$.

Fig. 5. Pseudocode for calculating F, B, U and D in PSTMM
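The max-product recursion and traceback described in this section can be sketched as follows. This is an illustrative Python sketch under assumptions, not the authors' code: the tree encoding (`labels`, `children`, node 0 as root), the split of the transition table into `a0[q][m]` for $a[\{q,-\},m]$ and `a[q][m][l]` for $a[\{q,m\},l]$, all function names, and the parameter values are hypothetical. The helper `joint_probability` is included only so that the result can be compared with exhaustive enumeration on a tiny tree.

```python
from itertools import product


def joint_probability(states, labels, children, pi, a0, a, b):
    """Probability of one full state assignment together with the observed labels."""
    prob = pi[states[0]]
    for p, kids in enumerate(children):
        prob *= b[states[p]][labels[p]]
        for idx, j in enumerate(kids):
            if idx == 0:
                prob *= a0[states[p]][states[j]]
            else:
                prob *= a[states[p]][states[kids[idx - 1]]][states[j]]
    return prob


def most_likely_states(labels, children, pi, a0, a, b):
    """Return (best probability, state per node) using phi_U, phi_B, tau_U, tau_B."""
    n_states = len(pi)
    phi_U = [None] * len(labels)
    tau_U = {}  # tau_U[p][q]: best state of the eldest child of p
    tau_B = {}  # tau_B[j][q][m]: best state of the immediately younger sibling of j

    def up(p):
        kids = children[p]
        for j in kids:
            up(j)
        phi_U[p] = [b[q][labels[p]] for q in range(n_states)]
        if not kids:
            return
        tau_U[p] = [0] * n_states
        for j in kids[:-1]:
            tau_B[j] = [[0] * n_states for _ in range(n_states)]
        for q in range(n_states):
            phi_B = list(phi_U[kids[-1]])  # youngest child: phi_B = phi_U
            for j in reversed(kids[:-1]):
                updated = []
                for m in range(n_states):
                    scores = [a[q][m][l] * phi_B[l] for l in range(n_states)]
                    best_l = max(range(n_states), key=scores.__getitem__)
                    tau_B[j][q][m] = best_l
                    updated.append(phi_U[j][m] * scores[best_l])
                phi_B = updated
            scores = [a0[q][m] * phi_B[m] for m in range(n_states)]
            best_m = max(range(n_states), key=scores.__getitem__)
            tau_U[p][q] = best_m
            phi_U[p][q] *= scores[best_m]

    up(0)
    root_scores = [pi[l] * phi_U[0][l] for l in range(n_states)]
    states = [0] * len(labels)
    states[0] = max(range(n_states), key=root_scores.__getitem__)
    stack = [0]
    while stack:  # trace back from the root to the leaves
        p = stack.pop()
        kids = children[p]
        if kids:
            states[kids[0]] = tau_U[p][states[p]]
            for j, k in zip(kids, kids[1:]):
                states[k] = tau_B[j][states[p]][states[j]]
            stack.extend(kids)
    return root_scores[states[0]], states


# Made-up parameters (two states, two labels) and a five-node example tree.
pi = [0.6, 0.4]
a0 = [[0.7, 0.3], [0.2, 0.8]]
a = [[[0.5, 0.5], [0.1, 0.9]], [[0.3, 0.7], [0.6, 0.4]]]
b = [[0.9, 0.1], [0.2, 0.8]]
labels = [0, 1, 0, 1, 1]
children = [[1, 2], [3, 4], [], [], []]

best, path = most_likely_states(labels, children, pi, a0, a, b)
brute = max(joint_probability(z, labels, children, pi, a0, a, b)
            for z in product(range(2), repeat=len(labels)))
```

On this small example the dynamic-programming value `best` can be checked against `brute`, the maximum over all $2^5$ state assignments.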
3.4 Computational complexity for estimating the parameters
Out of the equations for calculating the expectation values and updating the parameters in the E- and M-steps of the learning algorithm, the most time-consuming part of this algorithm is obviously calculating γ. The time complexity for calculating γ with all possible parameter values reaches $O(|T| \cdot |S|^3 \cdot |V| \cdot |C|)$, since we must compute $O(|V| \cdot |C|)$ for each γ, which is then repeated $O(|S|^3)$ times for all possible combinations of states. However, we note that this complexity is equivalent to that of SCFGs as a maximal bound for a probabilistic model.
4 EXPERIMENTAL RESULTS
For our experiments^7 to verify the performance of PSTMM, we used glycan structure data from the KEGG GLYCAN database (Kanehisa et al., 2004). We will first describe the models with which we compare the performance of PSTMM in our experiments. Then we will describe the results of the experiments using glycan data and illustrate a multiple tree alignment produced as a result, along with the patterns found.
4.1 Other methods compared with PSTMM
In this section, we will introduce a simple probabilistic model and its corresponding mixture version, which will be compared with PSTMM. This simple model, which we call Label Pair Model (LPM), focuses on each parent–child pair by simply counting the frequency with which each pair occurs. Note that this model does not have ‘states’ because one state corresponds one-to-one with a label.
The mixture of Label Pair Models (MLPM) is a probabilistic model with the following parameters: $w[c, \sigma_h, \sigma_{h'}]$ ($\sum_h w[c, \sigma_h, \sigma_{h'}] = 1$ for each pair of c and h′) and $\pi[c, \sigma_h]$ [$\sum_h \pi[c, \sigma_h] = 1$ for each c]. For a component c, $w[c, \sigma_h, \sigma_{h'}]$ $[= P(o^u_j = \sigma_h \mid o^u_p = \sigma_{h'}, c)]$ is the conditional probability that label $\sigma_h$ is output at a node given that $\sigma_{h'}$ is output at its parent node, and $\pi[c, \sigma_h]$ $(= P(o^u_1 = \sigma_h \mid c))$ is the probability that the root label is $\sigma_h$.
Label Pair Model is simply an MLPM containing just one component, so no iteration of the EM algorithm is applied to LPM; its parameters are only calculated once. As for MLPM’s estimation procedure, interested readers may see McLachlan’s review on mixture models (McLachlan and Peel, 2000). The likelihood L for a given set of trees is given by MLPM as

$$L = \prod_u \sum_c p(c)\, \pi[c, o^u_1] \prod_j w[c, o^u_j, o^u_p],$$

where p(c) is the mixture weight of component c and p denotes the parent of node j.
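For the single-component case (LPM), estimation amounts to counting root labels and parent–child label pairs. The sketch below is a minimal Python illustration under assumptions: trees are encoded (hypothetically) as a pair `(labels, parents)` where `parents[j]` is the index of the parent of node j and `None` marks the root; the function names and the toy training trees are made up, and unseen pairs simply receive probability zero because no smoothing is described in the text.

```python
from collections import Counter, defaultdict


def train_lpm(trees):
    """Estimate root-label and child-given-parent label probabilities by counting."""
    root_counts = Counter()
    pair_counts = defaultdict(Counter)  # pair_counts[parent_label][child_label]
    for labels, parents in trees:
        for j, p in enumerate(parents):
            if p is None:
                root_counts[labels[j]] += 1
            else:
                pair_counts[labels[p]][labels[j]] += 1
    n_roots = sum(root_counts.values())
    root_prob = {lab: c / n_roots for lab, c in root_counts.items()}
    w = {par: {child: c / sum(cnt.values()) for child, c in cnt.items()}
         for par, cnt in pair_counts.items()}
    return root_prob, w


def lpm_likelihood(tree, root_prob, w):
    """Likelihood of one tree: root probability times all parent-child pair terms."""
    labels, parents = tree
    prob = 1.0
    for j, p in enumerate(parents):
        if p is None:
            prob *= root_prob.get(labels[j], 0.0)
        else:
            prob *= w.get(labels[p], {}).get(labels[j], 0.0)  # unseen pair -> 0
    return prob


# Toy training set (labels are illustrative, not real glycan entries).
training = [
    (["Man", "GlcNAc", "GlcNAc"], [None, 0, 0]),
    (["Man", "Gal"], [None, 0]),
]
root_prob, w = train_lpm(training)
score = lpm_likelihood((["Man", "GlcNAc", "Gal"], [None, 0, 0]), root_prob, w)
unseen = lpm_likelihood((["Gal", "Man"], [None, 0]), root_prob, w)
```

Because the model only looks at parent–child pairs, reordering siblings leaves the score unchanged, which is exactly the limitation that sibling-dependent models are meant to address.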
Note that for capturing patterns based on multiple parent–child relationships in a given set of trees, MLPM has the same representational power as that of PTMM. Therefore, using MLPM in our experiments to compare with the performance of PSTMM suffices to demonstrate its performance advantage.

^7 All of our experiments were performed on a Linux machine with dual Intel Xeon 3.0 GHz processors and 8 GB of RAM.
4.2 Glycan experiment
In preparing the glycan structures for our experiment, we first read in each structure’s node and edge information from the KEGG GLYCAN database and ordered the children of each node according to the hydroxyl to which they were attached. Thus, each node j corresponded to a monosaccharide, and each immediately younger sibling to j corresponded to the monosaccharide attached to the hydroxyl group immediately below the node. For example, Figure 7 illustrates how two GlcNAc nodes attached to Mannose are ordered; for G04023, the lower child is attached to hydroxyl two (2) while the upper child is attached to hydroxyl six (6).
We selected the following glycan classes as our datasets for our experiment: Glycosaminoglycan, N-Glycan, O-Glycan and Sphingolipid. The other classes were disregarded due to either an insufficient number of glycans or an insufficient average glycan size (i.e. number of nodes in each tree). We also analyzed the structures within each of the remaining four classes and purged them of any trees that did not have siblings; we only trained/tested on those structures that contained at least one sibling pair.
Our experiment consisted of a 5-fold cross-validation for the glycan structures within each class. That is, we created datasets based on class and divided each dataset into five subsets containing randomly selected tree structures from that class. We tested each subset in one round for a total of five rounds. For each test round, we trained with 50 randomly selected structures from each of the non-test sets for a total of 200 training structures, and we tested on all the structures in the test set for that round. We also tested on a corresponding negative example test set, which was a set of trees whose tree size (i.e. number of nodes) and parent–child label pair distribution were equivalent to those of the positive test set. The negative test set was thus created so that the simpler models would not be able to easily distinguish between the positive and negative test sets.
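The sampling scheme described above (five folds, 50 training structures drawn from each of the four non-test folds) can be sketched as follows. This is a minimal Python illustration, not the authors' script: the function name `five_fold_rounds`, the fixed seed and the use of integers as stand-ins for glycan structures are assumptions, and the construction of the negative test set is not covered.

```python
import random


def five_fold_rounds(items, per_fold=50, k=5, seed=0):
    """Yield (training set, test set) pairs for k rounds of cross-validation.

    Each round tests on one whole fold and trains on `per_fold` items sampled
    at random from each of the remaining folds.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for t in range(k):
        train = []
        for f in range(k):
            if f != t:
                train.extend(rng.sample(folds[f], min(per_fold, len(folds[f]))))
        yield train, folds[t]


# Stand-in data: 300 integer identifiers instead of real glycan structures.
rounds = list(five_fold_rounds(list(range(300))))
```

With 300 items each fold holds 60 of them, so every round trains on 4 × 50 = 200 items and tests on the remaining 60.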
We compared the performance of PSTMM with the two simpler models using the following parameters^8: $|T| = 200$ for training, $|S| = 10$, $|\Sigma| = 19$, $|C| = 5$ and the number of components in MLPM = 10. Note that the trees in each dataset varied in tree size, so we needed to correct the likelihood calculation for each tree accordingly. Therefore, each probability parameter value was multiplied by its size. That is, for example, we multiplied $a[\{s_q, s_m\}, s_l]$ by $|S|$. These corrected parameter values (or scores) were used to calculate the likelihood of each tree. Finally, we repeated this entire experiment five times.

^8 For $|T|$, 50 training trees were randomly selected from each of the four non-test sets. For this experiment, we allowed each state to transition to any other state. However, as the main patterns in each class are better understood, we can allow more limitations on the paths through which the states can go for transition. For $|\Sigma|$, we listed 10 monosaccharides in Table 1 as the most common in higher level organisms, but a scan through all the structures in KEGG GLYCAN revealed 19 various monosaccharides, so we set $|\Sigma|$ to 19.

Fig. 6. Score distributions for (left-to-right) N-Glycans, O-Glycans, Glycosaminoglycans and Sphingolipids.

Table 3. Glycan experiment results for (a) N-Glycans, (b) O-Glycans, (c) Glycosaminoglycans and (d) Sphingolipids using 5-fold cross-validation

Class    Measure    PSTMM    MLPM            LPM
(a)      A1         0.92     0.678 (28.5)    0.551 (45.5)
         A2         0.855    0.645 (22.5)    0.554 (33.6)
         P          0.956    0.668 (29.2)    0.557 (39.8)
(b)      A1         0.801    0.649 (11.4)    0.549 (24.4)
         A2         0.753    0.638 (10.5)    0.571 (19.2)
         P          0.841    0.627 (13.5)    0.550 (20.1)
(c)      A1         0.919    0.696 (10.3)    0.487 (24.0)
         A2         0.864    0.672 (11.3)    0.537 (23.8)
         P          0.963    0.724 (9.6)     0.489 (24.4)
(d)      A1         0.883    0.651 (14.3)    0.590 (28.0)
         A2         0.831    0.650 (12.8)    0.617 (19.0)
         P          0.929    0.641 (14.5)    0.613 (19.3)

A1, AUC; A2, accuracy; P, precision.
4.2.1 Performance results
We averaged our results over the 25 (5 × 5) runs; the averages are listed in Table 3 as area under the ROC curve (AUC) values, prediction accuracies and precisions (at a sensitivity of 0.3) for the three methods tested on four classes of glycans. An AUC value takes on a value between 0 and 1 (the higher, the better) and is defined as the false positive threshold at zero sensitivity, where the false positive threshold is based on the false positive rate, which is the proportion of the number of false positives to the total number of negative examples, and sensitivity is the proportion of the number of correctly predicted examples to the total number of positive examples. We define prediction accuracy as the threshold at which the positive and negative test scores are best discriminated, and precision is the proportion of correctly predicted examples to the number of examples predicted to be positive. For our experiment, a reasonable sensitivity value of 30% was selected.
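As a rough illustration of how such measures can be computed from positive and negative test scores, the Python sketch below uses the common rank-based estimate of the area under the ROC curve and a precision taken at the score threshold that recovers a given fraction of the positives. These are standard textbook definitions assumed here for illustration; whether they match the exact evaluation procedure used for Table 3 is not stated in the text, and the function names and scores are made up.

```python
import math


def auc(pos_scores, neg_scores):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def precision_at_sensitivity(pos_scores, neg_scores, sensitivity=0.3):
    """Precision at the highest threshold reaching the requested sensitivity."""
    # Number of positives that must score at or above the threshold.
    needed = max(1, math.ceil(sensitivity * len(pos_scores) - 1e-9))
    threshold = sorted(pos_scores, reverse=True)[needed - 1]
    true_pos = sum(1 for s in pos_scores if s >= threshold)
    false_pos = sum(1 for s in neg_scores if s >= threshold)
    return true_pos / (true_pos + false_pos)


# Made-up scores for three positive and three negative test trees.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2]
area = auc(pos, neg)
prec_low = precision_at_sensitivity(pos, neg, 0.3)
prec_full = precision_at_sensitivity(pos, neg, 1.0)
```

In this toy example eight of the nine positive–negative pairs are ordered correctly, and requiring all positives to be recovered lets one negative through, lowering the precision.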
Table 3 provides t-values^9 in parentheses, indicating that PSTMM statistically outperforms both LPM and MLPM by a significant margin. N-Glycan had the best performance among all four classes, which may be because of its large dataset size. However, we can see that even with a small dataset size such as Sphingolipid, PSTMM has a considerable performance advantage. It is apparent that there indeed exist long-range dependencies across siblings that could not be captured by any of the other methods. In considering such results, the increased time complexity is well worth the information gained from this model.
We also investigated the score distributions for each of the 25 runs in the above cross-validation. An example from each class is shown in Figure 6. Neg1 indicates the distribution of the negative test dataset while Neg2 refers to the distribution of a test dataset whose trees’ labels were generated randomly. We can see from these plots that the score distribution of Neg2 is very broad, while the distribution of Neg1 is slightly more concentrated, implying that PSTMM found parent–child relationships in Neg1, while it did not find any such dependencies in Neg2. Correspondingly, the positive test dataset (Pos) has an even higher score distribution concentrated at a higher level. Interestingly, the distributions of Neg1 and Pos are very close in O-Glycans, while they are quite distinct in Glycosaminoglycans and Sphingolipids, implying that most O-Glycans are basically parent–child dependent, while these other two classes have more complex patterns embedded within.
4.2.2 Most likely state transition for new patterns and
multiple tree alignment We analyzed the probabilities of
the states learned from our datasets to find the most likely
state transitions so that we would be able to find common
patterns in the datasets as well as to perform multiple tree
structure alignment.Figure 7 illustrates three tree structures
that PSTMMfound to have similar patterns.The state trans-
ition model learned from these structures is given below
9
t-values indicate the significance of the difference between two sets of
values;if the t-value is larger than a certain value,say 8.610,then we can
claim that the performance advantage of PSTMMis statistically significant
over another model at confidence level 99.9%.
i12
by guest on February 21, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
Application of a new probabilistic model
Fig.7.KEGGGlycans:(a) G04023,(b) G04206 and (c) G03990.The top structures are the actual glycans,and the bottomstructures are the
most likely state transition diagrams.
each glycan structure, with the state corresponding to each
monosaccharide emphasized in bold. We note a few interesting
characteristics gleaned from this model. For instance,
in the largest of these three glycans, G04206, we see many
repeated GlcNAc→Gal pairs. However, if we
look at the corresponding state diagram, not all of these
repeated pairs correspond to the same state transitions. A
closer look reveals that pairs of branches from the same
ancestor have the same state transition pattern (across siblings
within the same subtree). For example, there are two
branches emanating from the Mannose triplet at the core
of the N-Glycan structure of G04206. Referring to just the
lower Mannose subtree, there are two subtrees, both of which
contain this GlcNAc→Gal pair twice in sequence. However, the
upper branch of this sequence corresponds to a state transition
path of S6−S3−S8−S1, while the lower branch corresponds
to S5−S6−S3−S3. In the upper Mannose subtree, we find the
same pairs of monosaccharide branches and the same two sets
of state transition paths. Furthermore, the other two glycans,
G03990 and G04023, also contain the same pattern near their
leaves, revealing an overall pattern across the breadth of each
of these structures.
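The observation above, that one monosaccharide sequence can be assigned different state paths in sibling branches, can be checked mechanically once every node carries its most likely state. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Node` class and the `chain`, `paths` and `group_state_paths` helpers are hypothetical names, and the toy branches simply reuse the two state paths quoted above, with the node-by-node assignment of states to monosaccharides assumed for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One monosaccharide together with its (assumed) most likely state."""
    label: str
    state: str
    children: List["Node"] = field(default_factory=list)


def chain(labels, states):
    """Build an unbranched chain of nodes from parallel label/state lists."""
    nodes = [Node(l, s) for l, s in zip(labels, states)]
    for parent, child in zip(nodes, nodes[1:]):
        parent.children.append(child)
    return nodes[0]


def paths(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of (label, state) pairs."""
    here = prefix + ((node.label, node.state),)
    if not node.children:
        yield here
    for child in node.children:
        yield from paths(child, here)


def group_state_paths(branches):
    """Map each monosaccharide sequence to the set of state paths seen for it."""
    groups = defaultdict(set)
    for root in branches:
        for path in paths(root):
            label_seq = tuple(label for label, _ in path)
            state_seq = tuple(state for _, state in path)
            groups[label_seq].add(state_seq)
    return dict(groups)


# Toy data (assumption): two sibling branches with identical monosaccharides.
labels = ["GlcNAc", "Gal", "GlcNAc", "Gal"]
upper = chain(labels, ["S6", "S3", "S8", "S1"])
lower = chain(labels, ["S5", "S6", "S3", "S3"])

groups = group_state_paths([upper, lower])
for label_seq, state_paths in groups.items():
    print("->".join(label_seq), sorted("-".join(p) for p in state_paths))
```

A label sequence that maps to more than one state path marks branches that look identical at the monosaccharide level but that the model distinguishes by position among siblings.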
Figure 8 illustrates the full pattern that matched across these
three glycans; the intersection of the three state transition diagrams
is given along with the corresponding glycan pattern
fragment. The multiple tree alignment should be apparent
from these diagrams, as each tree is aligned according to the
common pattern found by PSTMM. However, we note an
interesting result from this alignment: the lowest
branches of these glycans are not aligned at all. Although
these lowest branches match in terms of pairs of
monosaccharides (the same GlcNAc→Gal pair), they correspond
to different states. Therefore, they are considered not to align
with each other according to their sibling relationships. This is
an interesting point for further biological investigation.
Fig. 8. The common state transition diagram and its corresponding glycan subtree.
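The alignment criterion described above, where two nodes align only if both their monosaccharide and their most likely state agree, can be illustrated by intersecting state-labelled paths across trees. This is a simplified sketch and not the PSTMM procedure itself: the `(label, state, children)` tuple encoding, the `prefixes` and `common_pattern` helpers, and the two toy trees with their state labels are all assumptions made for illustration, and sibling order is ignored.

```python
from typing import List, Set, Tuple

Tree = Tuple[str, str, list]          # (monosaccharide, state, children)
Path = Tuple[Tuple[str, str], ...]    # root-to-node sequence of (label, state)


def prefixes(tree: Tree, prefix: Path = ()) -> Set[Path]:
    """Return every root-to-node path as a tuple of (label, state) pairs."""
    label, state, children = tree
    here = prefix + ((label, state),)
    found = {here}
    for child in children:
        found |= prefixes(child, here)
    return found


def common_pattern(trees: List[Tree]) -> Set[Path]:
    """Keep only paths whose labels AND states agree in every tree."""
    shared = prefixes(trees[0])
    for tree in trees[1:]:
        shared &= prefixes(tree)
    return shared


# Toy trees (assumption): both have two GlcNAc->Gal branches under a Man node.
# The upper branches share states; the lower branches differ in state only.
t1 = ("Man", "S1", [
    ("GlcNAc", "S6", [("Gal", "S3", [])]),
    ("GlcNAc", "S5", [("Gal", "S6", [])]),
])
t2 = ("Man", "S1", [
    ("GlcNAc", "S6", [("Gal", "S3", [])]),
    ("GlcNAc", "S2", [("Gal", "S4", [])]),
])

shared = common_pattern([t1, t2])
for path in sorted(shared, key=len):
    print(" -> ".join(f"{label}[{state}]" for label, state in path))
```

In this toy case the lower branches carry the same monosaccharides in both trees but drop out of the shared pattern because their states differ, which mirrors the unaligned lowest branches noted above.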
5 CONCLUDING REMARKS
We have applied a new probabilistic model for glycan chains,
and we have demonstrated its effectiveness experimentally on actual
glycan data. We have shown promising results for performing
multiple tree structure alignments and capturing patterns in
glycan chains that were previously not possible. From the
maximum likelihood estimations for the trees in each class of
glycans, we can easily retrieve the actual patterns that are most
frequent (i.e. estimate the most likely state transition) within
each class, thus enabling us to predict long-range structural
patterns that would give insight into their function, which was
not previously possible with any other model.
ACKNOWLEDGEMENT
This work is supported in part by Grant-in-Aid for Scientific
Research on Priority Areas (C) “Genome Information
Science” from the Ministry of Education, Culture, Sports,
Science and Technology of Japan.
REFERENCES
Aoki, K.F. et al. (2003) Efficient tree-matching methods for accurate carbohydrate database queries. Genome Informatics, 14, 134–143.
Arvestad, L., Berglund, A.C., Lagergren, J. and Sennblad, B. (2003) Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics, 19, i7–i15.
Bertozzi, C.R. and Kiessling, L.L. (2001) Carbohydrates and glycobiology review: chemical glycobiology. Science, 291, 2357–2364.
Csürös, M. (2002) Fast recovery of evolutionary trees with thousands of nodes. J. Comput. Biol., 9, 277–297.
Diligenti, M., Frasconi, P. and Gori, M. (2003) Hidden tree Markov models for document image classification. IEEE Trans. PAMI, 25, 519–523.
Drickamer, K. (1988) Two distinct classes of carbohydrate-recognition domains in animal lectins. J. Biol. Chem., 263, 9557–9560.
Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998) Biological Sequence Analysis. Cambridge University Press, Cambridge.
Fine, S., Singer, Y. and Tishby, N. (1998) The hierarchical hidden Markov model: analysis and applications. Mach. Learn., 32, 41–62.
Friedman, N. (1998) The Bayesian Structural EM Algorithm. In UAI-98, Morgan Kaufmann Publishers, San Francisco, pp. 129–138.
Höchsmann, M., Töller, T., Giegerich, R. and Kurtz, S. (2003) Local similarity in RNA secondary structures. Proceedings of the IEEE Computer Society Bioinformatics. IEEE Computer Society Press, Washington DC, pp. 159–168.
Jansson, J. and Lingas, A. (2001) A fast algorithm for optimal alignment between similar ordered trees. LNCS, 2089, 232–240.
Jiang, T., et al. (1995) Alignment of trees - an alternative to tree edit. Theoret. Comput. Sci., 143, 137–148.
Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y. and Hattori, M. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277–D280.
Knudsen, B. and Hein, J. (1999) RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics, 15, 446–454.
Krogh, A., Brown, M., Mian, I.S., Sjolander, K. and Haussler, D. (1994) Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol., 235, 1501–1531.
McLachlan, G. and Peel, D. (2000) Finite Mixture Models. John Wiley & Sons, Inc., New York.
Sakakibara, Y. (2003) Pair hidden Markov models on tree structures. Bioinformatics, 19, i232–i240.
Sakakibara, Y., Brown, M., Hughey, R., Mian, I.S., Sjolander, K., Underwood, R.C. and Haussler, D. (1994) Stochastic context-free grammars for tRNA modeling. Nucleic Acids Res., 22, 5112–5120.
Sjölander, K. (1998) Phylogenetic inference in protein superfamilies: Analysis of SH2 domains. Proceedings of the 6th ISMB. AAAI Press, pp. 165–174, ISBN 0-1-57735-053-7.
Ueda, N., Aoki, K.F. and Mamitsuka, H. (2004) A general probabilistic framework for mining labeled ordered trees. Proceedings of the Fourth SIAM International Conference on Data Mining. pp. 357–368.
Varki, A. (1997) Sialic acids as ligands in recognition phenomena. FASEB J., 11, 248–255.
Varki, A., et al. (eds) (1999) Essentials of Glycobiology. Cold Spring Harbor Laboratory Press, New York.