Supplementary Data - Bioinformatics

© Oxford University Press 2011

Supplementary Information for:

In Silico Identification Software (ISIS): A Machine Learning Approach to Tandem Mass Spectral Identification of Lipids

Lars J. Kangas^1,*, Thomas O. Metz^2, Giorgis Isaac^2, Brian T. Schrom^3, Bojana Ginovska-Pangovska^4, Luning Wang^5, Li Tan^5, Robert R. Lewis^5 and John H. Miller^5

^1 Computational and Statistical Analytics Division, Pacific Northwest National Laboratory, P.O. Box 999, Richland, WA 99352, lars.kangas@pnnl.gov

^2 Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA

^3 Chemical, Biological and Physical Sciences Division, Pacific Northwest National Laboratory, Richland, WA

^4 Chemical and Material Sciences Division, Pacific Northwest National Laboratory, Richland, WA

^5 School of Electrical Engineering and Computer Science, Washington State University, Richland, WA


LTQ Linear Ion Trap Model

Additional equations implemented in the ISIS algorithm for the LTQ linear ion trap model are listed below.

1. Initial internal thermal energy prior to dipole excitation.

The ions gain some internal energy at the source, both with and without an N2 nebulizer gas, but lose it through collisional cooling before excitation. The initial temperature is then estimated to have a mean of 298 K, which is sampled according to the approximately Gaussian distributions calculated by Drahos and Vékey (1999). The mean internal energy per oscillator is

$E_{therm} = C_{peptide} \, k_B \, T$,    (1.1)

where k_B is the Boltzmann constant, 8.617343E-05 eV/K, and C_peptide is a temperature- and frequency-dependent factor, ~0.2 for peptides, with temperature T in kelvin (Drahos and Vékey, 1999),

[…],    (1.2)

where T is the temperature in kelvin.

Assuming only the vibrational oscillation is dominant, the mean internal energy is then E_therm times the number of vibrational degrees of freedom, or

$E_{mean} = E_{therm} \, (3n - 6)$,    (1.3)

where n is the number of atoms in the ion.
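As a rough numerical illustration of the mean internal energy described above, the following sketch multiplies a per-oscillator energy of C × k_B × T by the 3n − 6 vibrational degrees of freedom of a non-linear ion with n atoms. It assumes a fixed factor C = 0.2 (the temperature dependence of the factor is not modeled), and the function and parameter names are hypothetical.

```python
K_B_EV_PER_K = 8.617343e-05  # Boltzmann constant in eV/K, as quoted in the text


def mean_internal_energy(n_atoms, temperature_k=298.0, c_factor=0.2):
    """Mean vibrational internal energy in eV for an ion with n_atoms atoms.

    ASSUMPTIONS: c_factor is held constant at ~0.2, and the ion is
    non-linear, so it has 3n - 6 vibrational degrees of freedom.
    """
    e_therm = c_factor * K_B_EV_PER_K * temperature_k  # energy per oscillator
    return e_therm * (3 * n_atoms - 6)


if __name__ == "__main__":
    # A hypothetical 100-atom ion at 298 K holds roughly 1.5 eV on average.
    print(mean_internal_energy(100))
```

The width and shape of the distribution around this mean are not covered by the sketch.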

The width W of the distribution is

[…].    (1.4)

The internal energy distribution of the ion is Gaussian-like (Drahos and Vékey, 1999) and is defined at energy E by

[…]    (1.5)


2. Time increment per collision.

The Kinetic Monte Carlo (KMC) simulation requires a rate of collisions, from which the colliding ion is selected. Time moves forward only indirectly, by the reciprocal of this rate, yet the simulation also keeps track of each ion's cumulative time to see whether the average corresponds to the excitation time.

KMC simulations increment time by −log(u)/R, where R is the total rate for all possible events in the system, and u is a uniformly sampled random number in [0, 1]. KMC thus uses two random numbers, one to select the event and one to move time forward.
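The two-random-number scheme described above can be sketched as follows. This is a generic KMC step, not the ISIS implementation: the function name, the list-of-rates interface and the use of Python's `random` module are assumptions made for illustration.

```python
import math
import random


def kmc_step(rates, rng=random):
    """Select one event and a time increment from a list of event rates.

    Returns (event_index, dt). Generic sketch; names are hypothetical.
    """
    total_rate = sum(rates)  # R: total rate of all possible events

    # First random number: choose an event with probability rate / R.
    threshold = rng.random() * total_rate
    cumulative = 0.0
    chosen = len(rates) - 1  # fallback guards against rounding at the top end
    for index, rate in enumerate(rates):
        cumulative += rate
        if threshold < cumulative:
            chosen = index
            break

    # Second random number: advance time by -log(u) / R.
    # 1 - random() lies in (0, 1], which avoids log(0).
    u = 1.0 - rng.random()
    dt = -math.log(u) / total_rate
    return chosen, dt


if __name__ == "__main__":
    rng = random.Random(0)
    print(kmc_step([1.0, 2.0, 3.0], rng))
```

Averaged over many steps, the time increment approaches 1/R, which is why time only moves forward indirectly in such a simulation.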

For a collision between an ion and a target atom, a larger ion moving with larger velocity in a higher number density n of the collision gas will have a higher probability of collisions.


The collisional cross-section, σ, for an ion and target atom is

[…],    (2.1)

where r_1 is the van der Waals radius of the target gas and the radius of the ion, r_2, is

[…],    (2.2)

where R_i is the radius of each atom i in the ion.

The sampled mean free time, τ, can be defined from the gas number density, collisional cross-section, and velocity as

$\tau = \frac{1}{n \, \sigma \, v} \, (-\ln u)$,    (2.3)

where u is a random number in [0, 1], σ is given by Equation (2.1), v is the relative velocity between an ion and a target atom, and n, the number density of the gas, is defined by

[…].    (2.4)

The last term in Equation (2.3), −ln u, provides a sampling of the distribution. The probability of selecting that collision in a KMC simulation is […].
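To make the role of the collisional cross-section and the sampled mean free time concrete, here is a minimal sketch. It assumes the common hard-sphere cross-section π(r_1 + r_2)^2 and an exponentially distributed free time with mean 1/(nσv); both forms, the SI units and all function names are assumptions for illustration and are not taken from the ISIS code.

```python
import math
import random


def hard_sphere_cross_section(r_target, r_ion):
    """ASSUMED hard-sphere cross-section for two spheres of the given radii."""
    return math.pi * (r_target + r_ion) ** 2


def sample_free_time(number_density, cross_section, rel_velocity, rng=random):
    """Sample a time to the next collision.

    The collision rate is n * sigma * v; the -log(u) factor samples the
    exponential distribution of free times around the mean 1 / rate.
    """
    u = 1.0 - rng.random()  # u in (0, 1], avoids log(0)
    return -math.log(u) / (number_density * cross_section * rel_velocity)


if __name__ == "__main__":
    # Illustrative values only (metres, m^-3, m/s); not instrument settings.
    sigma = hard_sphere_cross_section(1.4e-10, 6.0e-10)
    print(sample_free_time(3.0e19, sigma, 500.0, random.Random(0)))
```

A larger ion, a faster ion or a denser gas all raise n × σ × v and therefore shorten the sampled free time, in line with the qualitative statement above.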


3. Cooling schedule for internal temperature decrease (Zhang, 2004).

T_eff, the effective internal temperature, decreases exponentially to the temperature of the buffer gas, T_0,

[…],    (3.1)

where t is the elapsed time after the precursor has been fragmented and, for an ion with mass M, r_c is

[…],    (3.2)

where the reference rate is the cooling rate of an ion of mass 1000 u and c is a constant. After an optimization in the paper, the reference rate is 104.6 s^-1 and c = 0.74.


4. Low mass cutoff.

The ion trap has a relatively high low mass cutoff (LMCO) in m/z, below which no ions are detected. It is defined from q, the activation value, by

[…].    (4.1)

Molecule Vector Encoding


The algorithm represents molecules as undirected graphs of atoms and bonds stored in adjacency matrices (Faulon, 2005). These are processed to tree structures, from which vectors can be encoded for machine learning (Schietgat, 2008).

To make a vector from an adjacency matrix of a molecule, one atom is selected as a root vertex and the remaining atoms and bonds are processed in a breadth-first search up to tree depth eight (a neighborhood size), with atoms being vertices and bonds being edges. The path through the tree from the root to a given atom, considering every atom and bond in the path, is calculated to a vector index, i.e., the offset to an element in a vector. The vector element at that index is incremented by one (all elements are initially 0). A vector element is incremented once for each atom in the neighborhood.


Assuming that the root atom is in the set {C,H,O,N,S,P}, then without restrictions, any of these atom types could be bonded to other atoms in {C,H,O,N,S,P} by a bond order in {1,2,3}. Thus, the unrestricted index possibilities are the cardinality of {C,H,O,N,S,P} times the cardinality of {1,2,3} times the cardinality of {C,H,O,N,S,P}, or 6 x 3 x 6 = 108. This is the number of unique indices in a neighborhood of two. A neighborhood of size n has 6 x (3 x 6)^(n-1) unique indices, a number that grows too large for machine learning with growing n if every unique index is an element in a vector. Fortunately, different atom types restrict the number of bonds they establish. Further, nature is kind in that the majority of combinations of atom types and bond orders do not occur. This results in a sparse space that can be compressed without loss of information.
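The growth of the unrestricted index space can be checked with a few lines of arithmetic. The sketch below simply evaluates 6 x (3 x 6)^(n-1) for neighborhood sizes one to eight; the constant and function names are hypothetical.

```python
ATOM_TYPE_COUNT = 6   # |{C, H, O, N, S, P}|
BOND_ORDER_COUNT = 3  # |{1, 2, 3}|


def unrestricted_index_count(n):
    """Number of unique unrestricted indices for a neighborhood of size n."""
    return ATOM_TYPE_COUNT * (BOND_ORDER_COUNT * ATOM_TYPE_COUNT) ** (n - 1)


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, unrestricted_index_count(n))
    # n = 2 gives 108; n = 8 gives 3673320192
```

At n = 8 the count exceeds 3.6 billion, which illustrates why a vector with one element per possible index is impractical.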

Table S1 shows that, as the radii of neighborhoods increase, the maximum possible indices grow exponentially, but because only a small fraction of the indices are observed, a neighborhood vector only needs these indices. These are defined here as packed indices, as in packed arrays, where indices are limited to only those of interest. The packed indices are defined from the training set of 22 lipids, where every atom in a lipid in turn was rooted, and the indices were calculated for other atoms in the neighborhoods.
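One way to picture packed indices is as a lookup table from the raw indices observed in training to consecutive positions in a short vector. The sketch below is an illustration under that assumption; the sparse dictionary representation and the function names are hypothetical, and raw indices that were never observed are simply dropped.

```python
def build_packed_index_map(training_vectors):
    """Map every raw index observed in training to a packed position.

    training_vectors: iterable of dicts {raw_index: count}.
    """
    observed = sorted({raw for vector in training_vectors for raw in vector})
    return {raw: position for position, raw in enumerate(observed)}


def pack_vector(sparse_vector, packed_map):
    """Convert a sparse raw-index vector to a dense packed vector."""
    dense = [0] * len(packed_map)
    for raw, count in sparse_vector.items():
        position = packed_map.get(raw)
        if position is not None:  # indices unseen in training are discarded
            dense[position] += count
    return dense


if __name__ == "__main__":
    packed_map = build_packed_index_map([{0: 1, 6: 3}, {0: 1, 8: 1}])
    print(pack_vector({6: 2, 999: 1}, packed_map))  # -> [0, 2, 0]
```

The dense vector length equals the number of distinct indices observed in training, regardless of how large the unrestricted index space is.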


Table S1. Radius effect on vector lengths.


Figure S1 shows that as more LIPID MAPS lipids are added at neighborhood sizes one to eight, the number of observed indices increases. These increases will all approach asymptotes where no additional indices are observed when new lipids are added (the steps in the curves are the result of adding “new” atom/bond configurations for the first time).




Fig. S1. Number of observed vector indices.

The radius 8 neighborhood in Figure S1 has 6550 observed indices from a filtered set of 18,399 LIPID MAPS lipids (filter explained below), while Table S1 showed only 627 observed indices in the 22 training lipids. In the algorithm test (see Results section in the paper), in silico spectra for all LIPID MAPS lipids were generated using only 627-element vectors. Despite this apparent shortcoming, the rank tests (~identifications) of lipids are remarkably good, possibly because the 627 indices include the most important atoms and bonds in the near neighborhoods needed for predicting bond cleavage temperatures. As more metabolites are added to the training set, the discrepancy between the number of indices used and the number in a database should decrease. It is possible that the neighborhood should be increased with more complex metabolites.

It can be shown that a neighborhood can be too small to capture important properties in molecules, as the examples below illustrate. First, observe that the vectors are used in the ANN to encode the neighborhoods around bonds in calculations for predicting bond cleavage rates that in turn affect ion intensities in spectra. A bond is defined by its two atoms, each rooted to make two separate vectors as if the bond of interest did not exist. Both vectors are input to the ANN.

Figure S2 shows two lysophosphocholine (LPC) lipids. The first is an ester LPC, and the second is an ether LPC. Observe that the double bonded oxygen in the neutral fragment is six atoms removed from the “184” bond (the 184 Da ion is a diagnostic marker for phosphocholine containing lipids and corresponds to the head group).



Fig. S2. Ester and ether LPC lipids.


Table S2 shows how the experimental 184 Da ion intensities vary for different ester/ether LPC and PC lipids. As the number of ethers increases and the number of esters decreases, the relative intensity of the 184 Da ion decreases. The double bonded oxygen in the fatty acid(s) would not be “visible” to the ANN with a neighborhood radius less than six, and the ANN could not predict the 184 Da intensity difference to separate these two lipid subclasses.







Table S2. Relative 184 Da ion intensities in ester and ether lipids.



Finally, to generate a vector, an element value is 0 if no atom has the index in the tree and is 1 if one atom has the index. An element can be greater than 1 if the index occurs more than once. For example, if a carbon as the root is bonded to three other carbons, each by a bond order of one, then these three carbons are indistinguishable and thus have the same index; the vector will have a value 3 at this index.

Equations and pseudocode are provided below, describing the algorithm encoding atoms in a molecule using a breadth-first traversal of a tree structure. The equations show explicitly the calculations of the first four atom positions from a cleaved bond.


Index
1

= Atom
1

Index
2

=

















+ Atom
1



ATypes



BTypes

+ (Bond
1,2



1)


ATypes

+ Atom
2

Index
3

=















+ (Atom
1



ATypes



BTypes

+ (Bond
1,2


1)


ATypes

+ Atom
2
)



ATypes



BTypes

+ (Bond
2,3


1)


ATypes

+ Atom
3

Index
4

=
















+ ((Atom
1



ATypes



BTypes

+ (Bond
1,2


1)


ATypes

+ Atom
2



ATypes



BTypes

+ (Bond
2,3



1)


ATypes

+ Atom
3
)



ATypes



BTypes

+ (Bond
3,4


1)


ATypes

+ Atom
4

. . .,


where ATypes is the number of atom types, BTypes is the number of bond types, Bond_{n,n+1} is the bond order from the atom at tree depth n to the atom at tree depth n+1, and Atom_n is the enumeration of the atom type at tree depth n, for example C = 0, H = 1, O = 2, etc.

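The encoding algorithm can be defined recursively as

Base case for n = 1:

$\mathrm{Base}[1] = \mathrm{Atom}_1$

Recursive case n > 1:

$\mathrm{Base}[n] = \mathrm{Base}[n-1] \cdot \mathrm{ATypes} \cdot \mathrm{BTypes} + (\mathrm{Bond}_{n-1,n} - 1) \cdot \mathrm{ATypes} + \mathrm{Atom}_n$

$\mathrm{Index}_n = [\ldots] + \mathrm{Base}[n]$.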


The recursive definition is more attractive in programming code: if the base values are remembered, the indices do not need to be recomputed from the root atom for each atom as the encoding progresses.
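The recursive definition lends itself to a breadth-first traversal in which each queued atom carries its own base value. The sketch below illustrates this under several assumptions that are not confirmed by the text: the leading offset term of each index is taken to be the count of all possible indices at shallower depths, the atom enumeration beyond C = 0, H = 1, O = 2 is arbitrary, the molecule is given as a list of element symbols plus a bond dictionary, and ring atoms are visited once. All names are hypothetical.

```python
from collections import Counter, deque

# Enumeration of atom types; the order beyond C, H, O is an assumption.
ATOM_TYPES = {"C": 0, "H": 1, "O": 2, "N": 3, "S": 4, "P": 5}
A_TYPES = len(ATOM_TYPES)  # ATypes
B_TYPES = 3                # BTypes: bond orders 1, 2, 3


def depth_offset(depth):
    """ASSUMED offset: number of possible indices at all shallower depths."""
    return sum(A_TYPES * (A_TYPES * B_TYPES) ** (k - 1) for k in range(1, depth))


def encode_neighborhood(atoms, bonds, root, max_depth=8, skip_bond=None):
    """Return a sparse vector (Counter: raw index -> count) for a rooted atom.

    atoms: list of element symbols; bonds: dict {(i, j): bond_order}.
    skip_bond: the bond of interest, treated as if it did not exist.
    """
    adjacency = {i: [] for i in range(len(atoms))}
    for (i, j), order in bonds.items():
        if skip_bond is not None and {i, j} == set(skip_bond):
            continue
        adjacency[i].append((j, order))
        adjacency[j].append((i, order))

    vector = Counter()
    queue = deque([(root, 1, ATOM_TYPES[atoms[root]])])  # Base[1] = Atom_1
    visited = {root}
    while queue:
        atom, depth, base = queue.popleft()
        vector[depth_offset(depth) + base] += 1  # Index_n = offset + Base[n]
        if depth == max_depth:
            continue
        for neighbor, order in adjacency[atom]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            # Base[n] = Base[n-1] * ATypes * BTypes + (Bond - 1) * ATypes + Atom_n
            child_base = (base * A_TYPES * B_TYPES
                          + (order - 1) * A_TYPES
                          + ATOM_TYPES[atoms[neighbor]])
            queue.append((neighbor, depth + 1, child_base))
    return vector


if __name__ == "__main__":
    # A carbon root bonded to three carbons by single bonds.
    print(encode_neighborhood(["C", "C", "C", "C"],
                              {(0, 1): 1, (0, 2): 1, (0, 3): 1}, root=0))
```

With the example input, the three neighboring carbons share one index and that element holds the value 3, matching the behavior described earlier for indistinguishable atoms.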


Overfitting the Artificial Neural Network

Large weight sets in ANNs relative to the number of available training vectors are associated with overfitting, which results in a tendency for ANNs to learn the training data well but then not to generalize this knowledge to novel data. Overfitting is usually observed when ANNs give perfect or near-perfect predictions/classifications for each training pair (the inputs and outputs in the training set), but when new inputs are applied to the ANN, the outputs can deviate, sometimes extremely, from the expected outputs.


When training ANNs using, for example, the backpropagation algorithm, this overfitting can be measured using cross-validation and/or testing sets. These sets, for which both input and output vectors are available, are held out from learning. A cross-validation set is tested in parallel to the training set during learning to recognize if the ANN over-memorizes the training data. Over-memorization, or loss of generality in the ANN, is observed when the error decreases against the training set and the error against the cross-validation set increases. A testing set will give the same information after the learning is completed by comparing its error against that of the training set.

Two often used methods to mitigate overfitting in a feedforward ANN are to reduce the size of the hidden layer when possible or to stop learning early when it is observed that the cross-validation error starts to increase. Other methods include adding more training data or reconfiguring input/output vectors.
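Early stopping on a cross-validation error can be sketched as a simple scan over per-epoch errors. This is a generic illustration and not the training procedure used for ISIS; the function name and the `patience` parameter (how many epochs without improvement are tolerated) are assumptions.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return the 0-based epoch with the lowest validation error seen
    before training would be stopped.

    Training stops once `patience` epochs pass without a new best error.
    """
    best_epoch = 0
    best_error = float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # validation error has stopped improving
    return best_epoch


if __name__ == "__main__":
    errors = [0.9, 0.6, 0.5, 0.55, 0.6, 0.7, 0.4]
    print(early_stopping_epoch(errors, patience=3))  # -> 2
```

In practice the weights saved at the returned epoch would be kept, since later epochs fit the training set better while fitting the held-out set worse.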

The ANN in ISIS does not appear to overfit the training data as observed in testing. A different explanation of overfitting from the one above is that vectors input to an ANN in testing or in usage deviate significantly from the input vectors used in training, which in turn can result in unexpected outputs. The scheme encoding atoms and bonds in molecules to vectors calculates specific vector indices. These vectors are packed in ISIS to only indices that are observed in the training set. When ISIS is tested or put to use, novel molecules are only encoded to the packed vector indices. Thus, the ANN will not see vector elements that were not in the packed training vectors. If the novel molecules result in calculated atom/bond indices that are not in the packed vectors, then these indices are discarded.

Discarding descriptions of novel molecules is possible because the near neighborhoods are encoded. The training set has many examples of the first atoms defining a bond, for example a C-C bond or a C-O bond. Each molecule in the training set provides many examples of these atom pairs, which are the most important in describing the bond strengths. Next, as the defining bond neighborhood increases to, for example, C(H2C(H)-OH, these are seen a little less often in the training set. As the neighborhood increases and the atom/bond defining examples decrease in the training set, the atoms/bonds further away influence the bond strength less and less. As a consequence, the weights in the ANN from the vector elements corresponding to atoms/bonds more remote from the bond in focus should tend towards zero values and will in turn cause fewer surprises in the ANN output (we say “should” because we believe this to be true if we used a large number of training exemplars and a long training time).

Further, the vectors encoding the atoms and bonds are discrete valued, each only having at the most three levels: three is the maximum number of the same atom type that one atom can covalently bond to with bond order one (for example, when one carbon is bonded to three carbons, each by bond order one). The discrete levels reduce the vector space significantly compared to the space that would have been possible with real-valued input vectors. Also, to further mitigate the effect of overfitting that is caused by too little data relative to the number of weights in the ANN, the number of training vectors is doubled by a simple trick: as described, one atom/bond vector encodes one side of a bond and a second vector the other side of the bond, but to predict when bonds will cleave, the choice of the sides does not matter. So, both in training and in usage, the two sides are swapped to generate two separate input vectors to the ANN. The prediction using the ANN is taken as the mean of the two different predictions.
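The side-swapping trick at prediction time can be sketched as follows. The sketch assumes that the two side vectors are concatenated to form the network input and that `predict` is any callable returning a single number; both are assumptions for illustration, and the names are hypothetical.

```python
def predict_symmetric(predict, side_a, side_b):
    """Average the predictions for both orderings of the two bond sides.

    predict: callable taking one input vector (a list) and returning a float.
    side_a, side_b: neighborhood vectors for the two atoms of the bond.
    """
    forward = predict(list(side_a) + list(side_b))
    swapped = predict(list(side_b) + list(side_a))
    return (forward + swapped) / 2.0


if __name__ == "__main__":
    # Toy stand-in for a trained network: an asymmetric weighted sum.
    toy_model = lambda x: 2.0 * x[0] + 1.0 * x[1]
    print(predict_symmetric(toy_model, [1.0], [3.0]))  # -> 6.0
```

Because both orderings are averaged, the result does not depend on which atom of the bond is labeled as the first side.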


The explanation above for why the ANN in ISIS does not appear to overfit the training data is described in terms of the influence each individual input has on the ANN. Typically, when considering overfitting, one needs to also consider that subsets of input elements can be novel in testing and usage. However, because the scheme of calculating the indices for atoms in ISIS is based on the paths from rooted atoms next to the bonds in focus, through all bond orders and atom types to an atom, and these paths generate unique indices, there cannot be input elements that can be combined into different sets in the normal sense.


Comparison: ISIS to MetFrag

We compared the performance of ISIS in the ranking of candidate molecules from LipidMaps to that of MetFrag (Wolf, 2010; Hildebrandt, 2011). This comparison used the 45 test lipids in Table 2 in the paper.

The LipidMaps database is too large to load for the online MetFrag application. This was solved by giving MetFrag a “database” of only the candidates from LipidMaps to rank that were within ±500 ppm mass margins of the precursor adduct mass of the test lipid. Our initial test of ISIS found an average of 18 candidates in LipidMaps for each test lipid. For larger lipids, the online MetFrag timed out the user before all candidates were processed. We solved this by reducing the candidates to an average of 8 candidates for each test lipid, or a total of 360 candidates.
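Selecting candidates within a ppm mass margin amounts to a relative mass comparison. The sketch below illustrates the filter with made-up candidate names and masses; the function names and the (name, mass) tuple format are assumptions.

```python
def within_ppm(candidate_mass, precursor_mass, tolerance_ppm=500.0):
    """True if the candidate mass lies within +/- tolerance_ppm of the precursor."""
    deviation_ppm = abs(candidate_mass - precursor_mass) / precursor_mass * 1e6
    return deviation_ppm <= tolerance_ppm


def select_candidates(database, precursor_mass, tolerance_ppm=500.0):
    """Return the names of database entries inside the mass window.

    database: iterable of (name, mass) tuples.
    """
    return [name for name, mass in database
            if within_ppm(mass, precursor_mass, tolerance_ppm)]


if __name__ == "__main__":
    # Hypothetical entries; the masses are illustrative, not LipidMaps values.
    database = [("cand_a", 496.34), ("cand_b", 496.58),
                ("cand_c", 496.60), ("cand_d", 510.36)]
    print(select_candidates(database, 496.34))  # -> ['cand_a', 'cand_b']
```

At a precursor mass near 496, a ±500 ppm margin corresponds to a window of roughly ±0.25 mass units.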

Having first observed the limitations with the online MetFrag, we tried unsuccessfully to use the downloadable version of the software, but had to revert to using the online application with the adjustments of the database and candidates.

Both ISIS and MetFrag compare observed spectra against candidate spectra and return rank lists with scores in the 0 to 1 range, representing the similarities between the observed spectra and their respective candidate spectra. MetFrag normalizes each rank list such that the top ranked candidate receives a score of 1.0. ISIS provides the users the actual scores (R-squares) from comparing the pairs of spectra, i.e., the observed spectrum against each candidate in silico spectrum. The actual scores give the users the option to not trust any ranked candidate if its score is low, for example because the observed spectrum is poor (it contains more than one species) or because the true candidate is not present in the database. This information is lost to the users if the scores are normalized.

Fig. S3 shows the distributions of the scores ISIS and MetFrag assign to the 360 candidates for the 45 lipids. For each algorithm the scores are divided into the true (TP) and false (TN) candidates. The figure shows that ISIS assigns significantly more very low scores to false candidates and assigns significantly fewer high scores to false candidates compared to MetFrag. ISIS and MetFrag give similar high scores to true candidates (the curves overlap on the right side of the graph), but a third curve overlapping the first two (on the right side) shows that MetFrag also assigns a significant number of high scores to false candidates.



Fig. S3. Distributions of ISIS and MetFrag scores for rank list candidates. The scores are separated into those from true and false candidates.


That MetFrag has a high false positive rate can be seen in Table S3. We set an arbitrary 0.5 cutoff level as if each algorithm performed binary classifications: a low score predicts a false candidate and a high score predicts a true candidate (Figure S3 shows that any cutoff between 0.3 and 0.7 would yield similar results). MetFrag calls 132 false positives (ISIS 47) of the 244 false candidates, giving MetFrag a low specificity of 0.459 compared to 0.807 for ISIS.
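The quoted specificities follow directly from the false positive counts, since specificity is the fraction of false candidates that are correctly scored below the cutoff. A short check, with a hypothetical function name:

```python
def specificity(false_positives, total_negatives):
    """Specificity = true negatives / all negatives."""
    true_negatives = total_negatives - false_positives
    return true_negatives / total_negatives


if __name__ == "__main__":
    print(round(specificity(132, 244), 3))  # MetFrag -> 0.459
    print(round(specificity(47, 244), 3))   # ISIS    -> 0.807
```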


Table S3. Statistics on ISIS and MetFrag assuming a binary classification with a cutoff set at score 0.5.



Following the References below, in Table S4, we list all 360 candidates for the 45 test lipids with the scores from ISIS and MetFrag. Each best correct rank, shown for MetFrag, is computed by recognizing that certain variations in lipids produce very similar if not identical spectra in an ion trap spectrometer, the instrument that generated the spectra of the test lipids. For example, in the table for test lipid number 0, we assigned the correct candidate a rank 4 for MetFrag (that correct candidate has the 5th highest MetFrag score).


The reader may observe that ISIS does not have the same scores for multiple candidates when these candidates should have the same spectra. This is explained by ISIS performing non-deterministic Monte Carlo simulations to generate in silico spectra. ISIS, even when generating the same in silico spectra multiple times from the same molecule, will make small variations. These variations will decrease if we add more molecule copies into the Monte Carlo simulations, and indeed it is in our forward path to multithread the ISIS algorithm, which in turn will make it computationally more feasible to run more molecule copies in the simulations.

It is also in our forward path to give ISIS the capability to recognize more classes of lipids and metabolites in general. At this time we compared ISIS to MetFrag on a small set of lipid classes. To our knowledge, MetFrag does not exclude these lipid classes from its ranking capability.

REFERENCES

Drahos, L. and Vékey, K. (1999) Determination of the Thermal Energy and its Distribution in Peptides. J. Am. Soc. Mass Spectrom., 10, 323-328.

Faulon, J.L., et al. (2005) Enumerating Molecules. In Lipkowitz, K., Larter, R. and Cundari, T.R. (eds.), Reviews in Computational Chemistry Vol. 21. John Wiley & Sons, Hoboken, New Jersey.

Hildebrandt, C., et al. (2011) Database supported candidate search for Metabolite identification. J. Integrative Bioinformatics, 8(2).

Schietgat, L., et al. (2008) An efficiently computable graph-based metric for the classification of small molecules. Proc. of the 11th International Conference on Discovery Science (LNAI 5525), pp. 197-209.

Wolf, S., et al. (2010) In silico fragmentation for computer assisted identification of metabolite mass spectra. BMC Bioinformatics, 11, 148.

Zhang, Z. (2004) Prediction of Low-Energy Collision-Induced Dissociation Spectra of Peptides. Anal. Chem., 76, 3908-3922.




Table S4. ISIS and MetFrag scores for 360 candidate molecules for 45 test lipids.

