BIOINFORMATICS
Vol. 19 no. 17 2003, pages 2283–2293
DOI: 10.1093/bioinformatics/btg315
Metabolomics spectral formatting, alignment and conversion tools (MSFACTs)
Anthony L. Duran, Jian Yang, Liangjiang Wang and Lloyd W. Sumner

The Samuel Roberts Noble Foundation, Inc., Ardmore, OK 73402, USA
Received on August 23, 2002; revised on February 12, 2003; accepted on April 7, 2003
ABSTRACT
Motivation: The amplified interest in metabolic profiling has generated the
need for additional tools to assist in the rapid analysis of complex data sets.
Results: A new program, metabolomics spectral formatting, alignment and
conversion tools (MSFACTs), is described here for the automated import,
reformatting, alignment, and export of large chromatographic data sets to
allow more rapid visualization and interrogation of metabolomic data. MSFACTs
incorporates two tools: one for the alignment of integrated chromatographic
peak lists and another for extracting information from raw chromatographic
ASCII formatted data files. MSFACTs is illustrated in the processing of GC/MS
metabolomic data from different tissues of the model legume plant, Medicago
truncatula. The results document that various tissues such as roots, stems,
and leaves from the same plant can be easily differentiated based on
metabolite profiles. Further, similar types of tissues within the same plant,
such as the first to eleventh internodes of stems, could also be
differentiated based on metabolite profiles.
Availability: Freely available upon request for academic and non-commercial
use. Commercial use is available through licensing agreement
(http://www.noble.org/PlantBio/MS/MSFACTs/MSFACTs.html).
Contact: lwsumner@noble.org
INTRODUCTION
Completion of many genome sequences and advances in instrumental technologies
have revolutionized the scale upon which scientists now view and query
biology. This view has expanded towards a more holistic or systems biology
perspective (Ideker et al., 2001a). New technological approaches that include
high throughput comprehensive profiling of gene expression products are now a
viable means of elucidating gene function and biological responses to
environmental stimuli (Ideker et al., 2001b). These profiling efforts are
broadly classified as transcriptomics, proteomics, and metabolomics (Fiehn
et al., 2000; Huhman and Sumner, 2002; Roessner et al., 2001; Sumner et al.,
2002, 2003; Trethewey, 2001; Trethewey et al., 1999). Although transcriptomics
and proteomics are quite advanced, metabolomics is still in its infancy but
may be key to the understanding of gene function (Hall et al., 2002;
Trethewey, 2001). Currently, most metabolomics approaches rely on hyphenated
chromatographic separation techniques coupled to mass spectrometry such as gas
chromatography (GC/MS), liquid chromatography (LC/MS) or capillary
electrophoresis (CE/MS). Other approaches include nuclear magnetic resonance
(NMR) or optical spectroscopic techniques such as infrared (IR) spectroscopy.
To whom correspondence should be addressed.
Although there has been significant progress in the acquisition of
quantitative and qualitative metabolomic data, less effort has been devoted to
methods of data extraction, visualization, and interpretation. Currently, the
conversion of data output from commercial instruments into a form that can be
used for further data interrogation and visualization is very time-consuming
and is performed manually using commercial spreadsheets. In this report we
introduce a new program, metabolomics spectral formatting, alignment and
conversion tools (MSFACTs). MSFACTs has been developed to provide an
automated, rapid, and flexible means of reducing large complex
chromatographic/spectrometric data sets generated in metabolomic studies into
well-organized, two-dimensional matrices that can then be easily processed and
visualized using commercial statistical software.
The functionality of MSFACTs is illustrated within the scope of metabolome
analyses of various tissues of Medicago truncatula. The results illustrate the
utility of this program and show that different tissues such as roots, stems
and leaves of the same plant can be easily differentiated based on metabolite
profiles. Further, similar tissue types within the same plant, such as the
first to eleventh internodes of the aerial tissue, could also be
differentiated based on metabolite profiles. To the best of our knowledge,
this is the first time that metabolomic approaches have been used to
differentiate tissues within the same plant, and it thus illustrates the
resolving power of metabolomics.
Bioinformatics 19(17) © Oxford University Press 2003; all rights reserved.
METHODS
Chemicals
HPLC grade water and chloroform were obtained from J.T. Baker (Phillipsburg,
NJ). MSTFA was purchased from Pierce. Standard compounds used for custom GC/MS
library construction were purchased from Sigma–Aldrich, Fluka, and Supelco
(St. Louis, MO). All other chemicals were obtained from Sigma–Aldrich.
Plant growth and tissue collection
Medicago truncatula (cultivar Jemalong A17) seeds were planted in 6 in. pots
containing Scott's Metro Mix 350 (Marysville, OH) potting soil. Uninoculated
plants were grown in a controlled greenhouse environment for 82 days during
the months of November to February and maintained at an average temperature of
28 °C, 40% relative humidity, and a day length of 16 h.
Roots, stems and leaves from ten replicate plants were harvested, placed in
teflon-sealed glass tubes, and immediately frozen in liquid nitrogen. The
frozen tissue was lyophilized for 72 h and then placed in a −80 °C freezer
until extracted and derivatized. Single lateral stems from each of three
different replicate plants were harvested. Individual nodes from these stems
were dissected starting at the apical end and proceeding downward, resulting
in 11 individual nodal samples per plant composed of leaves and stem tissue.
Samples were frozen, lyophilized, and stored at −80 °C until extracted and
derivatized.
Extraction
All dry tissue was ground to a fine powder in the collection vials with a
glass rod. Approximately 6 mg of the dried, homogenized tissue was weighed
into 1-dram vials containing teflon inlays. Metabolite extraction was
performed by adding 1.5 ml chloroform, 1.5 ml bottled water and 3 µl internal
standard (ribitol 50 mg/ml) followed by vortexing for 1 min. The sample was
incubated at 50 °C for 2 h with shaking. The samples were then centrifuged in
a swinging bucket rotor at 2900 × g for 30 min. Aliquots of 1 ml were
collected from the polar layer and transferred to 2.0 ml autosampler vials
(Agilent, Palo Alto, CA) with teflon/silicon septa. The extract was dried in a
speed vac (Savant, Albertville, MN) for 2.5 h and then stored at −20 °C until
ready for analysis.
Preparation of polar extracts
Dried polar extracts were prepared by methoximation in 160 µl of 20 mg/ml
methoxyamine hydrochloride in pyridine at 50 °C for 2 h followed by a brief
sonication (<10 s) to dislodge any pellet. Methoximation derivatizes carbonyl
functional groups, prevents cyclization, and stabilizes carbonyl moieties in
the β-position of reducing sugars (Roessner et al., 2000; Schweer, 1982).
Remaining polar functional groups were further derivatized by adding 160 µl
MSTFA [N-Methyl-N-(trimethylsilyl)trifluoroacetamide] + 1% TMCS
(trimethylchlorosilane) followed by incubation at 50 °C for 30 min (Katona
et al., 1999; Roessner et al., 2000).
Instrumentation/tissue analysis
Derivatized metabolite mixtures were analyzed using a Hewlett Packard 6890 gas
chromatograph, 5973 mass selective detector, and 6890 series injector. The
integrated system was operated under Chemstation (Agilent, Palo Alto, CA).
Polar samples were analyzed by injecting 1 µl with a split injection ratio of
25:1. All samples were injected in triplicate. Analyses were performed on a
60 m DB-5MS column (J&W Scientific, Palo Alto, CA). Injection temperature was
maintained at 250 °C and the interface temperature was 280 °C. Separations
were achieved using a temperature program: 5 min isothermal heating at 50 °C,
followed by a 5 °C/min oven ramp to 315 °C, and holding at the final 315 °C
for 10 min. Mass spectra were recorded at 2.48 scans per second with a mass
scanning range of 50–650 m/z. Each analysis required approximately 1.4 h
including machine equilibration time.
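As a quick consistency check on the oven program described above, the program time can be worked out from the stated hold times and ramp rate. The Python sketch below is illustrative only; the function name is hypothetical and is not part of MSFACTs or of the original method.

```python
def gc_program_minutes(initial_hold_min, start_c, ramp_c_per_min,
                       final_c, final_hold_min):
    """Total oven-program time: initial hold + linear ramp + final hold."""
    ramp_min = (final_c - start_c) / ramp_c_per_min
    return initial_hold_min + ramp_min + final_hold_min


# Values taken from the temperature program in the text:
# 5 min at 50 C, 5 C/min ramp to 315 C, 10 min hold at 315 C.
program_min = gc_program_minutes(5, 50, 5, 315, 10)
print(f"oven program: {program_min:.0f} min ({program_min / 60:.2f} h)")
```

The oven program alone accounts for 68 min (5 + 53 + 10). Under that reading, the quoted figure of approximately 1.4 h per analysis would leave roughly a quarter of an hour for equilibration.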
DATA ANALYSIS
Following chromatographic separation and data acquisition, raw data files were
integrated using HP Chemstation or converted to ASCII text. Conversion of
Chemstation files to an ASCII file format can be performed using a macro
(provided with program) or commercial software such as MASSTransit data
converter (Palisade Corp., Newfield, NY). Peak list data were aligned,
reformatted, and exported using the RTAlign algorithm of MSFACTs. ASCII files
were extracted, aligned, reformatted, and exported using the RICExtract
algorithm of MSFACTs. Reformatted and exported files from MSFACTs were then
further processed and visualized using the principal component analysis (PCA)
software Pirouette v3.02 (Infometrix, Woodinville, WA).
Description and algorithms
MSFACTs is a standard Java/Swing application that imports, aligns, and
reformats spectral and chromatographic data using two applications: RTAlign
and RICExtract. MSFACTs accepts and converts integrated peak lists, composed
of chromatographic retention times and peak areas, using the tool entitled
RTAlign. Alternatively, raw spectral or chromatographic data exported as ASCII
formatted text can be processed via the RICExtract tool.
The automated alignment function of MSFACTs significantly enhances the speed
at which data can be processed and visualized. Output from the program is a
two-dimensional matrix consisting of tab-delimited ASCII text that can then be
easily processed by additional commercial software programs such as Excel
(Microsoft, Redmond, WA) or multivariate packages such as SAS (SAS Institute,
Cary, NC), Pirouette (Infometrix, Woodinville, WA) or MATLAB (Mathworks, Inc.,
Natick, MA). SAS, Pirouette and MATLAB provide data processing and
visualization features such as hierarchical cluster analysis (HCA) and
two-dimensional and three-dimensional (2D/3D) PCA. These statistical
processing and visualization tools allow for the determination of similarities
and differences between metabolomic data sets and for the identification of
individual components that are responsible for the differences observed in the
PCA. More complex processing such as self-organizing maps (SOMs) or neural
networks (NN) should also be possible but is not demonstrated here.
RTAlign
The RTAlign tool accepts integrated peak lists generated by spectrometric
software such as ChemStation (Agilent, Palo Alto, CA). Peak lists are composed
of a sequential listing of centroided peak retention times and corresponding
areas. The program then aligns multiple integrated peak lists from multiple
chromatographic analyses based upon a retention time interval defined by
users. Input can be individual files (one integrated chromatogram per file) or
batch files (multiple integrated chromatogram files) and formatting is
flexible to meet the requirements of multiple instruments. The graphical
interface of the RTAlign tool as well as representative input and output data
formats are illustrated in Figure 1.
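To make the input format concrete, the following Python sketch reads a batch of integrated peak lists laid out like the representative input in Figure 1 (a `TIC:` line naming the run, a sample comment line, a column header, then one row per peak). It is a minimal illustration, not the MSFACTs parser: the function name is hypothetical, and the seven-column row layout is assumed from the figure rather than from any ChemStation specification.

```python
from typing import Dict, List, Tuple


def parse_peak_lists(text: str) -> Dict[str, List[Tuple[float, float]]]:
    """Return {run_id: [(retention_time, area), ...]} from a peak-list report."""
    runs: Dict[str, List[Tuple[float, float]]] = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "TIC:" and len(fields) > 1:
            # A new run starts; its identifier follows the 'TIC:' tag.
            current = runs.setdefault(fields[1], [])
        elif current is not None and len(fields) == 7 and fields[0].isdigit():
            # Peak row: Peak#, Ret Time, Type, Width, Area, Start Time, End Time.
            current.append((float(fields[1]), float(fields[4])))
        # Comment lines ('@...') and the column header are skipped.
    return runs
```

Only the retention time and area are kept for each peak, since those are the two values the text says RTAlign captures.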
The input data for the RTAlign interface is a series of files, with each
individual file containing a list of peaks. For each peak, a retention time
and a peak area are captured. A presumption of RTAlign is that, for any two
files, there is a one-to-one correspondence among many of the peaks they
contain. However, due to small and common shifts in the retention times, the
peaks resulting from the same substance in different analyses may give rise to
significant variation in their reported retention times. Identifying these
peaks thus represents a traditional classification problem. To classify two
peaks as belonging to the same substance, a small retention time window is
used. Two peaks belonging to different runs are believed to be of the same
substance if the difference between their retention times is within this
window. A summary of the classification algorithm in MSFACTs is provided
below.
Classification
To classify the peaks, RTAlign uses the following steps:
1. A parser class collects all of the peak data and generates an unordered
list of peaks ($A_u$) that maintain their original run identifier.
2. During the process, the minimum and maximum retention times are recorded
and define a range.
3. The unordered list of peaks is sorted based on retention time in ascending
order ($A_s$). The actual sorting utilizes the algorithm provided by the Java
libraries, which is a QuickSort implementation.
$A_s \leftarrow \mathrm{QuickSort}(A_u)$
4. The ordered list is passed through a controller that groups the peaks into
clusters. Here a cluster is defined as a collection of peaks whose retention
times are very close to each other, as judged by a predefined retention time
interval or window. Within a cluster, the difference between the minimum
retention time ($L_t$) and the maximum retention time ($H_t$) is smaller than
or equal to the window size ($w$). When a subsequent peak has a retention time
greater than the minimum retention time of the current cluster plus the window
size, the current cluster ends and a new cluster begins.
The following pseudocode summarizes the clustering process.
    L_t ← A_s[1]
    H_t ← A_s[1] + w
    for i ← 1 to length[A_s]
        if L_t ≤ A_s[i] < H_t
            then add A_s[i] to the current cluster
            else create a new cluster
                 add A_s[i] to the new cluster
                 L_t ← A_s[i]
                 H_t ← A_s[i] + w
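The clustering pass above translates fairly directly into a short program. The Python sketch below is one possible rendering under stated assumptions: the `Peak` type and `cluster_peaks` name are hypothetical, and Python's built-in `sorted` (not a QuickSort) stands in for the Java library sort mentioned in step 3. It is not the MSFACTs source code.

```python
from typing import List, NamedTuple


class Peak(NamedTuple):
    run: str      # identifier of the run (file) the peak came from
    rt: float     # retention time in minutes
    area: float   # integrated peak area


def cluster_peaks(peaks: List[Peak], window: float) -> List[List[Peak]]:
    """Group peaks so that each cluster spans less than `window` minutes."""
    ordered = sorted(peaks, key=lambda p: p.rt)    # A_s <- sort(A_u)
    clusters: List[List[Peak]] = []
    upper = None                                   # H_t of the current cluster
    for peak in ordered:
        if upper is not None and peak.rt < upper:
            clusters[-1].append(peak)              # L_t <= rt < H_t
        else:
            clusters.append([peak])                # L_t <- rt
            upper = peak.rt + window               # H_t <- L_t + w
    return clusters
```

Because the list is sorted first, the lower bound test `L_t <= rt` always holds inside the loop, so only the upper bound needs to be checked. Each cluster then becomes one column of the output matrix.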
Collision resolution
Although the program allows changing the size of the retention time window
used for classification, no single window size can be both large enough to
account for the variation due to retention time shifts and small enough to
achieve total separation of consecutive peaks within a single run. A collision
occurs when more than one peak ($P_t$) from the same run ($R_x$) is determined
to be in the same cluster.
$$L_t \le R_x P_t;\quad R_x P_{t'} < H_t \;\Rightarrow\; \text{collision}$$
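The collision condition can be checked mechanically once peaks have been grouped into clusters: a cluster collides if any run contributes more than one peak to it. The Python sketch below illustrates that test; the function name and the `(run, retention_time)` tuple layout are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def find_collisions(clusters: List[List[Tuple[str, float]]]) -> Dict[int, List[str]]:
    """Map cluster index -> runs that placed more than one peak in that cluster."""
    collisions: Dict[int, List[str]] = {}
    for index, cluster in enumerate(clusters):
        per_run = Counter(run for run, _rt in cluster)
        repeated = sorted(run for run, count in per_run.items() if count > 1)
        if repeated:
            collisions[index] = repeated
    return collisions
```

For example, a cluster holding peaks at 19.529 and 19.583 min from the same run would be reported as a collision, while a cluster with one peak per run would not.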
Collisions are unavoidable in larger data sets due to limitations of the
classification process. If collisions are ignored, MSFACTs simply exports the
lowest value in the colliding cell; however, this results in poorer data
alignment. To improve the resolution of downstream processing, it is highly
desirable to resolve collisions. RTAlign provides automated facilities to help
resolve collisions and improve subsequent data processing. These collision
resolution facilities and approaches are discussed below.
Fig. 1, panel (B): representative input, integrated peak lists (File#1 to File#3)
TIC: 05020201.D
@6.03mg root 1ml polar 160/160
Peak#  Ret Time  Type  Width  Area     Start Time  End Time
1      19.197    PV    0.043  451979   19.155      19.274
2      19.531    VV    0.034  1396954  19.462      19.566
3      19.585    VV    0.030  250189   19.566      19.623
4      19.803    VV    0.035  294593   19.773      19.887
5      20.087    VV    0.036  226863   20.041      20.127
TIC: 05020202.D
@6.03mg root 1ml polar 160/160
Peak#  Ret Time  Type  Width  Area     Start Time  End Time
1      19.194    BV    0.043  497196   19.105      19.278
2      19.530    VV    0.035  1361651  19.463      19.565
3      19.584    VV    0.031  237827   19.565      19.622
4      19.802    VV    0.043  367820   19.728      19.884
5      20.084    VV    0.042  297422   20.052      20.177
TIC: 05020203.D
@6.03mg root 1ml polar 160/160
Peak#  Ret Time  Type  Width  Area     Start Time  End Time
1      19.193    VV    0.043  481056   19.151      19.267
2      19.529    VV    0.034  1382929  19.468      19.566
3      19.583    VV    0.031  227375   19.566      19.615
4      19.805    VV    0.044  357742   19.726      19.888
Fig. 1, panel (C): representative output, aligned matrix
High           19.197  19.531   19.585  19.805  20.088
Low            19.192  19.527   19.579  19.8    20.084
05020201.D(R)  19.197  19.531   19.585  19.803  20.087
05020201.D(A)  451979  1396954  250189  294593  226863
05020202.D(R)  19.194  19.53    19.584  19.802  20.084
05020202.D(A)  497196  1361651  237827  367820  297422
05020203.D(R)  19.193  19.529   19.583  19.805  20.087
05020203.D(A)  481056  1382929  227375  357742  293066
05020204.D(R)  19.192  19.528   19.582  19.804  20.084
05020204.D(A)  454409  1366355  229629  340992  313803
05020205.D(R)  19.192  19.529   19.584  19.802  20.087
05020205.D(A)  485252  1363662  193585  349421  277428
05020206.D(R)  19.192  19.527   19.579  19.803  20.086
05020206.D(A)  468929  1355906  215017  330590  205718
05020207.D(R)  19.195  19.529   19.584  19.801  20.084
05020207.D(A)  462860  1410837  213130  354511  210455
05020208.D(R)  19.193  19.528   19.585  19.801  20.087
05020208.D(A)  479074  1414008  216767  376911  267039
05020209.D(R)  19.193  19.528   19.582  19.8    20.088
05020209.D(A)  450822  1416162  210003  308525  263736
05020210.D(R)  19.193  19.529   19.584  19.803  20.086
05020210.D(A)  475325  1358657  177612  364250  269907
Fig. 1. (A) Graphical interface of MSFACTs RTAlign tool illustrating user
input and output options as well as user definable parameters such as
retention time interval, collision handling, output orientation and formation,
field marking variables, and a cluster cutoff. Representative (B) input and
(C) output data formats illustrate the alignment and reformatting performed
simultaneously on multiple files by MSFACTs. Output file information contains
rows for both retention time, e.g. 05020201.D(R), and area, e.g.
05020201.D(A), for each aligned data file.
First, the processes of parsing and classification are fast enough to allow
experimenting with different window sizes before a final interval is selected.
Window size can also be approximated through statistical analysis of
representative data. For example, 267 consecutive GC/MS analyses acquired over
17 days in a separate experiment contained approximately 32 270 data points
that yielded an average standard deviation (σ) in retention time of 0.0146 min
for all peaks. Applying a confidence level of 99.7%, we can statistically
assume that 99.7% of measurements would be within ±3.0σ (a total width of
0.0877 min) and set our window accordingly to 0.0877 min. Similarly, 95%
confidence levels would be set at ±1.96σ and 99.9% would be set at ±3.29σ
based on user preference. The user should be aware that as the window size
increases so does the frequency of collisions. Collision frequencies for the
above mentioned data set were 0.14% at the 95% confidence level window setting
and 0.75% at the 99.7% confidence level window setting. Most collisions that
remain after this step generally consist of two closely adjacent peaks.
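The statistical window estimate described above is simple arithmetic and can be reproduced as follows. This Python sketch is illustrative: the function names are hypothetical, the multipliers are the ones quoted in the text, and reading the quoted 0.0877 min as the full ±3σ width (rather than a half-width) is an interpretation of the passage.

```python
from statistics import mean, stdev

# Multipliers quoted in the text for each confidence level.
Z_MULTIPLIER = {0.95: 1.96, 0.997: 3.0, 0.999: 3.29}


def average_sigma(rt_groups):
    """Mean of the per-compound sample standard deviations of retention time."""
    return mean(stdev(group) for group in rt_groups if len(group) > 1)


def window_width(sigma, confidence):
    """Full window width covering +/- z * sigma around a peak."""
    return 2 * Z_MULTIPLIER[confidence] * sigma


sigma = 0.0146  # average standard deviation in minutes, from the text
for level in (0.95, 0.997, 0.999):
    print(f"{level:.1%}: {window_width(sigma, level):.4f} min")
```

With the rounded σ of 0.0146 min, the 99.7% window evaluates to 0.0876 min; the 0.0877 min quoted in the text presumably reflects the unrounded standard deviation.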
Second, since most of the subsequent data analyses make use of
spreadsheet-like utilities, RTAlign tries to minimize the occurrence of
multiple peaks in the same cell. This is achieved in one of the following two
ways:
Forced-fit: for each collision cell, the program first checks its left
neighbor and determines if it is occupied. If not, it will move the colliding
peak with the smallest retention time to the left cell. If the left neighbor
already contains a peak, or the current collision cell still has more than one
peak, the program will then move the peak with the largest retention time to
the right cell if that cell is unoccupied. If both neighbors are already
occupied, no forced-fit will occur.
Cluster-split: an alternative way of reducing the number of collision cells is
to split collision clusters. The original cluster is split into two at the
middle of the original cluster retention time range. All peaks within the
original clusters are then re-classified based on the new cluster boundaries.
The cluster-split is our preferred method of collision resolution in
metabolomic projects as we believe it provides the best representation of the
data in highly complex mixture analyses such as those encountered during
metabolome analyses.
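The two strategies above can be sketched in a few lines of Python. This is an interpretation rather than the MSFACTs implementation: the function names and data layouts are hypothetical, the cluster-split midpoint is taken from the observed minimum and maximum retention times (the text does not say whether the observed range or the window bounds are used), and forced-fit is shown for the cells of a single run.

```python
from typing import List, Tuple

Peak = Tuple[str, float]  # (run identifier, retention time in minutes)


def split_cluster(cluster: List[Peak]) -> Tuple[List[Peak], List[Peak]]:
    """Cluster-split: divide a cluster at the midpoint of its retention time range."""
    times = [rt for _run, rt in cluster]
    midpoint = (min(times) + max(times)) / 2.0
    lower = [peak for peak in cluster if peak[1] < midpoint]
    upper = [peak for peak in cluster if peak[1] >= midpoint]
    return lower, upper


def forced_fit(row: List[List[float]]) -> List[List[float]]:
    """Forced-fit for one run; row[j] holds that run's retention times in cluster j."""
    cells = [sorted(cell) for cell in row]
    for j, cell in enumerate(cells):
        if len(cell) < 2:
            continue                               # not a collision cell
        if j > 0 and not cells[j - 1]:
            cells[j - 1].append(cell.pop(0))       # smallest RT moves left
        if len(cell) > 1 and j + 1 < len(cells) and not cells[j + 1]:
            cells[j + 1].append(cell.pop())        # largest RT moves right
    return cells
```

Forced-fit only helps when a neighboring cell happens to be empty, whereas cluster-split changes the cluster boundaries for every run at once, which is consistent with the authors' stated preference for it in complex mixtures.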
Lastly, the program allows for the marking of collision cells. These cells can
then be resolved through subsequent manual editing in spreadsheet programs.
This is the most accurate method of collision resolution, but also the most
labor and time intensive. The embedded marks also make the cells easy to
process automatically using macros or other tools.
MSFACTs includes user-selectable options to allow customization of data
processing and output. These include optional horizontal or vertical
organization of output data and automated splitting of large lists to fit
within the cell limitations of some commercial spreadsheets. MSFACTs also
provides for threshold filtering or setting of a minimal number of filled
cells required for clustering, i.e. the 'Cluster cutoff' option of RTAlign.
Caution is advised in using the cluster cutoff filter as some cells may be
unfilled due to real biological differences, poor chromatographic resolution,
or poor alignment. An optional mark-up feature allows special tagging of data
irregularities, which can be readily exploited by macro routines to further
assist in manual validation.
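A cluster cutoff of this kind amounts to dropping columns of the aligned matrix that have too few filled cells. The Python sketch below shows the idea; the function name, the dictionary layout, and the use of `None` for an unfilled cell are assumptions for illustration and do not describe the MSFACTs file format.

```python
from typing import Dict, List, Optional


def apply_cluster_cutoff(matrix: Dict[str, List[Optional[float]]],
                         min_filled: int) -> Dict[str, List[Optional[float]]]:
    """Keep only clusters (columns) with at least `min_filled` non-empty cells."""
    return {
        label: cells
        for label, cells in matrix.items()
        if sum(value is not None for value in cells) >= min_filled
    }
```

As the text cautions, a cluster that is empty in most runs may still reflect a real biological difference, so a low `min_filled` value is the conservative choice.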
The processing of integrated and identified peak lists is advantageous because
further processing such as PCA directly links cluster differentiation to the
specific chemically identified peaks that give rise to it, i.e. through the
loading plots for PCA. Although processing integrated peak lists is
advantageous, it can be time consuming and is dependent upon chromatographic
integration parameters and processing. To allow more rapid but less
informative screening, an additional tool, RICExtract, was developed to allow
independent processing of non-integrated or metabolic fingerprint data (Fiehn
et al., 2000; Sumner et al., 2003).
RICExtract
RICExtract allows for alignment of raw chromatographic or spectral data. The
output can be used for processing and visualization as described above, and
this process is commonly referred to as 'binning'. This approach does not
require peak detection and therefore is faster and provides full
representation of the data. The graphical interface of the RICExtract tool as
well as representative input and output data formats are illustrated in
Figure 2.
The RICExtract tool was created to extract total or reconstructed ion
chromatographic (RIC) information from GC/MS or LC/MS files that have been
exported as ASCII text files. Most commercial instruments are capable of this
file conversion and a macro for Chemstation is provided with the program.
Additional commercial software packages such as MASSTransit (Palisade,
Newfield, NY) are available for conversion of most commercial MS file formats.
Each data point contains a retention time and a scan intensity count. All RIC
counts from different runs are then realigned as a series of retention times,
which were from the first file processed. The extraction of data involves
straightforward parsing and extracting, and the alignment is purely based on
retention time. Unlike RTAlign, there are no classification or clustering
steps in this part. Output from RICExtract is a single file of tab-delimited,
columnar data. The first column contains the retention time of the first file
and all other columns contain corresponding scan intensity counts from
subsequent chromatographic files (see Fig. 2). The overall size of
chromatographic files is reduced by eliminating mass spectral data.
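The extraction and alignment just described can be sketched as follows, using the `Scan` / `RIC` / `RetTime` layout shown in the representative input of Figure 2. This Python example is a sketch under assumptions: the function names are hypothetical, retention times are kept in the raw units of the file, and matching each reference scan to the nearest scan in the other files is one plausible reading of "purely based on retention time", not a confirmed description of RICExtract.

```python
import bisect
from typing import List, Tuple

Scan = Tuple[int, int]  # (retention time in file units, RIC count)


def parse_ric(text: str) -> List[Scan]:
    """Extract (RetTime, RIC) per scan, discarding the m/z,intensity pairs."""
    scans: List[dict] = []
    for line in text.splitlines():
        key, sep, value = line.strip().partition(",")
        if not sep:
            continue
        if key == "Scan":
            scans.append({})                       # start a new scan record
        elif key in ("RIC", "RetTime") and scans:
            scans[-1][key] = int(value)
    return [(scan["RetTime"], scan["RIC"]) for scan in scans]


def align_ric(files: List[List[Scan]]) -> List[List[int]]:
    """Rows of [ret_time, ric_file1, ric_file2, ...] keyed to the first file."""
    reference, others = files[0], files[1:]
    times = [[t for t, _ric in other] for other in others]
    rows = []
    for rt, ric in reference:
        row = [rt, ric]
        for other, other_times in zip(others, times):
            k = bisect.bisect_left(other_times, rt)
            # Pick whichever neighbouring scan is closer in retention time.
            candidates = [c for c in (k - 1, k) if 0 <= c < len(other)]
            best = min(candidates, key=lambda c: abs(other_times[c] - rt))
            row.append(other[best][1])
        rows.append(row)
    return rows
```

Each row of the result corresponds to one line of the tab-delimited output: the retention time from the first file followed by one intensity column per file. Because no retention time correction is applied, small shifts between runs carry through to the matrix.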
The primary benefit of the RICExtract tool is that it requires less user
intervention than RTAlign, is much faster, and allows one to rapidly and
efficiently screen samples to look for discriminating chromatographic regions.
However, it only directs the user back to a specific region of the original
chromatogram/spectra. The user must then determine the chemical component
responsible for the differences, whereas RTAlign provides a compound
identifier that is either a chemical name or less specific peak number. RIC
extraction is also advantageous when using Pirouette software because baseline
drift can be corrected using the correction algorithm incorporated into this
commercial software. RICExtract also avoids the inconsistencies associated
with data integration and peak detection algorithms required to use RTAlign.
Unfortunately, RICExtract generally yields lower resolution clustering than
RTAlign due to the uncompensated slight shifts in retention times. Future
versions of this program will include retention time shift corrections;
however, this can currently be performed using a program based on correlation
optimization warping (COW) (Nielsen et al., 1998) freely available at
(http://www.biocentrum.dtu.dk/mycology/analysis/cow/#cowtool).
RESULTS
GC/MS metabolite profiles of M. truncatula tissues were obtained for
dissimilar tissues including roots, stems, and leaves. Representative GC/MS
traces are provided in Figure 3. GC/MS metabolite profiles were also collected
for highly similar M. truncatula tissues including sequential nodal segments
from individual stem/aerial tissues. MSFACTs was used to process GC/MS data
for the comparison of the various M. truncatula tissues. Root, stem, and leaf
tissue data consisted of 30 chromatographic analyses (3 tissues × 10 replicate
plants) that were processed using both RTAlign and RICExtract. Portions of the
input and output formats are provided in Figures 1 and 2. PCA was performed on
the output data provided by MSFACTs and is reported in Figure 4. Both RTAlign
and RICExtract methods of processing provide very similar PCA plots. PCA
results indicate that the metabolite compositions of stems and leaves are more
similar to each other than to roots, as expected since the aerial tissues are
photosynthetic 'sources', while the roots are non-photosynthetic 'sinks'. The
highly dissimilar tissues could be easily differentiated using only the first
two principal components (PC) that represent 92% of the variability (using
RTAlign). The third PC (data not shown) at 1.48% is much lower than the first
two and provides a valuable, visible measure of the instrumentation and
processing analytical variability. The total time required for the processing
of the 30 chromatographic analyses using both tools was less than 30 min,
compared with the 6–8 h required for the traditional manual editing and
spreadsheet processing of all chromatograms.
Fig. 2, panel (B): representative RIC input (File#1, RICLeaf1.txt)
Scan,1
RIC,52837
RetTime,665587
51,4660
52,8924
53,1009
Scan,2
RIC,50928
RetTime,665994
51,4078
52,8350
53,1106
Scan,3
RIC,51385
RetTime,666401
51,4385
52,8072
53,1129
Fig. 2, panel (C): representative output (File#1 to File#6)
Ret Time  RICLeaf1.txt  RICLeaf2.txt  RICLeaf3.txt  RICLeaf4.txt  RICLeaf5.txt  RICLeaf6.txt
11.089    52837         56985         50446         51467         52224         54781
11.096    50928         53863         50468         50900         50046         54573
11.103    51385         55372         53003         53881         51282         55422
11.109    51017         57873         53722         54101         51240         56589
11.116    54531         55318         54050         54930         53263         57381
11.123    54588         57490         55045         54685         53973         60421
11.130    55364         57781         55291         56980         56145         58902
11.136    57159         59638         56180         55531         57689         57643
11.143    56476         61060         55561         54921         55547         57343
Fig. 2. (A) Graphical interface of MSFACTs RICExtract illustrating input and
output parameters. Representative ASCII input file format (B) from a single
file and (C) output data format of multiple aligned and formatted files
performed by RICExtract.
Fig. 3. Example GC/MS data obtained for Medicago truncatula root, stems and
leaves. Data from these analyses were further aligned and reformatted using
MSFACTs. [Three chromatogram panels: Roots, Stems, Leaves; x-axis Time (min),
20.00 to 65.00; y-axis Abundance, 0 to 700000.]
Analyses and comparison of nodal sections from individual aerial tissues from
three separate M. truncatula plants were also performed. The spatially
dissected nodal tissue represents tissue that progressively varies in age. The
resultant 3D-PCA plot of these data is provided in Figure 5. The numbered
nodal sections (similar relative node location) progressively group in PCA
space in accordance with nodal age. The distance between groups (A, B and C)
represents the level of biological (plant-to-plant) variation. Further, the
distance between replicates is an indicator of instrumentation and processing
analytical variability. The data clearly show the progressive differentiation
of sequential nodal tissue and illustrate the high resolving power of
metabolomics to spatially resolve highly similar tissues within a single
plant. The data also suggest that the current resolving power is on the same
order of magnitude as biological variance, and therefore is approaching the
limitations imposed by biological variation.
Fig. 4, panel (A): RTAlign, 153 peaks (RTAlign RSL log-trans); clusters labelled Leaves, Stems, Roots
         Variance      Percent    Cumulative
Factor1  74366.070312  67.404488  67.404488
Factor2  27679.962891  25.088776  92.493263
Factor3  1631.000732   1.478319   93.971581
Factor4  1182.773193   1.072051   95.043633
Factor5  1072.272949   0.971895   96.015526
Fig. 4, panel (B): RICExtract, 4570 data points (RICExtract RSL, baseline corrected, log-transformed); clusters labelled Leaves, Stems, Roots
         Variance      Percent    Cumulative %
Factor1  11358.095703  68.771782  68.771782
Factor2  1218.257324   7.376388   76.148170
Factor3  350.577942    2.122703   78.270874
Factor4  227.995636    1.380484   79.651360
Factor5  192.118301    1.163251   80.814613
Fig. 4. Two-dimensional principal component analysis (2D-PCA) of M. truncatula
roots, stems, and leaves. The data were generated using (A) RTAlign and (B)
RICExtract to format data that was then visualized with PCA (Pirouette). The
2D-PCA are of log-transformed data (Tabachnick and Fidell, 2001) and show
strong differentiation of the tissues. Tabularized values are also provided
for the various principal components associated with each analysis. Note that
the scale of differentiation of the PCA clusters using (A) RTAlign is
significantly greater than with (B) RICExtract, suggesting that processing
aligned peak list data provides greater differentiation.
DISCUSSION
MSFACTs allows the alignment of large numbers of chroma-
tograms according to a variable (time in GC/MS,but could
also be nm,ppmor Hz fromother spectroscopic data).Align-
ment is the columnization of corresponding data contained
2290
MSFACTs
Factor1 48140.980469 19.134521 19.134521
Factor2 34049.117188 13.533450 32.667973
Factor3 22483.578125 8.936513 41.604485
Factor4 14976.784180 5.952799 47.557285
Factor5 10073.452148 4.003880 51.561165
Fig.5.Three-dimensional PCA of the Þrst to eleventh internodes of M.truncatula aerial tissues of triplicate plants composed of stems
and leaves from triplicate plants using RTAlign.The nodal segments represent tissues of incremental age.The 3D-PCA shows not only
progressive segregation of the nodal segments in accordance with age but also that this segregation is on the same order as biological variance.
The nomenclature used is individual plant identiÞer (A,B or C),followed by the internode number,and then the analytical replicate number
such that B4Ð2 represents the second analytical replicate of the fourth internode of plant B.
within separate data Þles that have inherent variability over
time. This process generates a correlated, two-dimensional
matrix that can be further compared using statistical packages,
unsupervised methods such as HCA and PCA, or supervised
methods such as SOMs/NNs. For metabolomic data sets,
retention times may originate from integrated peak lists or
raw (X,Y) chromatographic data. The data are placed into
‘bins’ based on a user-specified time window. Once the data
are aligned, they are exported in a format readily usable
by many different statistical packages. Differences and
similarities between the large data sets (>1000s of GC/MS data
files) can then be determined through statistical processing
using EXCEL, SAS, Pirouette, or MATLAB. MSFACTs is
composed of two processing tools: RTAlign and
RICExtract. If speed is desired, raw chromatograms can be
processed without peak detection using RICExtract, through
the ‘binning’ of the ion chromatogram and alignment of these
data. If maximum biochemical information is desired, peak
integration and identification can be performed using
commercial programs before alignment and reformatting with
RTAlign. Prior to development of this tool, all chromatograms
had to be aligned by hand using spreadsheet-based programs,
which required extensive time. Although MSFACTs is a simple
tool, it has very high value because it reduces processing
time by two orders of magnitude and
makes it possible to process large data sets (e.g. 2000 GC/MS
analyses and peak lists containing 500 000 individual
measurements of metabolite concentration; separate experiment
and data not shown).
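The peak-list alignment described above can be illustrated with a minimal Python sketch. This is not the RTAlign implementation: it assumes a fixed grid of bins of width `window`, sums the peak areas that fall into each bin, and uses hypothetical names (`align_peak_lists`, `wild_type`, `mutant`) and made-up peak values.

```python
from collections import defaultdict


def align_peak_lists(peak_lists, window):
    """Columnize peak lists from several files into one matrix.

    peak_lists maps a file name to a list of (retention_time, area)
    pairs; window is the bin width in the same units as the retention
    times. Returns (bin_starts, matrix), where matrix[file_name] holds
    one summed area per bin and 0.0 where a file has no peak.
    """
    binned = {}
    used_bins = set()
    for name, peaks in peak_lists.items():
        row = defaultdict(float)
        for retention_time, area in peaks:
            index = int(retention_time // window)  # bin on a fixed grid
            row[index] += area
            used_bins.add(index)
        binned[name] = row

    columns = sorted(used_bins)
    bin_starts = [index * window for index in columns]
    matrix = {
        name: [row.get(index, 0.0) for index in columns]
        for name, row in binned.items()
    }
    return bin_starts, matrix


# Hypothetical peak lists: retention time in minutes, integrated area.
peak_lists = {
    "wild_type": [(10.02, 500.0), (12.31, 120.0)],
    "mutant": [(10.07, 450.0), (14.60, 80.0)],
}
bin_starts, matrix = align_peak_lists(peak_lists, window=0.5)
print(bin_starts)           # [10.0, 12.0, 14.5]
print(matrix["wild_type"])  # [500.0, 120.0, 0.0]
print(matrix["mutant"])     # [450.0, 0.0, 80.0]
```

With a fixed grid, a peak that drifts across a bin boundary between runs lands in a neighbouring column, so the window has to be chosen against the retention-time variability of the data. The resulting rows-by-bins matrix is the kind of table that can be passed on to a statistics package.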
To the best of our knowledge there are no other tools that
have the same functionality as MSFACTs; however, we feel it
important to differentiate our program from other commonly
used programs. Automated Mass Spectral Deconvolution and
Identification Software (AMDIS) is a powerful algorithm for
the deconvolution and subsequent identification of eluting
peaks based on resultant mass spectra (Dromey et al., 1976;
Hargrove et al., 1981; Pool et al., 1997; Stein, 1999).
Unfortunately, AMDIS does not perform comparisons of data files for
similarities or differences once peak detection and
identification have been performed, e.g. how does the GC/MS analysis
of a mutant plant chemically differ from that of a wild-type
plant? The component detection algorithm (CODA) marketed
by Advanced Chemistry Development is a program focused on
noise reduction in individual LC/MS files, but again does not
perform any comparative function between data files. Further,
CODA and AMDIS are processing techniques that can be
performed pre/post data alignment to enhance the value of
metabolic data sets.
The alignment or ‘binning’ of related data for processing
is performed by MSFACTs, followed by statistical comparisons
of data files using EXCEL, SAS, MATLAB or Pirouette.
We know of two software programs that have similar but
only partial functionality to MSFACTs. One is a proprietary
package available from Bruker entitled AMIX (analysis
of mixtures; http://www.bruker-biospin.de/NMR/nmrsoftw/prodinfo/nmr_suit/amix/index.html)
that will similarly bin
NMR data for statistical analysis. Unfortunately, this is
a commercial program that is both platform and NMR
specific. MSFACTs is more versatile and will be made
available at no cost to non-commercial entities. Commercial
use is available through a licensing agreement
(http://www.noble.org/PlantBio/MS/MSFACTs/MSFACTs.html). It
may be possible to achieve the same functionality using macros
written for common platforms such as HP Chemstation;
however, by doing so, one is then committed to a specific data
platform. In comparison, MSFACTs will accept data from
multiple platforms including our primary tool, mass
spectrometry (MS), but also UV, IR and NMR (i.e. any data source
capable of exporting ASCII data), making it a much more
versatile tool.
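To illustrate why ASCII export is enough to make a data source usable, the following Python sketch reads generic two-column (X,Y) text and bins the Y values along X, whatever the X unit is (min, nm, ppm or Hz). It is an assumption-laden illustration rather than the RICExtract code: the function names, the accepted separators and the sample text are hypothetical.

```python
def read_xy(text):
    """Parse two-column ASCII (x, y) data.

    Columns may be separated by spaces, tabs or commas. Blank lines and
    lines whose first two fields are not numbers (e.g. headers) are skipped.
    """
    points = []
    for line in text.splitlines():
        fields = line.replace(",", " ").split()
        if len(fields) < 2:
            continue
        try:
            points.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue  # non-numeric line, treated as a header
    return points


def bin_trace(points, start, width, n_bins):
    """Sum y values into n_bins consecutive bins of the given width.

    Using the same start, width and n_bins for every file gives rows
    whose columns correspond, so files can be compared directly.
    """
    totals = [0.0] * n_bins
    for x, y in points:
        index = int((x - start) // width)
        if 0 <= index < n_bins:  # ignore points outside the grid
            totals[index] += y
    return totals


# Hypothetical export with a header line and mixed separators.
sample_text = "ppm,intensity\n0.10,5\n0.40,7\n0.60 2\n1.20\t9\n"
points = read_xy(sample_text)
print(bin_trace(points, start=0.0, width=0.5, n_bins=3))  # [12.0, 2.0, 9.0]
```

Because nothing in the sketch depends on the meaning of the X axis, the same two functions apply to a chromatogram, a UV or IR spectrum, or an NMR trace once each has been exported as text.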
We stress that the primary focus of this report is not the
simple algorithms but the enhancement in data processing
speed, which is paramount to the progress of metabolomics. The
enhancement in data processing helps advance the infant field
of metabolomics by enabling throughput that is more
consistent with ‘omic’ approaches. The desperate need for
post-acquisition data processing was emphasized as a major
concern of the 1st International Congress on Plant Metabolomics
held in April of 2002 in The Netherlands (Hall et al., 2002) and
in other recent publications (Mendes, 2002; Sumner et al., 2003).
Conclusions
MSFACTs is a fast and efficient system for extraction,
alignment, and organization of data from a wide variety of analyses.
The output of this program is amenable to rapid visualization
and comparison of metabolomic data. The software has
an open architecture that will allow incorporation of future
tools and algorithms for continued enhancement of the
program’s utility. Additional information on the functionality and
operation of MSFACTs is provided in the program’s internal
help file.
ACKNOWLEDGEMENTS
This work was funded by The Samuel Roberts Noble
Foundation.
REFERENCES
Dromey, R.G., Stefik, M.J., Rindfleisch, T.C. and Duffield, A.M.
(1976) Extraction of mass spectra free of background and
neighboring component contributions from gas chromatography/mass
spectrometry data. Anal. Chem., 48, 1368–1375.
Fiehn, O., Kopka, J., Dormann, P., Altmann, T., Trethewey, R.N. and
Willmitzer, L. (2000) Metabolite profiling for plant functional
genomics. Nat. Biotechnol., 18, 1157–1161.
Hall, R., Beale, M., Fiehn, O., Hardy, N., Sumner, L. and Bino, R.
(2002) Plant metabolomics as the missing link in functional
genomics strategies. Plant Cell, 14, 1437–1440.
Hargrove, W.F., Rosenthal, D. and Cooley, P.C. (1981) Improvement
of algorithm for peak detection in automatic gas chromatography–
mass spectrometry. Anal. Chem., 53, 538–539.
Huhman, D.V. and Sumner, L.W. (2002) Metabolic profiling of
saponins in Medicago sativa and Medicago truncatula using
HPLC coupled to an electrospray ion-trap mass spectrometer.
Phytochemistry, 59, 347–360.
Ideker, T., Galitski, T. and Hood, L. (2001a) A new approach to
decoding life: systems biology. Annu. Rev. Genomics Hum. Genet., 2,
343–372.
Ideker, T., Thorsson, V., Ranish, J.A., Christmas, R., Buhler, J.,
Eng, J.K., Bumgarner, R., Goodlett, D.R., Aebersold, R. and
Hood, L. (2001b) Integrated genomic and proteomics analyses
of a systematically perturbed metabolic network. Science, 292,
929–934.
Katona, Z.F., Sass, P. and Molnar-Perl, I. (1999) Simultaneous
determination of sugars, sugar alcohols, acids, and amino
acids in apricots by gas chromatography–mass spectrometry.
J. Chromatogr. A, 847, 91–102.
Mendes, P. (2002) Emerging bioinformatics for the metabolome.
Brief. Bioinform., 3, 134–145.
Nielsen, N.-P.V., Carstensen, J.M. and Smedsgaard, J. (1998)
Aligning of single and multiple wavelength chromatographic
profiles for chemometric data analysis using
correlation optimized warping. J. Chromatogr. A, 805,
17–35.
Pool, W.G., Leeuw, J.W. and Van de Graaf, B.J. (1997) Automated
extraction of pure mass spectra from gas
chromatographic/mass spectrometric data. J. Mass Spectrom., 32,
438–443.
Roessner, U., Luedemann, A., Brust, D., Fiehn, O., Linke, T.,
Willmitzer, L. and Fernie, A.R. (2001) Metabolic profiling allows
comprehensive phenotyping of genetically or environmentally
modified plant systems. Plant Cell, 13, 11–29.
Roessner, U., Wagner, C., Kopka, J., Trethewey, R.N. and
Willmitzer, L. (2000) Simultaneous analysis of metabolites
in potato tuber by gas chromatography–mass spectrometry. The
Plant J., 23, 131–142.
Schweer, H. (1982) Gas chromatography–mass spectrometry
of aldoses as o-methoxime, o-2-methyl-propoxime and
o-n-butoxime pertrifluoroacetyl derivative on OV-225 with
methyl propane as ionization agent. J. Chromatogr., 236,
355–360.
Stein, S.E. (1999) An integrated method for spectrum extraction and
compound identification from GCMS data. J. Am. Soc. Mass
Spectrometry, 10, 770–781.
Sumner, L.W., Duran, A.L., Huhman, D.H. and Smith, J.T. (2002)
Metabolomics: a developing and integral component in functional
genomic studies of Medicago truncatula. In Romeo, J.T.
and Dixon, R.A. (eds.), Recent Advances in Phytochemistry,
Pergamon, Oxford, UK.
Sumner, L.W., Mendes, P. and Dixon, R.A. (2003) Plant
metabolomics: large-scale phytochemistry in the functional genomics era.
Phytochemistry, 62, 817–836.
Tabachnick, B.G. and Fidell, L.S. (2001) Using Multivariate
Statistics, 4th edn. Allyn & Bacon, Needham Heights, MA,
pp. 80–85.
Trethewey, R.N. (2001) Gene discovery via metabolic profiling. Curr.
Opin. Biotechnol., 12, 135–138.
Trethewey, R.N., Krotzky, A.J. and Willmitzer, L. (1999) Metabolic
profiling: a Rosetta Stone for genomics? Curr. Opin. Plant Biol.,
2, 83–85.