Previous Work in Heterogeneous Data Integration






As mentioned in the main text, related prior work in the area of heterogeneous data integration falls into two categories: methodological precursors involving naive Bayesian classifiers and biological precursors performing data integration for simpler organisms. Naive Bayesian classifiers are themselves quite well-studied and robust (see Mitchell 1997 and Rish 2001 for reviews), and their applications for data integration in related biological areas have been mainly in the analysis of protein-protein interaction (PPI) data.
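
To make the naive Bayesian integration idea concrete, the following is a minimal Python sketch of how evidence from several discretized datasets can be combined into one posterior probability of a functional relationship for a gene pair. The function name, the two example datasets, and every probability value are hypothetical and chosen only for illustration; they are not taken from any of the systems discussed here.

```python
def naive_bayes_posterior(prior, likelihoods, observations):
    """Posterior P(related | observations) for a single gene pair.

    prior: P(related) before any data are seen.
    likelihoods: one dict per dataset, mapping a discretized value to the
        pair (P(value | related), P(value | unrelated)).
    observations: the discretized value seen in each dataset, or None when
        the dataset has no measurement for this pair.
    """
    p_related, p_unrelated = prior, 1.0 - prior
    for table, value in zip(likelihoods, observations):
        if value is None:
            continue  # a missing measurement contributes no evidence
        like_related, like_unrelated = table[value]
        # Naive assumption: datasets are independent given the class.
        p_related *= like_related
        p_unrelated *= like_unrelated
    return p_related / (p_related + p_unrelated)


# Hypothetical conditional probability tables for two datasets.
coexpression = {"low": (0.2, 0.6), "high": (0.8, 0.4)}
physical = {"no": (0.7, 0.99), "yes": (0.3, 0.01)}

posterior = naive_bayes_posterior(0.05, [coexpression, physical], ["high", "yes"])
print(round(posterior, 3))
```

Because each dataset enters only through its own likelihood table, adding a further dataset means adding one more table rather than refitting a joint model, which is one reason this family of classifiers is attractive for large integrations.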

Beyond the biological and computational challenges inherent in integrating large heterogeneous genomic data collections, a major contribution of HEFalMp not addressed by any previous system is its summarization of data as systems-level functional maps. Previous data integration methods generally provided biomolecular interaction networks as their end product; HEFalMp includes such functional relationship networks, but also provides further analysis in the form of functional maps. These represent a uniform framework in which the millions of edges in such networks can be further summarized in a biologically informative way, yielding data-driven interactions between pathways, diseases, and (as future work) tissue types and developmental stages.
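
As a rough illustration of summarizing gene-level edges at the level of gene sets, the Python sketch below scores the association between two pathways as the mean weight of the network edges that span them. This statistic is an assumption made for the example: this section does not specify how HEFalMp computes its functional maps, and the gene names, edge weights, and function name are all hypothetical.

```python
def set_association(edges, set_a, set_b):
    """Mean edge weight over the distinct gene pairs spanning two gene sets.

    edges: dict mapping frozenset({gene1, gene2}) to a relationship weight.
    Pairs that have no edge in the network are skipped.
    """
    # Collect unordered pairs so overlapping sets are not double counted.
    pairs = {frozenset((a, b)) for a in set_a for b in set_b if a != b}
    weights = [edges[pair] for pair in pairs if pair in edges]
    return sum(weights) / len(weights) if weights else 0.0


# Hypothetical functional relationship network on four genes.
edges = {
    frozenset(("g1", "g3")): 0.9,
    frozenset(("g1", "g4")): 0.7,
    frozenset(("g2", "g3")): 0.8,
    frozenset(("g2", "g4")): 0.6,
    frozenset(("g1", "g2")): 0.1,  # within pathway A, ignored below
}
pathway_a = {"g1", "g2"}
pathway_b = {"g3", "g4"}

score = set_association(edges, pathway_a, pathway_b)
print(round(score, 2))
```

A summary of this kind collapses many gene-pair edges into one value per pair of gene sets, which is the sense in which a map between pathways or diseases is far smaller than the underlying network.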

Rhodes et al. (2005) employed a semi-naive Bayesian model to integrate a relatively small and highly curated subset of human PPI data: ~40K PPI pairs from orthologous proteins in model organisms, ~200M coexpression measurements spanning only five microarray datasets, and ~40M curated coannotations from the Gene Ontology. The most important way in which this study differs from ours is in its use of prior knowledge: the Rhodes et al. classifier uses curated coannotations from the Gene Ontology to predict protein-protein interactions, which is fundamentally different from the standard bioinformatics paradigm of predicting new biological knowledge from experimental data. Beyond this substantial basic difference, the Rhodes et al. study predicts only PPIs, not more general functional relationships, and it also incorporates a large amount of data from orthologous organisms, not from direct experimentation on human systems; this has the potential to introduce biases based on the way in which orthology is inferred. Finally, the scope of the Rhodes et al. study is quite different from that addressed in our manuscript: the previous study integrated ~250M data points based on a ~3M pair gold standard, versus ~30B data points and ~55M gold standard pairs used by HEFalMp in >200 distinct functional areas. This difference, combined with Rhodes et al.'s semi-manual normalization and filtering of their three data types, makes it infeasible to scale their solution to a comprehensive functional view of the human genome.
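
The difference in scale can be checked with a few lines of arithmetic. The Python sketch below uses only the approximate figures quoted in the paragraph above; the variable names are introduced here for illustration.

```python
K, M, B = 10**3, 10**6, 10**9

# Rhodes et al.: PPI pairs + coexpression measurements + GO coannotations.
rhodes_inputs = 40 * K + 200 * M + 40 * M  # close to the quoted ~250M total
rhodes_points, rhodes_gold = 250 * M, 3 * M
hefalmp_points, hefalmp_gold = 30 * B, 55 * M

data_ratio = hefalmp_points / rhodes_points  # ratio of integrated data points
gold_ratio = hefalmp_gold / rhodes_gold      # ratio of gold standard pairs

print(rhodes_inputs, data_ratio, round(gold_ratio, 1))
```

Under these rounded figures, HEFalMp integrates roughly 120 times as many data points against a gold standard roughly 18 times larger.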

The STRING database (Jensen et al. 2009) focuses on a broader definition of PPIs that includes functional relationships, but the majority of its human interactions represent experimental results imported from existing databases (BioGRID (Stark et al. 2006), MINT (Chatr-aryamontri et al. 2007), etc.). STRING also suffers from the same potential drawback of using curated databases (e.g. HPRD (Mishra et al. 2006), Reactome (Vastrik et al. 2007), and others) as training data relative to a small, similarly curated gold standard (KEGG pathways (Kanehisa et al. 2008)). While this provides an excellent means of accessing multiple reference databases of experimental results through a unified interface, it is potentially circular when framed as an application of machine learning to predict new biological relationships. In cases where STRING performs data integration to predict new protein interactions, it does so by regressing a confidence score against new datasets, which maps their raw results to membership probabilities in KEGG pathways. While STRING can clearly scale to include a tremendous amount of data, its focus is on aggregation of existing PPI databases rather than on prediction of new functional relationships; STRING itself provides neither a uniform machine learning method for integrating its constituent data nor an interface for exploring the results at a systems level comparable to HEFalMp's functional maps.

Other existing systems differ from HEFalMp in their biological, rather than computational, scope; several data integration techniques have proven to be quite successful in predicting functional relationships in simpler organisms. Such integrations are necessarily smaller in computational scope as well; the most comparable study in yeast (Myers and Troyanskaya 2007) included a similarly designed gold standard and coverage of ~200 functional contexts, but this still entailed almost 1,000x fewer data points.

The machine learning methodology developed in the HEFalMp system is an evolution of that yeast system (Myers and Troyanskaya 2007) and of those described in (Huttenhower et al. 2006) and in (Myers et al. 2005), with the addition of Bayesian regularization to improve performance in the presence of very large data collections. Other techniques applied to simpler organisms include those of (Jansen et al. 2003) and (Lee et al. 2004) in yeast and that of (Date and Stoeckert 2006) in the malaria parasite P. falciparum. Jansen et al. focus exclusively on physical PPIs using a small data collection in yeast (e.g. no coexpression data is integrated); with the exception of several cutoffs to discretize continuous predictions into binary interactions, their methodology is essentially an unmodified application of naive Bayesian classifiers. Lee et al. employ a variety of customized model-fitting procedures to map S. cerevisiae experimental data values to a KEGG-based gold standard, and it is unclear whether their approach could scale to the biological and computational challenges of the human genome.
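
One common way to regularize the probability tables of a naive Bayesian classifier is to add pseudocounts, which is equivalent to placing a symmetric Dirichlet prior on each table. The Python sketch below shows that technique only as an illustration; this section does not state which form of Bayesian regularization HEFalMp uses, and the function name, bin labels, and data are hypothetical.

```python
from collections import Counter


def smoothed_likelihoods(values, bins, pseudocount=1.0):
    """Estimate P(bin | class) from observed bin labels.

    Adding `pseudocount` to every bin keeps unseen bins from receiving
    probability zero, which would otherwise veto a relationship outright
    when the per-dataset likelihoods are multiplied together.
    """
    counts = Counter(values)
    total = len(values) + pseudocount * len(bins)
    return {b: (counts[b] + pseudocount) / total for b in bins}


bins = ["low", "mid", "high"]
observed = ["high", "high", "mid", "high"]  # hypothetical training values

table = smoothed_likelihoods(observed, bins)
print(table)
```

With a pseudocount of zero the estimate reduces to the raw frequencies, so the "low" bin in this example would get probability zero; larger pseudocounts pull sparsely observed tables toward the uniform distribution.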

Finally, Date and Stoeckert's work with P. falciparum provides an excellent example of the power of functional data integration to explore a largely uncharacterized biological system; they also build upon the integration techniques of (Troyanskaya et al. 2003) with a biological focus on the mechanisms of malaria infection. This emphasizes the importance of applying data integration techniques to new biological systems where they can most effectively collect, focus, and expand upon large collections of experimental data.

Thus, from a biological perspective, it is significant that HEFalMp offers the first comprehensive functional integration of human genomic data. This provides an opportunity to explore fundamental human biology and the molecular mechanisms of disease at levels ranging from individual experimental results to the interplay between entire cellular pathways. Regularized naive Bayesian networks represent a machine learning technique with sufficient breadth to incorporate billions of experimental data points, and functional mapping provides the depth to hierarchically summarize this tremendous amount of data in a biologically meaningful way. We hope that both techniques are useful to computational and biological investigators alike in the investigation of the human genome and the genomes of other organisms.


References


Chatr-aryamontri, A., Ceol, A., Palazzi, L.M., Nardelli, G., Schneider, M.V., Castagnoli, L., and Cesareni, G. 2007. MINT: the Molecular INTeraction database. Nucleic Acids Research 35: D572-574.

Date, S.V. and Stoeckert, C.J., Jr. 2006. Computational modeling of the Plasmodium falciparum interactome reveals protein function on a genome-wide scale. Genome Research 16: 542-549.

Huttenhower, C., Hibbs, M., Myers, C., and Troyanskaya, O.G. 2006. A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics 22: 2890-2897.

Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan, N.J., Chung, S., Emili, A., Snyder, M., Greenblatt, J.F., and Gerstein, M. 2003. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302: 449-453.

Jensen, L.J., Kuhn, M., Stark, M., Chaffron, S., Creevey, C., Muller, J., Doerks, T., Julien, P., Roth, A., Simonovic, M. et al. 2009. STRING 8--a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Research 37: D412-416.

Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S., Tokimatsu, T. et al. 2008. KEGG for linking genomes to life and the environment. Nucleic Acids Research 36: D480-484.

Lee, I., Date, S.V., Adai, A.T., and Marcotte, E.M. 2004. A probabilistic functional network of yeast genes. Science 306: 1555-1558.

Mishra, G.R., Suresh, M., Kumaran, K., Kannabiran, N., Suresh, S., Bala, P., Shivakumar, K., Anuradha, N., Reddy, R., Raghavan, T.M. et al. 2006. Human protein reference database--2006 update. Nucleic Acids Research 34: D411-414.

Mitchell, T.M. 1997. Machine Learning. McGraw-Hill.

Myers, C.L., Robson, D., Wible, A., Hibbs, M.A., Chiriac, C., Theesfeld, C.L., Dolinski, K., and Troyanskaya, O.G. 2005. Discovery of biological networks from diverse functional genomic data. Genome Biology 6: R114.

Myers, C.L. and Troyanskaya, O.G. 2007. Context-sensitive data integration and prediction of biological networks. Bioinformatics 23: 2322-2330.

Rhodes, D.R., Tomlins, S.A., Varambally, S., Mahavisno, V., Barrette, T., Kalyana-Sundaram, S., Ghosh, D., Pandey, A., and Chinnaiyan, A.M. 2005. Probabilistic model of the human protein-protein interaction network. Nature Biotechnology 23: 951-959.

Rish, I. 2001. An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence.

Stark, C., Breitkreutz, B.J., Reguly, T., Boucher, L., Breitkreutz, A., and Tyers, M. 2006. BioGRID: a general repository for interaction datasets. Nucleic Acids Research 34: D535-539.

Troyanskaya, O.G., Dolinski, K., Owen, A.B., Altman, R.B., and Botstein, D. 2003. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proceedings of the National Academy of Sciences of the United States of America 100: 8348-8353.

Vastrik, I., D'Eustachio, P., Schmidt, E., Joshi-Tope, G., Gopinath, G., Croft, D., de Bono, B., Gillespie, M., Jassal, B., Lewis, S. et al. 2007. Reactome: a knowledge base of biologic pathways and processes. Genome Biology 8: R39.