Previous Work in Heterogeneous Data Integration
As mentioned in the main text, related prior work in the area of heterogeneous data integration falls into two categories:
methodological precursors involving naive Bayesian classifiers and biological precursors performing data integration for
simpler organisms. Naive Bayesian classifiers are themselves quite well studied and robust (see (Mitchell 1997; Rish 2001) for reviews), and their applications for data integration in related biological areas have been mainly in the analysis
of protein-protein interaction (PPI) data.
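As a minimal sketch of how a naive Bayesian classifier can integrate heterogeneous evidence for a single gene pair, the following Python example combines discretized values from several datasets into a posterior probability of a relationship. The dataset names, conditional probability tables, and prior are hypothetical illustrations and are not taken from HEFalMp or from any of the studies discussed here.

```python
import math


def naive_bayes_posterior(prior, likelihoods, observations):
    """Return P(related | observations), assuming datasets are
    conditionally independent given the class (the naive Bayes assumption).

    prior        -- P(related), a float strictly between 0 and 1
    likelihoods  -- {dataset: {value: (P(value | related), P(value | unrelated))}}
    observations -- {dataset: observed discretized value}; datasets with no
                    observation for this pair are simply left out
    """
    # Work in log-odds space so that many datasets do not underflow.
    log_odds = math.log(prior) - math.log(1.0 - prior)
    for dataset, value in observations.items():
        p_related, p_unrelated = likelihoods[dataset][value]
        log_odds += math.log(p_related) - math.log(p_unrelated)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical conditional probability tables for two illustrative datasets.
LIKELIHOODS = {
    "coexpression": {"low": (0.2, 0.6), "high": (0.8, 0.4)},
    "physical_interaction": {"no": (0.7, 0.99), "yes": (0.3, 0.01)},
}

posterior = naive_bayes_posterior(
    prior=0.01,
    likelihoods=LIKELIHOODS,
    observations={"coexpression": "high", "physical_interaction": "yes"},
)
print(round(posterior, 4))
```

Because each dataset contributes an independent likelihood ratio, a pair with no observations falls back to the prior, and adding a dataset requires only one more table.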
Beyond the biological and computational challenges inherent in integrating large
heterogeneous genomic data collections, an aspect of HEFalMp not addressed by any previous system is
its summarization of data as systems-level functional maps. Previous data integration methods generally provided
biomolecular interaction networks as their end product; HEFalMp
includes such functional relationship networks, but
also provides further analysis in the form of functional maps. These represent a uniform framework in which the millions
of edges in such networks can be further summarized in a biologically informative
way, yielding data on relationships
between pathways, diseases, and (as future work) tissue types and developmental stages.
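To illustrate the idea of summarizing many network edges into a single pathway-level value, the sketch below averages the edge weights of all gene pairs that span two gene sets. This is only one simple, assumed summary statistic chosen for illustration; it is not presented as the measure HEFalMp actually uses, and the gene names, pathway names, and weights are hypothetical.

```python
from itertools import product


def between_set_score(network, set_a, set_b):
    """Mean edge weight over distinct gene pairs with one gene in each set.

    network -- {frozenset({gene1, gene2}): weight}, an undirected weighted network
    set_a, set_b -- iterables of gene identifiers
    Returns None when no edge in the network connects the two sets.
    """
    seen = set()
    weights = []
    for a, b in product(set_a, set_b):
        if a == b:
            continue  # a gene has no edge to itself
        key = frozenset((a, b))
        if key in seen:
            continue  # avoid double counting when the sets overlap
        seen.add(key)
        if key in network:
            weights.append(network[key])
    return sum(weights) / len(weights) if weights else None


# Hypothetical functional relationship network over four genes.
NETWORK = {
    frozenset(("G1", "G3")): 0.9,
    frozenset(("G1", "G4")): 0.1,
    frozenset(("G2", "G3")): 0.5,
    frozenset(("G2", "G4")): 0.3,
}
PATHWAY_A = ["G1", "G2"]  # hypothetical gene sets
PATHWAY_B = ["G3", "G4"]

score = between_set_score(NETWORK, PATHWAY_A, PATHWAY_B)
print(score)
```

Computing such a score for every pair of gene sets reduces a gene-level network to a much smaller map over pathways or diseases.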
(Rhodes et al. 2005)
employed a semi-naive Bayesian model to integrate a relatively small and highly curated subset of
human PPI data: ~40
PPI pairs from orthologous proteins in model organisms, ~200
, and ~40M curated coannotations from the Gene Ontology.
The most important way in which this study differs from ours is in its use of prior knowledge: the Rhodes et al. classifier uses curated
coannotations from the Gene Ontology to predict protein-protein interactions, which is different from the
paradigm of predicting new biological knowledge from experimental data.
Beyond this basic difference, the Rhodes et al. study predicts only PPIs, not more general functional relationships, and it
also incorporates a large amount of data from orthologous organisms, not from direct experimentation on human proteins.
This has the potential to introduce biases based on the way in which orthology is inferred. Finally, the scope of
the Rhodes et al. study is quite different from that addressed in HEFalMp: the previous study's integration was
based on a ~3M pair gold standard, versus ~30B data points and ~55M gold standard pairs used by HEFalMp
across >200 distinct functional areas. This, combined with Rhodes et al.'s manual normalization and filtering of
their three data types, makes it infeasible to scale their solution to a comprehensive functional view of the human genome.
The STRING database
(Jensen et al. 2009)
focuses on a broader definition
of PPIs that includes functional relationships,
but the majority of its data are experimental results imported from existing databases (BioGRID (Stark et al. 2006), MINT (Chatr-aryamontri et al. 2007), etc.). STRING also suffers the
potential drawback of using
curated databases (e.g. HPRD (Mishra et al. 2006), Reactome (Vastrik et al. 2007), and others) as training data relative to a
small, similarly curated gold standard (KEGG pathways (Kanehisa et al. 2008)).
While this provides an excellent means of unifying multiple reference databases of experimental results, it is not framed as an application of
machine learning to predict new biological relationships.
In cases where STRING performs data integration to predict new protein interactions, it does so by mapping their raw results to membership
in KEGG pathways.
While STRING can clearly scale to include a large amount of data, its focus is on aggregation of existing PPI databases rather than on prediction of new
functional relationships; STRING itself provides neither a machine learning method for integrating its
constituent data nor an interface for exploring the results at a
level comparable to HEFalMp's functional maps.
Other existing systems differ from HEFalMp in their biological, rather than computational, scope; several data integration systems
have proven to be quite successful in predicting functional relationships in simpler organisms.
These integrations are necessarily smaller in computational scope as well; the most comparable study in yeast (Myers et al. 2007)
included a similar gold standard and coverage of ~200 functional contexts, but this still entailed
fewer data points.
The machine learning methodology developed in the HEFalMp system is an evolution of this
system and that described in (Huttenhower et al. 2006) and (Myers et al. 2005), with the addition of Bayesian regularization to
improve performance in the presence of very large data collections.
Other techniques applied to simpler organisms include those of (Jansen et al. 2003) and (Lee et al. 2004) in yeast and that of
(Date et al. 2006) in the malaria parasite Plasmodium falciparum. Jansen et al. focus exclusively on physical PPIs using a small data collection in yeast (e.g. no
coexpression data is integrated); with the exception of several cutoffs to discretize continuous predictions into binary
interactions, their methodology is essentially an application of naive Bayesian classifiers. Lee et al. employ a
variety of customized model fitting techniques to relate experimental data values to a KEGG-based gold standard, and it is
unclear whether their approach would scale to the biological and computational complexity of the human genome.
Date and Stoeckert's work with Plasmodium falciparum provides an excellent example of the power of functional data integration to
explore a largely uncharacterized biological system; they also build on the integration techniques of (Troyanskaya et al. 2003)
with a biological focus on the mechanisms of malaria infection. This emphasizes the importance of applying data integration techniques
to new biological systems where they can most effectively collect, focus, and expand upon large
collections of experimental data.
Thus, from a biological perspective, it is significant that HEFalMp offers the first comprehensive functional integration
of human genomic data. This provides an opportunity to explore fundamental human biology and the molecular
mechanisms of disease at levels ranging from individual experimental results to the interplay between entire cellular pathways.
Regularized naive Bayesian networks represent a machine learning technique with sufficient breadth to
incorporate billions of experimental data points, and functional mapping provides the depth to hierarchically summarize
these data in a biologically meaningful way.
We hope that both techniques are useful to
computational and biological investigators alike in the investigation of the human genome and the genomes of other organisms.
References
Chatr-aryamontri, A., Ceol, A., Palazzi, L.M., Nardelli, G., Schneider, M.V., Castagnoli, L., and Cesareni, G. 2007. MINT: the Molecular INTeraction database. Nucleic acids research.
Date, S.V. and Stoeckert, C.J., Jr. 2006. Computational modeling of the Plasmodium falciparum interactome reveals protein function on a genome-wide scale.
Huttenhower, C., Hibbs, M., Myers, C., and Troyanskaya, O.G. 2006. A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics (Oxford, England).
Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan, N.J., Chung, S., Emili, A., Snyder, M., Greenblatt, J.F., and Gerstein, M. 2003. A Bayesian networks approach for predicting protein-protein interactions from genomic data.
Jensen, L.J., Kuhn, M., Stark, M., Chaffron, S., Creevey, C., Muller, J., Doerks, T., Julien, P., Roth, A., Simonovic, M. et al. 2009. STRING 8: a global view on proteins and their functional interactions in 630 organisms. Nucleic acids research.
Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S., Tokimatsu, T. et al. 2008. KEGG for linking genomes to life and the environment. Nucleic acids research.
Lee, I., Date, S.V., Adai, A.T., and Marcotte, E.M. 2004. A probabilistic functional network of yeast genes.
Mishra, G.R., Suresh, M., Kumaran, K., Kannabiran, N., Suresh, S., Bala, P., Shivakumar, K., Anuradha, N., Reddy, R., Raghavan, T.M. et al. 2006. Human protein reference database. Nucleic acids research.
Mitchell, T.M. 1997.
Myers, C.L., Robson, D., Wible, A., Hibbs, M.A., Chiriac, C., Theesfeld, C.L., Dolinski, K., and Troyanskaya, O.G. 2005. Discovery of biological networks from diverse functional genomic data.
Myers, C.L. and Troyanskaya, O.G. 2007. Context-sensitive data integration and prediction of biological networks.
Rhodes, D.R., Tomlins, S.A., Varambally, S., Mahavisno, V., Barrette, T., Kalyana-Sundaram, S., Ghosh, D., Pandey, A., and Chinnaiyan, A.M. 2005. Probabilistic model of the human protein-protein interaction network.
Rish, I. 2001. An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence.
Stark, C., Breitkreutz, B.J., Reguly, T., Boucher, L., Breitkreutz, A., and Tyers, M. 2006. BioGRID: a general repository for interaction datasets. Nucleic acids research.
Troyanskaya, O.G., Dolinski, K., Owen, A.B., Altman, R.B., and Botstein, D. 2003. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proceedings of the National Academy of Sciences of the United States of America.
Vastrik, I., D'Eustachio, P., Schmidt, E., Joshi-Tope, G., Gopinath, G., Croft, D., de Bono, B., Gillespie, M., Jassal, B., Lewis, S. et al. 2007. Reactome: a knowledge base of biologic pathways and processes.