INTRODUCTION TO MOLECULAR BIOINFORMATICS FOR DISEASE

MOLECULAR BIOINFORMATICS FOR DISEASE

Tutorial Notes for Pacific Symposium on Biocomputing 2009



ATUL BUTTE

Department of Medicine (Medical Informatics) and Pediatrics, Stanford University, U.S.A.


MARICEL KANN

University of Maryland, Baltimore County, U.S.A.


YVES A. LUSSIER

Center for Biomedical Informatics and Section of Genetic Medicine, Dept. of Medicine and
UC Cancer Research Center; The University of Chicago, U.S.A.


YANAY OFRAN

Bar Ilan University, Israel


PREDRAG RADIVOJAC

School of Informatics, Indiana University, U.S.A.



Prose summary of the topic


Over the past 10 years, high-dimensional investigations related to human disease have expanded considerably in breadth and depth. The breadth of such investigations spans at least 30 types of high-dimensional measurement and experimental modalities, including RNA expression microarrays, DNA sequencing, protein identification, mutagenesis, RNA interference, and many others. The depth of such investigations has grown to include measurements of entire sets of transcripts, proteins, metabolites, and genomes. Most recently, these technologies have started to be applied to the study of many diseases. In the US, the NIH Roadmap for Medical Research (Zerhouni, 2003) has led to multiple funding opportunities for bioinformaticians to collaborate with clinical researchers to promote and facilitate translational research. For example, the Clinical and Translational Science Awards, the replacement for the General Clinical Research Centers, require a strong biomedical informatics collaborative component.

This tutorial focuses on the emerging fields of bioinformatics in diseases and phenomics: from protein structures to protein-protein interactions to supracellular phenotypes. Experimental studies indicate that protein interactions play a key role in many diseases, even in some that are considered complex or multifactorial. While altered phenotypes are among the most reliable manifestations of altered gene functions, research focused on systematic analysis of phenotype relationships to study human biology is still in its infancy. We use the words phenome and phenomics to describe the physical totality of all traits of an organism (Mahner and Kary 1997).

From Mendelian to multifactorial diseases

One of the ultimate goals of biological sciences, and certainly one with a high impact on society, is to improve our understanding of the processes and events that lead to disease in organisms. Molecular biologists, who traditionally study the structure and function of individual proteins and genes, have gained insight and introduced several discoveries that have ultimately reached the bedside. The deluge of newly sequenced proteins offers tremendous amounts of data regarding the molecular basis of disease. The OMIM database (Hamosh, Scott et al. 2005) makes use of this opportunity. The curators of OMIM derive evidence from the literature for the relationship between a clinical phenotype and its associated sequence/mutation. OMIM is primarily focused on Mendelian diseases, namely, diseases that are caused by a mutation in a single gene and are inherited in Mendelian patterns. Simple genomic events that mutate or eliminate specific genes (e.g., frameshift mutations or insertion of viral genes) can account for the genotypes underlying these diseases. An important question in this respect is exactly how mutations in a gene lead to the observed pathology (Cargill, Altshuler et al. 1999; Wang and Moult 2001; Wang and Moult 2003).

While OMIM is focused on monogenic disorders, it also contains information about select “complex diseases”. In fact, many, if not most, human diseases are considered to be “complex” or “multifactorial”, and cannot be fully accounted for by a single molecular event. Genomic and proteomic data enhance the study of these diseases as well, by helping to unravel the intricate interaction networks that underlie them (Rual, Venkatesan et al. 2005; Stelzl, Worm et al. 2005). Both the study of Mendelian diseases and that of complex diseases increasingly rely on computational tools and findings. In the Mendelian case, researchers often use computational tools to understand the biophysical effects of mutations and how they lead to disease. In the case of multifactorial diseases, computational tools are even more pertinent. Biological processes are not realized by a single molecule, but rather by the complex interaction of proteins with their environment, including nucleic acids, ions, lipids, membranes and, of course, other proteins. Hence, to fully understand such processes one needs to explore complex pathways, networks, expression patterns, control mechanisms and their interrelationships.

Numerous molecular databases include in their annotation the implication of the protein in diseases. For example, the SWISS-PROT database (Bairoch, et al., 2005) attempts to include in its annotation of proteins “Disease(s) associated with any number of deficiencies in the protein”, though the annotations are in free text. Similar annotations are becoming popular in many other databases. The GeneCards (http://www.genecards.org) database includes a separate section that describes the implication of a gene in various diseases. The KEGG database (http://www.genome.jp/kegg/) has recently begun to curate disease-related pathways. The Genetic Association Database (Becker, et al., 2004) similarly serves as an archive of genetic association studies, while HGMD provides associations of molecular events such as mutations, insertions/deletions, and splicing anomalies with disease (Stenson, et al., 2003). Finally, PharmGKB (Klein, et al., 2001) provides not only gene-disease relationships, but also information about genotype-drug interactions. There are also many domain-specific databases which curate proteins that are involved in specific diseases. Most of these attempts are based on manual curation of literature.

One aspect of this problem is that of protein interaction and disease. The study of protein interaction has received a great deal of attention in recent years, including PSB sessions from 2006-2007 named Protein Interactions and Disease. In particular, numerous computational tools have been designed to enhance our understanding of protein interaction. On the other hand, in the study of the molecular basis of disease, protein interaction is increasingly acknowledged as a valuable prism through which disease might be analyzed, understood and possibly even treated. We will start by examining how these two developments meet, namely how computational analysis can be used to study the role of interaction in disease.

Protein interactions and their computational analysis

Every protein has a biological function, yet most of the biological functions are carried out by groups of proteins interacting with each other and with other molecules in their environment in complex networks. The main type of interaction that is of interest in this context is the interaction between proteins and other proteins. This includes, for example, the study of interactions between proteins and antibodies and the study of signal transduction. Both these processes are critical to the understanding of many complex diseases and are central tools in pharmacology and drug discovery. Many computational studies focus on protein-protein interactions (Salwinski and Eisenberg 2003) and their importance for predicting protein function (Rost, Liu et al. 2003).

Other types of interactions that are being studied extensively are:



- protein-DNA/RNA interaction; critical for the understanding of expression and expression control
- protein-small molecule/metal ion interaction; critical for the understanding of protein function and for drug discovery and design
- protein-membrane interaction; critical for the understanding of a myriad of biological and pathological processes, including viral/microbial infections

Interactions between proteins and other molecules can be physical, i.e. by chemically binding each other or by binding together to a third substrate, or they can be functional, e.g. by controlling each other's expression or by participating in the same biochemical pathway. A complete picture of all the proteins that are involved in a certain biological process would not only enhance our understanding of diseases but would also break new ground in drug development by identifying new targets for drugs (Ofran, Punta et al. 2005).

Computational studies of protein interaction usually attempt to do one of the following:



- identify proteins that bind to a certain molecule
- analyze or predict the interface/binding site
- analyze or predict the process in which a certain interaction participates
- analyze or predict the effect of the loss/gain of a certain interaction

Protein interactions and disease

Protein interaction could be implicated in pathological processes in one of two ways: the elimination of an essential interaction or the gain of a deleterious one (Ryan and Matthews 2005). Mendelian diseases are often caused by a single mutation. Such a mutation could lead to the complete elimination of the damaged gene or gene product. Obviously, these cases are not of interest in the context of protein interaction networks. In many other Mendelian cases, where the point mutation does not eliminate the gene/protein altogether, the pathological effect could often be traced to an effect on protein interaction (Wang and Moult 2001; Wang and Moult 2003). Intrinsically unstructured proteins, participating in transient (possibly multiple) protein-protein interactions, have also been related to disease (Uversky et al. 2006; Feng et al. 2006). Undesired protein aggregation is the key factor in amyloidoses, a class of diseases including Alzheimer's (Dobson, 2001; Fernandez et al. 2003).


In cancer, a quintessential complex disease, the main focus of basic and pharmacological research is on protein interaction. One of the most prominent anticancer drugs today, Gleevec, is a tyrosine kinase inhibitor and is very effective in treating chronic myeloid leukemia. In almost all types of cancer, there are interactions that have gone awry. Constitutive signal transduction, a result of an aberrant interaction, is implicated in many tumors and is one of the main targets for therapy development (Levitzki and Gazit 1995). Control of the expression of different proteins, which is mediated by protein-DNA interaction, was shown to differ between normal and cancerous cells, and between different types of malignancies (Golub, Slonim et al. 1999). The mechanisms of action of oncogenes and tumor suppressors are based on protein interactions (Kamb, Gruis et al. 1994). Additionally, non-genetic diseases that are caused by infective agents (bacteria, viruses) depend, almost by definition, on the interaction of proteins and other molecules.

Analyzing the interaction networks in terms of specific interactions for each single disease has proven successful for understanding their molecular basis. In contrast, looking for general topological properties of the whole network of human disease interactions might prove useful to describe some general principles about diseases (Jonsson et al. 2006; Barabasi, personal communication). Recently, it has been found that proteins with mutation(s) related to a disease are more likely to interact with proteins already known to cause similar diseases (Gandhi et al. 2006).
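The observation that disease-related proteins tend to interact with proteins linked to similar diseases can be illustrated with a very small network computation. The sketch below is only an illustration under stated assumptions, not the analysis used in the cited studies: the network, the protein names (P1 to P6) and the helper functions are all hypothetical. It computes, for a given protein, the fraction of its interaction partners that are already annotated to a disease.

```python
from collections import defaultdict


def build_network(edges):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def disease_neighbor_fraction(neighbors, protein, disease_proteins):
    """Fraction of a protein's interaction partners already linked to the disease."""
    partners = neighbors.get(protein, set())
    if not partners:
        return 0.0
    return len(partners & disease_proteins) / len(partners)


# Hypothetical toy network; protein names are placeholders, not real data.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P5", "P4"), ("P5", "P6")]
known_disease_proteins = {"P2", "P3"}

network = build_network(edges)
scores = {
    protein: disease_neighbor_fraction(network, protein, known_disease_proteins)
    for protein in ("P1", "P5")
}
```

In this toy case P1 has two of its three partners annotated to the disease, while P5 has none, so P1 would be the more plausible candidate under this simple guilt-by-association view.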


Computational analysis of protein interactions in disease

Preliminary computational studies have attempted to characterize the relationship between protein interaction and disease. Some of these studies tried to find an explanation for the deleterious effect of some mutations. For example, Mirkovic and his coworkers attempted to rationalize the effect of the mutations in the BRCA gene that lead to breast cancer. They found that many of these mutations could be traced to the protein interaction sites (Mirkovic, Marti-Renom et al. 2004). Other studies have attempted to comprehensively characterize the effect of all known deleterious SNPs (Wang and Moult 2001) and even devise methods to predict the effect of uncharacterized ones (Saunders and Baker 2002). Methods for addressing effects of mutations were reviewed by Mooney (Mooney, 2005).

At the single protein molecular level, computational techniques such as homology modeling, molecular dynamics and protein-protein docking have been used to predict protein-protein interactions and to study their determinants. Since the solution of the first high-resolution ion channel structure (Doyle et al. 1998), for example, these techniques have been widely applied to the analysis of toxin-channel interactions (Wang et al. 2006; M'Barek et al. 2005; Wu et al. 2004).

Genome-wide computational analysis of bacterial and viral genomes can contribute to the understanding of infectious diseases in humans. New computational techniques can be applied to gene expression data to monitor the host response to the infective agents or to drugs against them (Bandyopadhyay et al. 2006; Cabusora et al. 2005; Musser et al. 2005; Rachman et al. 2006). These systems biology approaches will be key to uncovering complex interactions between host and pathogen, and new mechanisms of pathogen resistance.

Computational techniques to study general network properties can be applied to the understanding of the properties of disease-related interactions. For instance, based on the differences between genes related to hereditary diseases and unrelated ones, a support vector machine (SVM) based classifier was recently proposed and used for the systematic classification of all genes from the human genome (Xu et al. 2006). A number of other studies were reviewed by Kann (2007), Oti and Brunner (2007) and Dalkilic et al. (2008).

The great challenges of the field remain to assess, analyze and predict the importance of protein interaction in different diseases. Success in these tasks can be readily translated into progress in drug design or discovery and ultimately lead to better treatment.

In recent years, the emergence of new experimental protocols and techniques such as RNA, DNA and protein microarrays, two-hybrid systems, and mass spectrometry, as well as the explosion in the number and size of sequence and structure databases, have changed biomedical science. By taking advantage of the enormous amount of data generated by all these techniques, computational biology can now attempt to capture more of the complexity of a biological process. The increasing number of computational studies of protein networks, pathways, protein-protein, protein-metabolite and protein-DNA/RNA interactions indicates that it is now possible to address the connections between protein interactions and diseases. We are confident that the papers presented in this session will contribute to further advances in this important and rapidly growing area of biomedical research.

Diseases and phenotypes

While the genotype represents an organism’s exact genetic make-up, the phenotype of an individual or organism is the complete physical manifestation of that organism, usually considered as a sum of multiple individual traits, such as internal and external appearance, ability, and behavior, which are known to differ between organisms. A few examples will illustrate this difference. The DNA sequence of an organism is part of its genotype. RNA measurements from cells from an organism are a phenotype. Single nucleotide polymorphism measurements are part of its genotype, as are assignments to common haplotypes. It is also worthwhile to consider what a disease is. A disease is an alteration of the mind or body of an organism that causes uneasiness, dysfunction, suffering, or death to the organism, or those in contact with the organism. In addition, it is unavoidable to consider disease in a social context as well. While a disease can clearly be a phenotype, propensity towards or heightened risk of a disease can also be phenotypes.

Phenotypes can be represented arbitrarily, but the power to compare phenotypes within and across species of organisms only comes when a useful representation is chosen. For example, many mouse models have been created to simulate human diseases and phenotypes. Knockout mice, where a particular gene has been eliminated throughout the mouse or within specific tissues of the mouse, or transgenic mice, where a particular gene has been “turned on” throughout the mouse or within specific tissues of the mouse, have traditionally been stored and distributed through a variety of institutions, such as the Jackson Laboratory. Detailed phenotypes on over 133 strains of mice, as well as scattered descriptions of thousands of additional strains, are available through their Mouse Phenome Database (Bogue, Grubb et al. 2007). Users searching for a gene that might explain a human phenotype can search on the Jackson Laboratory website for their phenotype of interest, using a structured ontology, to yield a list of genes and mouse models that have been shown to have the phenotype in question.

Human phenotypes are harder to characterize. While Freimer and Sabatti called for a Human Phenome Project in 2003 (Freimer and Sabatti 2003), in most cases, the richest source of phenotypes for humans comes from knowledge of human pathological conditions. The Online Mendelian Inheritance in Man database and website is a large set of genetic loci and genes with mutations or other variants linked to monogenic inherited disorders. These disorders are described in a free-text historical narrative, relating when each disease was first described and linked to genetics. For many diseases, an additional clinical synopsis is provided, which is a list of uncontrolled terms describing traits seen in the disorder, roughly arranged by organ system.

There are other ways to represent human phenotypes and diseases. Papers that are published on diseases are indexed by the National Library of Medicine, so the Medical Subject Headings (MeSH) can be used to represent diseases. The International Classification of Diseases (ICD) is a system that started by representing causes of death and disease going back to the late 1800s (World Health Organization 2005); the ninth and tenth editions of the ICD are used worldwide to communicate information about diseases between and among physicians, hospitals, payors, and public health officials. The Systematized Nomenclature of Medicine (SNOMED) has been developed by the College of American Pathologists since 1965, and is a much more detailed representation of disease, enabled by molecular classification (Chute 2000). The Unified Medical Language System (UMLS) is an overarching standardized nomenclature maintained by the National Library of Medicine that can be used to relate ICD, MeSH, SNOMED, and over a hundred other vocabularies (Bodenreider 2004). Finally, efforts such as Disease Ontology (http://diseaseontology.sourceforge.net/) are also being developed in order to assign hierarchical relationships among disease terms. Disease Ontology provides mapping to SNOMED and is based on ICD terms.

The advantages and disadvantages of representing diseases in each of these ways depend on the data one wishes to relate. For example, by considering diseases by MeSH, one can use other MeSH annotations across the literature to relate drugs and symptoms to disease. By considering diseases by ICD, one can tie in data from public health sources, such as epidemiological data sets. By representing diseases by SNOMED, one can relate diseases to known pathophysiological mechanisms causing those diseases. Use of UMLS might reduce the risk and arbitrariness of choosing one disease representation over another.
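The idea of relating several disease vocabularies through a shared concept layer can be sketched as a simple crosswalk. This is a minimal illustration assuming a hand-built mapping table; it does not use the UMLS distribution or any real UMLS tooling, and every concept identifier and code below (CONCEPT_A, MESH_A, and so on) is a made-up placeholder rather than a real MeSH, ICD or SNOMED code.

```python
# Hypothetical crosswalk table: each shared concept lists its code in the
# vocabularies that cover it. All identifiers are placeholders, not real codes.
concept_to_codes = {
    "CONCEPT_A": {"MeSH": "MESH_A", "ICD": "ICD_A", "SNOMED": "SNOMED_A"},
    "CONCEPT_B": {"MeSH": "MESH_B", "ICD": "ICD_B"},
}


def build_reverse_index(concept_to_codes):
    """Map each (vocabulary, code) pair back to its shared concept."""
    index = {}
    for concept, codes in concept_to_codes.items():
        for vocabulary, code in codes.items():
            index[(vocabulary, code)] = concept
    return index


def translate(code, source, target, concept_to_codes, reverse_index):
    """Translate a code between vocabularies via the shared concept.

    Returns None when the code is unknown or the target vocabulary
    has no entry for the concept.
    """
    concept = reverse_index.get((source, code))
    if concept is None:
        return None
    return concept_to_codes[concept].get(target)


reverse_index = build_reverse_index(concept_to_codes)
icd_code = translate("MESH_A", "MeSH", "ICD", concept_to_codes, reverse_index)
missing = translate("ICD_B", "ICD", "SNOMED", concept_to_codes, reverse_index)
```

Routing every translation through the concept layer means that adding a new vocabulary requires one mapping to the concepts rather than a separate mapping to every other vocabulary.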

Perez-Iratxeta was one of the first to link knowledge of diseases in MEDLINE to known alterations in biochemistry, through MeSH annotations across the literature. She was then able to link these changes in biochemistry to specific genes to make predictions of genes with mutations associated with those diseases (Perez-Iratxeta, Bork et al. 2002). Using a data-driven approach, Butte and Kohane linked publicly-available microarray data with the phenotypic descriptors extracted from those experimental annotations, to broadly associate environmental factors and studied phenotypes to genes showing differential expression (Butte and Kohane 2006). A number of other approaches have also been proposed, including those based on a single type of data (e.g. protein-protein interaction networks) (Chen, et al., 2006; Gonzalez, et al., 2007; Oti, et al., 2006) or those that integrate a number of data types (Adie, et al., 2005; Adie, et al., 2006; Aerts, et al., 2006; Freudenberg and Propping, 2002; George, et al., 2006; Lage, et al., 2007; Radivojac, et al., 2008; Rossi, et al., 2006; Turner, et al., 2003).

Linking phenotypes to protein-protein interactions

While genotypes and phenotypes are certainly studied individually, analyses which bring them together often have the highest impact for medicine. The most common method of associating genotypes with phenotypes is the calculation of quantitative trait loci, where consecutive regions of chromosomes are statistically associated with quantifiable traits, such as blood pressure, height, or cardiovascular parameters (Nadeau, Burrage et al. 2003). However, phenotypes can be considered broadly, and can even include RNA measurements or metabolic measurements. For instance, expression quantitative trait loci (eQTLs) and metabolic quantitative trait loci (mQTLs) may be used to find genetic loci associated with expression level differences of genes or metabolite abundance (Jansen and Nap 2001; Schadt, Monks et al. 2003; Fu, Swertz et al. 2007).
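The statistical association at the heart of a quantitative trait locus analysis can be sketched with a single-marker scan. The example below is a deliberately simplified illustration, not the procedure of any of the cited studies: it assumes genotypes coded as allele counts (0, 1 or 2), uses an ordinary least-squares slope and Pearson correlation as the association measure, and runs on invented toy data with hypothetical marker names. Real analyses add significance testing, multiple-testing correction and interval mapping.

```python
def single_marker_association(genotypes, trait_values):
    """Return (slope, pearson_r) for a trait regressed on allele counts."""
    n = len(genotypes)
    if n != len(trait_values) or n < 3:
        raise ValueError("need matching genotype/trait lists with at least 3 samples")
    mean_g = sum(genotypes) / n
    mean_t = sum(trait_values) / n
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_tt = sum((t - mean_t) ** 2 for t in trait_values)
    s_gt = sum((g - mean_g) * (t - mean_t) for g, t in zip(genotypes, trait_values))
    if s_gg == 0 or s_tt == 0:
        raise ValueError("genotypes and trait values must both vary")
    slope = s_gt / s_gg
    pearson_r = s_gt / (s_gg * s_tt) ** 0.5
    return slope, pearson_r


# Invented toy data: six individuals, one quantitative trait, two markers.
trait = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
markers = {
    "marker_1": [0, 0, 1, 1, 2, 2],  # tracks the trait closely
    "marker_2": [0, 1, 2, 0, 1, 2],  # only weakly related to the trait
}

results = {name: single_marker_association(g, trait) for name, g in markers.items()}
# Pick the marker with the strongest absolute correlation.
best_marker = max(results, key=lambda name: abs(results[name][1]))
```

The same scan applies unchanged to eQTL or mQTL settings by replacing the trait vector with an expression level or a metabolite abundance.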

A significant methodological improvement was made by Lage, et al. (2007). Building on the assumption that genetic syndromes with shared phenotypes often involve proteins in the same pathway, Lage and his team formally calculate metrics to measure the similarity and differences between syndromes given clinical terms that describe them. Then, when given an input query region of a chromosome suspected of being involved with a genetic disease, their method returns a targeted prioritization of those genes within the region reflective of the likelihood of each gene being involved with that disease. These scores are calculated based on the protein-protein interaction distance between each gene and other genes known to be associated with diseases with similar phenotypes.
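The notion of ranking candidate genes by their interaction distance to known disease genes can be sketched as follows. This is not the scoring scheme of Lage, et al. (2007), whose details are not reproduced here; it is a simplified stand-in that assumes an unweighted interaction network and uses the number of hops found by breadth-first search as the distance. The gene names (D1, G1 to G4) and the network are hypothetical.

```python
from collections import deque


def shortest_distances(neighbors, sources):
    """Multi-source BFS: hops from each reachable gene to the nearest source."""
    distance = {s: 0 for s in sources if s in neighbors}
    queue = deque(distance)
    while queue:
        current = queue.popleft()
        for partner in neighbors[current]:
            if partner not in distance:
                distance[partner] = distance[current] + 1
                queue.append(partner)
    return distance


def rank_candidates(neighbors, candidates, known_disease_genes):
    """Order candidates by distance to the nearest known disease gene.

    Unreachable candidates are placed last; ties are broken by name.
    """
    distance = shortest_distances(neighbors, known_disease_genes)
    unreachable = float("inf")
    return sorted(candidates, key=lambda gene: (distance.get(gene, unreachable), gene))


# Hypothetical interaction network as an adjacency map (every gene is a key).
neighbors = {
    "D1": {"G1"},
    "G1": {"D1", "G2"},
    "G2": {"G1", "G3"},
    "G3": {"G2"},
    "G4": set(),  # no known interactions
}

ranking = rank_candidates(neighbors, ["G3", "G4", "G1", "G2"], {"D1"})
```

Here the candidates in the query region would be ordered G1, G2, G3, G4, with the isolated gene G4 ranked last because no path connects it to a known disease gene.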

Finally, we mention that research has also addressed the identification of drug targets, effectively connecting proteins, drugs and diseases. A notable example is the work by the Bork group (Campillos et al. 2008), which uses side-effect similarity extracted from drug labels to improve drug target identification.
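To give a sense of what a side-effect similarity between drugs might look like computationally, the sketch below compares drugs by the overlap of their side-effect term sets. It is an assumption-laden illustration only: plain Jaccard overlap is used as a stand-in and is not claimed to be the measure used by Campillos et al. (2008), and the drug names and side-effect lists are invented.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: shared terms divided by all terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Invented side-effect annotations; drug names are placeholders.
side_effects = {
    "drug_A": {"nausea", "headache", "dizziness"},
    "drug_B": {"nausea", "headache", "rash"},
    "drug_C": {"insomnia"},
}

# Score every drug pair and list the most similar pairs first.
pairs = sorted(
    (
        (jaccard(side_effects[x], side_effects[y]), x, y)
        for x, y in combinations(sorted(side_effects), 2)
    ),
    reverse=True,
)
most_similar = pairs[0]
```

Under the hypothesis described above, a highly similar pair such as drug_A and drug_B would be a candidate for sharing a protein target, which would then need experimental confirmation.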

Resources

SWISS-PROT database
http://www.ebi.ac.uk/swissprot

Jackson Laboratories
http://www.jax.org

Mouse Phenome Database
http://www.jax.org/phenome

NCBI dbSNP
http://www.ncbi.nlm.nih.gov/projects/SNP

Online Mendelian Inheritance in Man (OMIM)
http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim

Gentrepid server
http://www.gentrepid.org

Endeavour server
http://www.bits.vib.be/endeavour/

TOM server
http://www-micrel.deis.unibo.it/~tom/

Prospector server
http://www.genetics.med.ed.ac.uk/prospectr/

G2D server
http://www.ogic.ca/projects/g2d_2/

PhenoPred server
www.phenopred.org

Reviews

1. Ryan DP & Matthews JM. Protein-protein interactions in human disease. Curr Opin Struct Biol, 15, 441-446 (2005)

2. Oti M & Brunner HG. The modular nature of genetic diseases. Clin Genet, 71, 1-11 (2007)

3. Kann MG. Protein interactions and disease: computational approaches to uncover the etiology of diseases. Brief Bioinform, 8, 333-346 (2007)

4. Lussier YA & Liu Y. Computational approaches to phenotyping: high-throughput phenomics. Proc Am Thorac Soc, 4, 18-25 (2007)

5. Loscalzo J, Kohane I, & Barabasi A-L. Human disease classification in the postgenomics era: a complex systems approach to human pathobiology. Mol Syst Biol, 3, 124 (2007)

6. Dalkilic MM, Costello JC, Clark WT, & Radivojac P. From protein-disease associations to disease informatics. Front Biosci, 13, 3391-3407 (2008)

7. Ideker T & Sharan R. Protein networks in disease. Genome Res, 18, 644-652 (2008)


References

Adie, E.A., Adams, R.R., Evans, K.L., Porteous, D.J. and Pickard, B.S. (2005) Speeding disease gene discovery by sequence based candidate prioritization, BMC Bioinformatics, 6, 55.

Adie, E.A., Adams, R.R., Evans, K.L., Porteous, D.J. and Pickard, B.S. (2006) SUSPECTS: enabling fast and effective prioritization of positional candidates, Bioinformatics, 22, 773-774.

Aerts, S., Lambrechts, D., Maity, S., Van Loo, P., Coessens, B., De Smet, F., Tranchevent, L.C., De Moor, B., Marynen, P., Hassan, B., Carmeliet, P. and Moreau, Y. (2006) Gene prioritization through genomic data fusion, Nat Biotechnol, 24, 537-544.

Bairoch, A., Apweiler, R., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Natale, D.A., O'Donovan, C., Redaschi, N. and Yeh, L.S. (2005) The Universal Protein Resource (UniProt), Nucleic Acids Res, 33 Database Issue, D154-159.

Becker, K.G., Barnes, K.C., Bright, T.J. and Wang, S.A. (2004) The genetic association database, Nat Genet, 36, 431-432.

Bodenreider, O. (2004). "The Unified Medical Language System (UMLS): integrating biomedical terminology." Nucleic Acids Res 32(Database issue): D267-70.


The Unified Medical Language System (http://umlsks.nlm.nih.gov) is a repository of biomedical vocabularies developed by the US National Library of Medicine. The UMLS integrates over 2 million names for some 900,000 concepts from more than 60 families of biomedical vocabularies, as well as 12 million relations among these concepts. Vocabularies integrated in the UMLS Metathesaurus include the NCBI taxonomy, Gene Ontology, the Medical Subject Headings (MeSH), OMIM and the Digital Anatomist Symbolic Knowledge Base. UMLS concepts are not only inter-related, but may also be linked to external resources such as GenBank. In addition to data, the UMLS includes tools for customizing the Metathesaurus (MetamorphoSys), for generating lexical variants of concept names (lvg) and for extracting UMLS concepts from text (MetaMap). The UMLS knowledge sources are updated quarterly. All vocabularies are available at no fee for research purposes within an institution, but UMLS users are required to sign a license agreement. The UMLS knowledge sources are distributed on CD-ROM and by FTP.


Bogue, M. A., S. C. Grubb, et al. (2007). "Mouse Phenome Database (MPD)." Nucleic Acids Res 35(Database issue): D643-9.

The Mouse Phenome Database (MPD; http://www.jax.org/phenome) is a repository of phenotypic and genotypic data on commonly used and genetically diverse inbred strains of mice. Strain characteristics data are contributed by members of the scientific community. Electronic access to centralized strain data enables biomedical researchers to choose appropriate strains for many systems-based research applications, including physiological studies, drug and toxicology testing and modeling disease processes. MPD provides a community data repository and a platform for data analysis and in silico hypothesis testing. The laboratory mouse is a premier genetic model for understanding human biology and pathology; MPD facilitates research that uses the mouse to identify and determine the function of genes participating in normal and disease pathways.


Butte, A. J. and I. S. Kohane (2006). "Creation and implications of a phenome-genome network." Nat Biotechnol 24(1): 55-62.

Although gene and protein measurements are increasing in quantity and comprehensiveness, they do not characterize a sample's entire phenotype in an environmental or experimental context. Here we comprehensively consider associations between components of phenotype, genotype and environment to identify genes that may govern phenotype and responses to the environment. Context from the annotations of gene expression data sets in the Gene Expression Omnibus is represented using the Unified Medical Language System, a compendium of biomedical vocabularies with nearly 1-million concepts. After showing how data sets can be clustered by annotative concepts, we find a network of relations between phenotypic, disease, environmental and experimental contexts as well as genes with differential expression associated with these concepts. We identify novel genes related to concepts such as aging. Comprehensively identifying genes related to phenotype and environment is a step toward the Human Phenome Project.


Cargill, M., D. Altshuler, et al. (1999). "Characterization of single-nucleotide polymorphisms in coding regions of human genes." Nat Genet 22(3): 231-8.

Chen, J.Y., Shen, C. and Sivachenko, A.Y. (2006) Mining Alzheimer disease relevant proteins from integrated protein interactome data, Pac Symp Biocomput, 11, 367-378.

Chute, C. G. (2000). "Clinical classification and terminology: some history and current observations." J Am Med Inform Assoc 7(3): 298-303.

Campillos, M., Kuhn, M., Gavin, A.C., Jensen, L.J. and Bork, P. (2008) Drug target identification using side-effect similarity, Science, 321, 263-266.

Dalkilic, M.M., Costello, J.C., Clark, W.T. and Radivojac, P. (2008) From protein-disease associations to disease informatics, Front Biosci, 13, 3391-3407.

Dobson, C. M. (2001) The structural basis of protein folding and its links with human disease. Philos Trans R Soc Lond B Biol Sci, 356, 133-45.

Fernandez, A., Kardos, J., Scott, L.R., Goto, Y. and Berry, R.S. (2003) Structural defects and the diagnosis of amyloidogenic propensity, Proc Natl Acad Sci U S A, 100, 6446-6451.

Freimer, N. and C. Sabatti (2003). "The human phenome project." Nat Genet 34(1): 15-21.

Freudenberg, J. and Propping, P. (2002) A similarity-based method for genome-wide prediction of disease-relevant human genes, Bioinformatics, 18 Suppl 2, S110-115.



Fu, J., M. A. Swertz, et al. (2007). "MetaNetwork: a computational protocol for the genetic study of metabolic networks." Nat Protoc 2(3): 685-94.

We here describe the MetaNetwork protocol to reconstruct metabolic networks using metabolite abundance data from segregating populations. MetaNetwork maps metabolite quantitative trait loci (mQTLs) underlying variation in metabolite abundance in individuals of a segregating population using a two-part model to account for the often observed spike in the distribution of metabolite abundance data. MetaNetwork predicts and visualizes potential associations between metabolites using correlations of mQTL profiles, rather than of abundance profiles. Simulation and permutation procedures are used to assess statistical significance. Analysis of about 20 metabolite mass peaks from a mass spectrometer takes a few minutes on a desktop computer. Analysis of 2,000 mass peaks will take up to 4 days. In addition, MetaNetwork is able to integrate high-throughput data from subsequent metabolomics, transcriptomics and proteomics experiments in conjunction with traditional phenotypic data. This way MetaNetwork will contribute to a better integration of such data into systems biology.


George, R.A., Liu, J.Y., Feng, L.L., Bryson-Richardson, R.J., Fatkin, D. and Wouters, M.A. (2006) Analysis of protein sequence and interaction data for candidate disease gene prediction, Nucleic Acids Res. 34: e130.


Golub, T. R., D. K. Slonim, et al. (1999). "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring." Science 286(5439): 531-7.


Gonzalez, G., Uribe, J.C., Tari, L., Brophy, C. and Baral, C. (2007) Mining gene-disease relationships from biomedical literature: weighting protein-protein interactions and connectivity, Pac Symp Biocomput, 12, 28-39.



Hamosh, A., A. F. Scott, et al. (2005). "Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders." Nucleic Acids Res 33(Database issue): D514-7.



Jansen, R. C. and J. P. Nap (2001). "Genetical genomics: the added value from segregation." Trends Genet 17(7): 388-91.


The recent successes of genome-wide expression profiling in biology tend to overlook the power of genetics. We here propose a merger of genomics and genetics into 'genetical genomics'. This involves expression profiling and marker-based fingerprinting of each individual of a segregating population, and exploits all the statistical tools used in the analysis of quantitative trait loci. Genetical genomics will combine the power of two different worlds in a way that is likely to become instrumental in the further unravelling of metabolic, regulatory and developmental pathways.


Kamb, A., N. A. Gruis, et al. (1994). "A cell cycle regulator potentially involved in genesis of many tumor types." Science 264(5157): 436-40.


Kann, M.G. (2007) Protein interactions and disease: computational approaches to uncover the etiology of diseases, Brief Bioinform. 8: 333-46.


Klein, T.E., Chang, J.T., Cho, M.K., Easton, K.L., Fergerson, R., Hewett, M., Lin, Z., Liu, Y., Liu, S., Oliver, D.E., Rubin, D.L., Shafa, F., Stuart, J.M. and Altman, R.B. (2001) Integrating genotype and phenotype information: an overview of the PharmGKB project. Pharmacogenetics Research Network and Knowledge Base, Pharmacogenomics J, 1, 167-170.


Lage, K., E. O. Karlberg, et al. (2007). "A human phenome-interactome network of protein complexes implicated in genetic disorders." Nat Biotechnol 25(3): 309-16.


We performed a systematic, large-scale analysis of human protein complexes comprising gene products implicated in many different categories of human disease to create a phenome-interactome network. This was done by integrating quality-controlled interactions of human proteins with a validated, computationally derived phenotype similarity score, permitting identification of previously unknown complexes likely to be associated with disease. Using a phenomic ranking of protein complexes linked to human disease, we developed a Bayesian predictor that in 298 of 669 linkage intervals correctly ranks the known disease-causing protein as the top candidate, and in 870 intervals with no identified disease-causing gene, provides novel candidates implicated in disorders such as retinitis pigmentosa, epithelial ovarian cancer, inflammatory bowel disease, amyotrophic lateral sclerosis, Alzheimer disease, type 2 diabetes and coronary heart disease. Our publicly available draft of protein complexes associated with pathology comprises 506 complexes, which reveal functional relationships between disease-promoting genes that will inform future experimentation.


Levitzki, A. and A. Gazit (1995). "Tyrosine kinase inhibition: an approach to drug development." Science 267(5205): 1782-8.



Mahner, M. and M. Kary (1997). "What exactly are genomes, genotypes and phenotypes? And what about phenomes?" J Theor Biol 186(1): 55-63.


The fundamental concepts of genome, genotype and phenotype are not defined in a satisfactory manner within the biological literature. Not only are there inconsistencies in usage between various authors, but even individual authors do not use these concepts in a consistent manner within their own writings. We have found at least five different notions of genome, seven of genotype, and five of phenotype current in the literature. Our goal is to clarify this situation by (a) defining clearly and precisely the notions of genetic complement, genome, genotype, phenetic complement, and phenotype; (b) examining that of phenome; and (c) analysing the logical structure of this family of concepts.


Mirkovic, N., M. A. Marti-Renom, et al. (2004). "Structure-based assessment of missense mutations in human BRCA1: implications for breast and ovarian cancer predisposition." Cancer Res 64(11): 3790-7.


Mooney, S.D. (2005) Bioinformatics approaches and resources for single nucleotide polymorphism functional analysis, Brief Bioinform, 6, 44-56.



Nadeau, J. H., L. C. Burrage, et al. (2003). "Pleiotropy, homeostasis, and functional networks based on assays of cardiovascular traits in genetically randomized populations." Genome Res 13(9): 2082-91.


A major problem in studying biological traits is understanding how genes work together to provide organismal structures and functions. Conventional reductionist paradigms attribute functions to particular proteins, motifs, and amino acids. An equally important but harder problem involves the synthesis of data at fundamental levels of biological systems to understand functionality at higher levels. We used subtle, naturally occurring, multigenic variation of cardiovascular (CV) properties in a panel of genetically randomized strains that are derived from the A/J and C57BL/6J strains of mice to perturb CV functions in nonpathologic ways. In this proof-of-concept study, computational analysis correctly identified the known relations among CV properties and revealed functionality at higher levels of the CV system. The network was then used to account for pleiotropies and homeostatic responses in single gene mutant mice and in mice treated with a pharmacologic agent (anesthesia). The CV network accounted for functional dependencies in complementary ways to the insights obtained from genetic networks and biochemical pathways. These networks are therefore an important approach for defining and characterizing functional relations in complex biological systems in health and disease.


Ofran, Y., M. Punta, et al. (2005). "Beyond annotation transfer by homology: novel protein-function prediction methods to assist drug discovery." Drug Discov Today 10(21): 1475-82.


Oti, M. and Brunner, H.G. (2007) The modular nature of genetic diseases, Clin Genet, 71, 1-11.


Oti, M., Snel, B., Huynen, M.A. and Brunner, H.G. (2006) Predicting disease genes using protein-protein interactions, J Med Genet, 43, 691-698.



Perez-Iratxeta, C., P. Bork, et al. (2002). "Association of genes to genetically inherited diseases using data mining." Nat Genet 31(3): 316-9.


Although approximately one-quarter of the roughly 4,000 genetically inherited diseases currently recorded in respective databases (LocusLink, OMIM) are already linked to a region of the human genome, about 450 have no known associated gene. Finding disease-related genes requires laborious examination of hundreds of possible candidate genes (sometimes, these are not even annotated; see, for example, refs 3,4). The public availability of the human genome draft sequence has fostered new strategies to map molecular functional features of gene products to complex phenotypic descriptions, such as those of genetically inherited diseases. Owing to recent progress in the systematic annotation of genes using controlled vocabularies, we have developed a scoring system for the possible functional relationships of human genes to 455 genetically inherited diseases that have been mapped to chromosomal regions without assignment of a particular gene. In a benchmark of the system with 100 known disease-associated genes, the disease-associated gene was among the 8 best-scoring genes with a 25% chance, and among the best 30 genes with a 50% chance, showing that there is a relationship between the score of a gene and its likelihood of being associated with a particular disease. The scoring also indicates that for some diseases, the chance of identifying the underlying gene is higher.


Radivojac, P., Peng, K., Clark, W.T., Peters, B.J., Mohan, A., Boyle, S.M. and Mooney, S.D. (2008) An integrated approach to inferring gene-disease associations in humans, Proteins, 72, 1030-1037.


Rossi, S., Masotti, D., Nardini, C., Bonora, E., Romeo, G., Macii, E., Benini, L. and Volinia, S. (2006) TOM: a web-based integrated approach for identification of candidate disease genes, Nucleic Acids Res, 34, W285-292.


Rost, B., J. Liu, et al. (2003). "Automatic prediction of protein function." Cell Mol Life Sci 60(12): 2637-50.



Rual, J. F., K. Venkatesan, et al. (2005). "Towards a proteome-scale map of the human protein-protein interaction network." Nature 437(7062): 1173-8.



Ryan, D. P. and J. M. Matthews (2005). "Protein-protein interactions in human disease." Curr Opin Struct Biol 15(4): 441-6.



Salwinski, L. and D. Eisenberg (2003). "Computational methods of analysis of protein-protein interactions." Curr Opin Struct Biol 13(3): 377-82.



Saunders, C. T. and D. Baker (2002). "Evaluation of structural and evolutionary contributions to deleterious mutation prediction." J Mol Biol 322(4): 891-901.



Schadt, E. E., S. A. Monks, et al. (2003). "Genetics of gene expression surveyed in maize, mouse and man." Nature 422(6929): 297-302.


Treating messenger RNA transcript abundances as quantitative traits and mapping gene expression quantitative trait loci for these traits has been pursued in gene-specific ways. Transcript abundances often serve as a surrogate for classical quantitative traits in that the levels of expression are significantly correlated with the classical traits across members of a segregating population. The correlation structure between transcript abundances and classical traits has been used to identify susceptibility loci for complex diseases such as diabetes and allergic asthma. One study recently completed the first comprehensive dissection of transcriptional regulation in budding yeast, giving a detailed glimpse of a genome-wide survey of the genetics of gene expression. Unlike classical quantitative traits, which often represent gross clinical measurements that may be far removed from the biological processes giving rise to them, the genetic linkages associated with transcript abundance affords a closer look at cellular biochemical processes. Here we describe comprehensive genetic screens of mouse, plant and human transcriptomes by considering gene expression values as quantitative traits. We identify a gene expression pattern strongly associated with obesity in a murine cross, and observe two distinct obesity subtypes. Furthermore, we find that these obesity subtypes are under the control of different loci.


Stelzl, U., U. Worm, et al. (2005). "A human protein-protein interaction network: a resource for annotating the proteome." Cell 122(6): 957-68.


Stenson, P.D., Ball, E.V., Mort, M., Phillips, A.D., Shiel, J.A., Thomas, N.S., Abeysinghe, S., Krawczak, M. and Cooper, D.N. (2003) Human Gene Mutation Database (HGMD): 2003 update, Hum Mutat, 21, 577-581.


Turner, F.S., Clutterbuck, D.R. and Semple, C.A. (2003) POCUS: mining genomic sequence annotation to predict disease genes, Genome Biol, 4, R75.



Wang, Z. and J. Moult (2001). "SNPs, protein structure, and disease." Hum Mutat 17(4): 263-70.



Wang, Z. and J. Moult (2003). "Three-dimensional structural location and molecular functional effects of missense SNPs in the T cell receptor Vbeta domain." Proteins 53(3): 748-57.



World Health Organization (2005). International Statistical Classification of Diseases and Health Related Problems. Geneva.


Zerhouni, E. (2003) The NIH roadmap, Science, 302, 63-65.