Proteomics: Integrating Interaction Networks


Strategies for the analysis of genome-wide experiments

THINK ABOUT TITLES: up to 6 words, with a field

ning read, rj read

--

graph inc figure

--

network figure

--


hypernotes

--

proteomics: mining interaction networks

proteomics: combining interaction networks

proteomics: networking the networks

genomics: combining heterogeneous information

bioinformatics:...

proteomics: combining interactomes

proteomics: integrating interactomes
--
combining interaction networks


Proteomics: Integrating Interaction Networks

(synergy, whole/part, combining, interrelating, networks, wiring)


With the genome sequences as a practical scaffold and ???? inspiration, we are increasingly seeing the advent of experiments providing information about complete genomes. One of the major challenges in post-genomic biology will be the integration of all this information into a useful and comprehensive definition of gene function. Exactly what shape such a definition of function will take is still an open question. However, one valuable way of partially defining and circumscribing gene function is through protein-protein interaction networks. On page ??? of this issue, Tong and co-workers provide a systematic strategy to identify protein-protein interaction networks for peptide recognition domains. Their strategy is of interest because it is a general one, combining "orthogonal" datasets from different experimental and computational approaches.


Their procedure involves intersecting two protein-protein interaction networks. The first is derived from phage-display data, as follows:


(i) Screen phage-display peptide libraries for binding to particular recognition domains and use the results to define consensus binding sequences. (ii) On the basis of these consensus sequences, computationally derive a protein-protein interaction network that links each peptide recognition module to proteins containing a recognized peptide ligand.

The second is derived experimentally:


(iii) Experimentally derive a protein-protein interaction network by testing each peptide recognition module for association with each possible protein using the yeast two-hybrid system.
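As a rough sketch of how the two networks could be intersected in practice, the following Python fragment uses entirely hypothetical consensus motifs, protein sequences, and two-hybrid results (none of the identifiers come from the Tong et al. data):

    import re

    # Hypothetical consensus motifs from phage display, one per SH3 domain,
    # written as regular expressions (the patterns are purely illustrative).
    consensus = {
        "SH3_A": r"R..P..P",
        "SH3_B": r"P..P.R",
    }

    # Hypothetical proteome: protein name -> sequence.
    proteome = {
        "prot1": "MSRTLPVAPKQ",
        "prot2": "MKPAAPQRSST",
        "prot3": "MGGSAAAPLLN",
    }

    # (ii) Computationally predicted network: each domain is linked to every
    # protein whose sequence contains a peptide matching its consensus motif.
    predicted = {
        dom: {name for name, seq in proteome.items() if re.search(pat, seq)}
        for dom, pat in consensus.items()
    }

    # (iii) Hypothetical two-hybrid results for the same domains.
    two_hybrid = {
        "SH3_A": {"prot1", "prot3"},
        "SH3_B": {"prot2"},
    }

    # Intersection of the two networks: keep only links supported by both.
    overlap = {dom: predicted[dom] & two_hybrid[dom] for dom in consensus}
    print(overlap)   # {'SH3_A': {'prot1'}, 'SH3_B': {'prot2'}}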



Tong et al. apply this approach to determining the interacting partners of SH3 domains in yeast. The SH3 domain is a good target for this approach because it is one of the most common peptide recognition domains and is involved in a number of important biological processes, such as cytoskeleton reorganization and signal transduction.



The strategy employed by Tong et al. can be understood in the context of the different possibilities for reducing noise in genomic datasets.


One of the fallacies in dealing with genomic datasets is trying to extract meaning from individual measurements. [[emphasize fallacy]] In many genome-wide datasets (e.g., expression or interaction data), the noise on individual points is very high, making the interpretation of individual measurements unreliable; such data are best viewed in aggregate. There are a number of strategies for aggregation:

1. Use large, aggregate protein classes or categories to collect many measurements (examples: translatome paper etc.)

2. Repeat experiments and calculate averages (done in microarray experiments); see the sketch after this list.

3. Combine data from different, "orthogonal" sources (both computationally and experimentally derived data)
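As a toy illustration of the second strategy (with made-up numbers rather than real microarray measurements), averaging replicates shrinks the noise on the estimate roughly in proportion to one over the square root of the number of repeats:

    import random
    import statistics

    random.seed(0)

    # Toy model of strategy 2: a gene's true expression level observed with
    # heavy noise; averaging replicates recovers the signal.
    true_level = 5.0
    noise_sd = 2.0   # noise on a single measurement is large relative to the signal

    single = random.gauss(true_level, noise_sd)
    replicates = [random.gauss(true_level, noise_sd) for _ in range(10)]
    averaged = statistics.mean(replicates)

    print(f"one noisy measurement:  {single:.2f}")
    print(f"mean of 10 replicates:  {averaged:.2f}")
    # The standard error of the mean falls roughly as 1/sqrt(n), so the average
    # sits much closer to the true level of 5.0 than a typical single reading.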

The approach taken by Tong et al. points to this third and very powerful way: combining data from different experimental and computational procedures, preferably from "orthogonal" sources.


There have been a number of previous attempts at interrelating information in whole-genome datasets, and there are many different types of information to interrelate. Much of these data were first derived for yeast, reflecting its tractability for genetic manipulation and the great interest in its eukaryotic biology. The data range from the original chip work on gene expression by Brown and colleagues [HN1] to newer measurements: there are now limited amounts of data on protein abundance, with the promise that whole-genome datasets will be available soon (ref). There are new second-generation chips and other whole-genome techniques that directly try to characterize biochemical or phenotypic function (Snyder) and measure the essentiality of a gene. There are large datasets of protein-protein interactions and experimental characterizations of the location of proteins (e.g., whether or not they are secreted). Sequence searching can produce, for each protein, a large phylogenetic profile of its occurrence in other organisms. Much of this new whole-genome data is associated with various 'omes: e.g., transcriptome, proteome, translatome, interactome.

There have been numerous attempts at interrelating two (but not more than two!) [[emphasize two]] types of whole-genome data. For instance, the expression chip data was initially clustered by a variety of standard unsupervised approaches (e.g., hierarchical trees, K-means, self-organizing maps, etc.) [HN2] and compared to functional categories. Similar types of clustering were also done on the results of transposon-tagging and proteome chip experiments. Expression clustering was also connected with binding-site characterization. Less obviously, people have looked at the relation of protein expression to chromosomal position, protein families and folds, and subcellular localization. There has been some work relating protein-protein interactions to gene expression, and also analyzing the relationship of protein abundance and mRNA abundance.

However, there are few techniques that have synthesized more than two types of whole-genome data into a single result. One initial example synthesized whole-genome expression information, essentiality, and sequence motifs into a prediction of subcellular localization using a Bayesian framework. [[can we find more here]]


The interrelation of different types of data is the strength of the Tong et al. paper. In particular, Tong et al.'s strategy [[shorten and rework]] has several features that make it particularly effective for the identification of protein-protein interaction networks. First, both the phage-display and two-hybrid analyses take full advantage of genomic information. Second, the two approaches are highly orthogonal in their respective strengths and weaknesses. Phage display uses in vitro binding and short synthetic peptides, whereas two-hybrid uses in vivo binding and native proteins or protein domains. The network derived from phage display is computationally predicted but uses relatively unambiguous binding sites, whereas the two-hybrid network is experimentally derived but subject to inherent false positives.


Interrelation and cross-referencing of different types of whole-genome data is potentially very powerful: if one relates pieces of information that are not obviously connected, one can sometimes discover non-obvious correlations -- e.g., amino acid usage and expression level [[work by others]]. Furthermore, many of the whole-genome datasets represent complementary and orthogonal information about protein function; hence, combining them can potentially provide information that no individual type could ever provide. [[eisenberg paper]]


One of the specific powers of combining genomic data is that it allows the extraction of useful information from noisy datasets. Even when noise is a problem in the individual experiments and computations, the aggregated dataset might still contain useful information that is less contaminated by false positives and negatives. For instance, combining three data sources with false positive and false negative rates of 10% each in the right way reduces the overall false positive and false negative rates to 2.8%.
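One way to arrive at this figure -- assuming independent errors and a simple 2-out-of-3 majority rule, which is our assumption here since the combination scheme is not spelled out -- is the following short calculation:

    from math import comb

    p = 0.10   # per-dataset false positive / false negative rate
    n = 3      # number of independent data sources

    # Probability that a 2-out-of-3 majority of the sources is wrong about a
    # given protein pair, assuming the errors are independent across sources.
    error_rate = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(2, n + 1))
    print(f"{error_rate:.3f}")   # 0.028, i.e. 2.8%

By contrast, requiring all three sources to agree would push the false positive rate down to 0.1% but raise the false negative rate to about 27%, which is the trade-off discussed below.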


[[could be cut]] Because of the differing (i.e., heterogeneous) quality of whole-genome information, synthesizing it is often tricky from a database and data-mining standpoint. Hence, analyses synthesizing a diverse array of whole-genome information have to be fairly computational by definition.


Consider, for instance, the case of (binary-type) protein-protein interactions. How should one proceed in combining different datasets that measure protein-protein interactions?

There are two extremes in this procedure:


Imagine that an experimental procedure exhibits very low false positive (FP) rates; in other words, the protein pairs with positive signals are almost certainly interacting. This is usually achieved by interpreting the results of an interaction experiment in a conservative way, with the side effect that the false negative (FN) rate is high, that is, a lot of actually interacting protein pairs are not detected. In this situation the benefit of combining different datasets of this type comes from looking at the union of these datasets. As a general rule, any protein pair that tests positive in at least one of the datasets would be considered a positive. See, for example, the paper by Schwikowski et al. [HN3]

The other extreme is the case of datasets with very low FN rates; in other words, almost no protein pair that is truly interacting is missed by the method. This high coverage would usually be associated with a high FP rate, that is, a lot of protein pairs that are actually not interacting would be found positive in the individual datasets. In this situation the benefit of combining different datasets comes from looking at their intersection. Only protein pairs that test positive in all datasets would then be considered positive overall, thus yielding more reliable results. An example of this latter approach is the paper by Tong et al.
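As a toy illustration of these two extremes, the following sketch (with entirely made-up protein pairs) takes the union of two hypothetical conservative datasets and the intersection of two hypothetical high-coverage ones:

    # Hypothetical binary interaction datasets, each a set of protein pairs
    # (pairs are written in a fixed order here to keep the sketch simple).
    conservative_1  = {("A", "B"), ("C", "D")}   # low FP, low coverage
    conservative_2  = {("C", "D"), ("E", "F")}
    high_coverage_1 = {("A", "B"), ("C", "D"), ("E", "F"), ("G", "H"), ("A", "C")}
    high_coverage_2 = {("A", "B"), ("C", "D"), ("E", "F"), ("A", "C"), ("B", "D")}

    # Low-FP extreme: any single positive is trustworthy, so take the union.
    union_positives = conservative_1 | conservative_2

    # Low-FN extreme: a true interaction should show up everywhere, so intersect.
    intersection_positives = high_coverage_1 & high_coverage_2

    print(sorted(union_positives))         # [('A', 'B'), ('C', 'D'), ('E', 'F')]
    print(sorted(intersection_positives))  # [('A', 'B'), ('A', 'C'), ('C', 'D'), ('E', 'F')]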

In most practical situations, the integration of datasets would be somewhere between these two extremes. The task would be to combine datasets with varying degrees of conservativeness and coverage. Accordingly, the rules for predicting which protein pairs are positive would become more complicated. Instead of just looking at unions or intersections of the individual positives (i.e., at least one positive among conservative datasets and all positives in datasets with high coverage), different combinations of positive and negative signals from the datasets would have to be considered.


In the figure we provide a practical illustration of the power of interrelating data in the context of yeast, based on the union approach. It shows how the degree to which we can find the known protein-protein associations in complexes based on genomic datasets increases as we use more datasets that are orthogonal in some respect. We first see how many links are in the union of all the Y2H datasets. Then we add in the number of links that could be found from expression clustering. Finally, we see how many further links could be found based on an amalgam of other functional genomics data: the location, the essentiality of a gene, and its profile from large-scale transposon experiments. One can see that each additional dataset provides more of the total links that we are trying to define. Furthermore, the number of doubly verified links increases, indicating that the false positive rate is falling with the addition of more data.
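The bookkeeping behind such a figure might look roughly as follows; the reference links and datasets here are hypothetical stand-ins, not the actual yeast data:

    # Hypothetical reference links (pairs co-occurring in known complexes) and
    # three genomic datasets; names and counts are illustrative only.
    reference = {("A", "B"), ("A", "C"), ("B", "C"), ("D", "E"), ("F", "G"), ("H", "I")}

    datasets = {
        "union of Y2H sets":     {("A", "B"), ("D", "E"), ("X", "Y")},
        "expression clustering": {("A", "B"), ("A", "C"), ("F", "G")},
        "other functional data": {("B", "C"), ("D", "E"), ("H", "I")},
    }

    covered = set()
    for name, links in datasets.items():
        covered |= links & reference
        print(f"after adding {name}: {len(covered)}/{len(reference)} reference links found")

    # Links supported by at least two datasets ("doubly verified").
    support = {}
    for links in datasets.values():
        for pair in links & reference:
            support[pair] = support.get(pair, 0) + 1
    doubly_verified = [pair for pair, count in support.items() if count >= 2]
    print(f"doubly verified links: {len(doubly_verified)}")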


The more complicated combination rules discussed above could best be determined by machine learning algorithms such as decision trees or Bayesian networks.

One of the challenges for the future will be devising sophisticated mathematical strategies that allow the uniform and homogeneous integration of many disparate types of information into a unified mathematical framework relating to protein function. One of the problems is determining a relative weight for different datasets -- e.g., how does one weight expression information relative to that from transposon tagging? Bayesian approaches and decision trees may be quite powerful because they provide a way to naturally combine different data types (e.g., Boolean variables and continuous vectors). Other powerful machine learners, such as support vector machines, are not so powerful in this respect. [[cut??]] Also, ranks might be quite powerful, since they provide very robust statistical information.
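As a minimal sketch of how a Bayesian approach could naturally combine a Boolean variable with a continuous one, consider the following hand-rolled naive Bayes scorer for a candidate protein pair; the prior and all likelihood parameters are made-up numbers:

    from math import exp, pi, sqrt

    # Hand-rolled naive Bayes scoring of a candidate protein pair from two
    # heterogeneous features: a Boolean signal (say, shared essentiality) and a
    # continuous one (say, expression correlation). All numbers are made up.

    prior_interact = 0.01   # assumed prior probability that a random pair interacts

    # Boolean feature: (P(signal | interacting), P(signal | not interacting)).
    p_bool = {True: (0.6, 0.2), False: (0.4, 0.8)}

    def gauss(x, mu, sd):
        # Gaussian density used to model the continuous feature in each class.
        return exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * sqrt(2 * pi))

    def posterior(bool_signal, expr_corr):
        # Naive Bayes: multiply per-feature likelihoods, then normalize.
        like_pos, like_neg = p_bool[bool_signal]
        like_pos *= gauss(expr_corr, mu=0.5, sd=0.3)   # interacting pairs tend to co-express
        like_neg *= gauss(expr_corr, mu=0.0, sd=0.3)   # non-interacting pairs do not
        num = prior_interact * like_pos
        return num / (num + (1 - prior_interact) * like_neg)

    print(f"{posterior(True, 0.7):.3f}")   # both signals favor interaction
    print(f"{posterior(False, 0.1):.3f}")  # both signals argue against it

In log-odds form, each dataset simply contributes an additive term, which is one natural way to make the relative weighting of the datasets explicit.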

To accomplish these goals we will need to develop better databases for storing and comprehensively querying heterogeneous information. These databases need to be better in their treatment of errors and false positive rates.

Finally, we will also need to come up with a more systematic definition of gene function, which is the target of these investigations. The implicit theme of this work has been defining gene function in terms of protein-protein interactions. How far are these networks able to go? Currently, the function of a gene is often described by a word or phrase, often in non-systematic terminology. To really make the approach of experimentally characterizing and predicting protein-protein interactions useful, one will have to systematize its relationship to gene function. One can envision other genomic networks connecting proteins -- e.g., those involved in pathways and regulatory networks. [HN3]




REFERENCES

church

eisenberg

amar

amar & rj

expression set

Y2H

transposon

Schwikowski et al.