Bioinformatics - Computer Science and Engineering - University at ...

Oct 2, 2013

University at Buffalo

The State University of New York

Young-Rae Cho

Department of Computer Science and Engineering

State University of New York at Buffalo

Identification of Functional Modules and Hub
Proteins in Protein Interaction Networks

Seminar 2009



What is Bioinformatics?

Bioinformatics

Interdisciplinary research area to manage and analyze biological data



What is Bioinformatics?


[Diagram: Bioinformatics, with four labelled components]

Biological Data: Genome, Proteome, Networks

Computational Techniques: Data Mining, Machine Learning

Knowledge: Functional Characterization

Biomedical Applications: Disease Diagnosis, Drug Development

Repeated in the diagram: Networks, Data Mining, Functional Characterization



Overview



Introduction
  Protein Interaction Networks and Their Structural Properties

Preprocess - Network Weighting
  Integration of Gene Ontology using Semantic Similarity Measures

Functional Module Identification
  Weighted Interaction Networks → Functional Modules

Hub Protein Identification
  Weighted Interaction Networks → Hub Proteins

Conclusion



Biological Network



Definition


Directed or undirected graph representation


Biological molecules as nodes and biochemical reactions or biophysical interactions as edges




Examples


Metabolic networks


Signal transduction networks


Gene regulatory networks


Protein interaction networks




Importance


Provide a global view of cellular organizations and biological processes


Applicable to systematic approaches for knowledge discovery




Biological Meaning of PPI


Proteins interact with each other for stability and functionality


Most cellular functions are performed at the protein complex level


Interaction evidence is interpreted as functional coherence / consistency




Determination of PPIs


Experimental methods


Yeast two-hybrid systems, Mass spectrometry, Protein microarray


Computational methods




Homology search, Gene fusion analysis, Phylogenetic profiles




Problem of PPI data


Current PPI databases include a large number of false positives / false negatives



Unreliability


Protein-Protein Interaction (PPI)




Representation of Protein Interaction Networks


Undirected, unweighted graph $G(V, E)$:

a set of nodes $V$ as proteins and a set of edges $E$ as interactions




Problem of Protein Interaction Networks


Large scale


Complex connectivity


Protein Interaction Network
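The graph representation described above can be sketched as an adjacency map. This is a minimal illustration only: the protein names and the interaction list are hypothetical placeholders, and `build_network` is a helper name introduced here, not something from the talk.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected, unweighted graph G(V, E) as node -> set of neighbors."""
    adjacency = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # self-interactions are ignored in this sketch
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


# Hypothetical interaction list; each pair is one edge in E.
interactions = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4")]
network = build_network(interactions)

num_nodes = len(network)                                      # |V|
num_edges = sum(len(nbrs) for nbrs in network.values()) // 2  # |E|
```

Storing neighbors in sets keeps edge lookups cheap, which matters once the network reaches the scale noted on the slide.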




Small-world Phenomenon (Watts & Strogatz)


Networks lying between regular and random networks


Higher average clustering coefficient than expected by random chance


Significantly small average shortest path length




Scale-free Distribution (Barabasi & Albert)


Network growth by preferential attachment


Power law degree distribution


a few high degree nodes, many low degree nodes


Clustering coefficient distribution independent of degree


Structural Properties

Protein Interaction Database      DIP      MIPS
density                           0.0015   0.0015
average clustering coefficient    0.2283   0.2878
average shortest path length      4.14     4.43
degree distribution (γ)           1.77     1.64

high modularity

hub existence
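The three scalar properties in the table above (density, average clustering coefficient, average shortest path length) can be computed as in the following sketch. It uses the standard textbook definitions and a small hypothetical toy network; it is not the code that produced the DIP and MIPS figures, and it assumes a connected graph.

```python
from collections import deque
from itertools import combinations


def density(adj):
    """Fraction of possible edges that are present."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * m / (n * (n - 1))


def clustering_coefficient(adj, v):
    """Fraction of neighbor pairs of v that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


def average_clustering(adj):
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)


def average_shortest_path_length(adj):
    """Mean BFS distance over all ordered pairs of reachable nodes."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Hypothetical toy network: a triangle A-B-C with a pendant node D.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
d = density(adj)
cc = average_clustering(adj)
aspl = average_shortest_path_length(adj)
```

On real interaction networks the same functions would show the pattern in the table: very low density combined with a comparatively high clustering coefficient and a short average path length.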




Motivation


Unreliable protein interaction networks


Transforming an unweighted graph into a weighted graph by assigning the interaction reliability (or intensity) to each edge as a weight




Unsupervised Approaches


Using network connectivity, e.g., common neighbors, alternative paths


Problem: unreliable weights




Supervised Approaches


Using other resources that verify interactions, e.g., gene sequence, gene expression


Integrating Gene Ontology data in my work: the most comprehensive, well-curated resource


Network Weighting Schemes
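As an illustration of the unsupervised, connectivity-based idea mentioned above, the sketch below weights each edge by the share of neighbors its two endpoints have in common. The Jaccard-style formula and the toy network are assumptions chosen for illustration; the slide names common neighbors only as one example and does not give a formula.

```python
def common_neighbor_weights(adj):
    """Weight each edge (a, b) by |N(a) & N(b)| / |(N(a) | N(b)) - {a, b}|."""
    weights = {}
    for a in adj:
        for b in adj[a]:
            if a < b:  # visit each undirected edge once
                common = adj[a] & adj[b]
                union = (adj[a] | adj[b]) - {a, b}
                weights[(a, b)] = len(common) / len(union) if union else 0.0
    return weights


# Hypothetical toy network: a triangle A-B-C with a pendant node D.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
edge_weights = common_neighbor_weights(adj)
```

The pendant edge C-D receives weight 0 here even if it is a true interaction, which illustrates the "unreliable weights" problem noted for purely topological schemes.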




Structure


Terms (Concepts): well-defined biological description


Relationships: “is-a” / “part-of” (general-to-specific) between terms




Annotation


If a protein is annotated to a term, then it is also annotated to the terms on the paths towards the root.



Gene Ontology (GO)

DAG → Transitivity

[Figure: example GO DAG with the terms cell growth & maintenance, cell organization, cytoplasm organization, mitochondrion organization, ribosome biogenesis, metabolism, nucleic acid metabolism, RNA metabolism, RNA processing, transcription, DNA-dependent transcription, and rRNA processing. Each term is labelled with its annotated proteins, given as subsets of P1 to P6; the largest set is P1, P2, P3, P4, P5, P6.]
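The transitivity rule above (an annotation to a term implies annotation to every term on the paths to the root) can be sketched as follows. The term names are taken from the figure, but the parent links and the direct annotations used here are hypothetical, since the figure's edges were not preserved.

```python
def propagate_annotations(parents, direct):
    """Return term -> proteins after pushing direct annotations up the DAG."""
    full = {term: set(proteins) for term, proteins in direct.items()}

    def ancestors(term, seen):
        # Collect every term reachable by following parent links.
        for parent in parents.get(term, ()):
            if parent not in seen:
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    for term, proteins in direct.items():
        for anc in ancestors(term, set()):
            full.setdefault(anc, set()).update(proteins)
    return full


# Hypothetical child -> parents links (a DAG: one term may have two parents).
parents = {
    "rRNA processing": ["RNA processing", "ribosome biogenesis"],
    "RNA processing": ["RNA metabolism"],
    "RNA metabolism": ["metabolism"],
    "ribosome biogenesis": ["cell organization"],
    "metabolism": ["cell growth & maintenance"],
    "cell organization": ["cell growth & maintenance"],
}
# Hypothetical direct annotations.
direct = {
    "rRNA processing": {"P1"},
    "RNA processing": {"P2"},
    "ribosome biogenesis": {"P3"},
}
full = propagate_annotations(parents, direct)
```

After propagation the annotation sets can only grow towards the root, which matches the pattern of protein sets shown in the figure.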




Reliability of Interacting Proteins


Average (or maximum) semantic similarity over the pairwise terms whose annotations include the interacting proteins




Structure-based Approaches


Path length or Common parent terms


Problem: requires that all edges represent uniform specificity




Information Content-based Approaches


Information content of a term $T$ is defined as $-\log P(T)$


$\mathrm{sim}_{xy} = -\log P_i(x, y)$


where $P_i(x, y)$ is the proportion of the annotations of the term including $x$ and $y$


Normalized $\mathrm{sim}_{xy} = \dfrac{2 \times \log P_{ij}(x, y)}{\log P_i(x) + \log P_j(y)}$


Semantic Similarity
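The two information-content measures above can be worked through on a small example. This sketch makes assumptions the slide leaves open: the term used for a protein or protein pair is taken to be the most specific (smallest) term annotating it, every protein is assumed to be annotated to the root, and the annotation sets and protein names are hypothetical.

```python
import math


def smallest_term_proportion(annotations, proteins, total):
    """Proportion of all proteins annotated to the smallest term covering `proteins`."""
    sizes = [len(members) for members in annotations.values() if proteins <= members]
    return min(sizes) / total


def semantic_similarity(annotations, x, y, total):
    """sim_xy = -log P(x, y)."""
    return -math.log(smallest_term_proportion(annotations, {x, y}, total))


def normalized_similarity(annotations, x, y, total):
    """Normalized sim_xy = 2 log P(x, y) / (log P(x) + log P(y))."""
    p_xy = smallest_term_proportion(annotations, {x, y}, total)
    p_x = smallest_term_proportion(annotations, {x}, total)
    p_y = smallest_term_proportion(annotations, {y}, total)
    denom = math.log(p_x) + math.log(p_y)
    if denom == 0:
        return 0.0  # both proteins appear only at the root: no information
    return 2 * math.log(p_xy) / denom


# Hypothetical annotations after propagation towards the root.
annotations = {
    "root": {"P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"},
    "metabolism": {"P1", "P2", "P3", "P4"},
    "RNA processing": {"P1", "P2"},
    "transcription": {"P3"},
}
total = 8
sim_12 = semantic_similarity(annotations, "P1", "P2", total)
norm_12 = normalized_similarity(annotations, "P1", "P2", total)
norm_13 = normalized_similarity(annotations, "P1", "P3", total)
```

P1 and P2 share a specific term, so their normalized similarity reaches 1, while P1 and P3 meet only at the broader term and score lower. The normalized form stays in [0, 1], which makes it convenient as an edge weight.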




Functional Module


A set of molecules that participate in the same biological processes or functions


Sub-network with dense intra-connections and sparse inter-connections




Functional Module Identification



Graph clustering problem




Previous Clustering Approaches


Density-based methods, e.g., maximum clique, quasi clique, clique percolation


Partition-based methods, e.g., restricted neighborhood search, Markov clustering


Hierarchical methods



Bottom-up approaches, e.g., distance-based, common neighbors



Top-down approaches, e.g., minimum cut, betweenness cut


Functional Module Identification
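The structural definition of a module given above (dense inside, sparse towards the rest of the network) can be checked for a candidate node set as in this sketch. The function name, the toy network, and the choice of reporting intra-module density plus a count of boundary edges are illustrative assumptions, not a method from the talk.

```python
def module_connectivity(adj, module):
    """Return (intra-module edge density, number of edges leaving the module)."""
    module = set(module)
    # Each internal edge is seen from both endpoints, hence the division by 2.
    intra = sum(1 for u in module for v in adj[u] if v in module) // 2
    inter = sum(1 for u in module for v in adj[u] if v not in module)
    possible = len(module) * (len(module) - 1) // 2
    intra_density = intra / possible if possible else 0.0
    return intra_density, inter


# Hypothetical toy network: a triangle A-B-C with a pendant node D.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
triangle = module_connectivity(adj, {"A", "B", "C"})
pair = module_connectivity(adj, {"C", "D"})
```

The triangle is fully connected internally with a single outgoing edge, so it looks like a module; the pair C-D has more edges leaving it than inside it.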




Functional Influence










Influence factors: normalized weights, inverse of degree




Measurements


Single-path-based method: $O(|V| + |E|)$


All-path-based method: NP


Random-walk-based method: $O(|V|^3) \times \text{iteration} \approx O(|V|^4)$



Functional Influence Model

Improvement by an efficient algorithm



Information Flow Simulation


Computation of functional influence $\mathrm{inf}_s(x)$ of $s$ on $x \in V$ based on random walks


Input: a weighted interaction network and a source node $s$


Output: functional influence pattern of $s$



Algorithm

1. Initialize $\mathrm{inf}_s(s)$

2. Compute initial flow $f_{\mathrm{init}}(s \to y)$ by [equation not preserved]

3. Update $\mathrm{inf}_s(y)$ by [equation not preserved]

4. Compute flow $f_s(y \to z)$ by [equation not preserved]

5. Repeat 3 and 4 until $f_s(y \to z)$ is less than a threshold $\theta$


Flow Simulation
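The five steps above can be sketched as a round-based simulation. The flow formulas themselves were not preserved, so this sketch assumes one plausible form based on the influence factors named earlier (edge weight and inverse of degree): a node forwards the flow it just received to each neighbor, scaled by `weight / degree`. The initial influence of 1.0, the `max_steps` safeguard, and the example network are also assumptions.

```python
def flow_simulation(weights, source, theta=0.01, max_steps=100):
    """weights: node -> {neighbor: weight in (0, 1]}, symmetric. Returns node -> influence."""
    influence = {node: 0.0 for node in weights}
    influence[source] = 1.0                  # step 1 (assumed initial value)
    incoming = {source: 1.0}                 # flow that arrived in the previous round
    for _ in range(max_steps):               # guard: weights of 1.0 would never decay
        outgoing = {}
        for y, amount in incoming.items():
            degree = len(weights[y])
            for z, w in weights[y].items():
                flow = amount * w / degree   # steps 2 and 4 (assumed formula)
                if flow >= theta:            # trivial flow is dropped immediately
                    outgoing[z] = outgoing.get(z, 0.0) + flow
        if not outgoing:
            break                            # step 5: every remaining flow is below theta
        for z, flow in outgoing.items():
            influence[z] += flow             # step 3: accumulate incoming flow
        incoming = outgoing
    return influence


# Hypothetical weighted network.
weights = {
    "S": {"A": 0.8, "B": 0.4},
    "A": {"S": 0.8, "B": 0.5},
    "B": {"S": 0.4, "A": 0.5, "C": 0.9},
    "C": {"B": 0.9},
}
pattern = flow_simulation(weights, "S", theta=0.05)
```

The returned dictionary is the influence pattern of the source over all nodes. Running the function once per source node yields one pattern per protein, which is the input a pattern-clustering step would group.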



Schematic View

[Figure: schematic of information flow spreading from a source node S, labelled 1.0, through an example network; the other numeric labels range from 0.11 to 1.74.]

Pattern Clustering




Efficiency


Traces only connected nodes to calculate the functional influence of a source


Removes trivial flow (less than $\theta$) as early as possible




Run Time


Theoretical upper bound is unknown (does not depend on the network diameter)


Tested potential factors (number of nodes, density, average degree) with synthetic networks


Time Complexity




Experiment


Data: yeast protein interaction network from DIP


Pattern clustering: pCluster algorithm (Wang et al., SIGMOD 2002)




Evaluation


Functional categories and annotations from MIPS



Hypergeometric p-value




Result


Accuracy
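The hypergeometric p-value used in the evaluation above measures how unlikely it is that a module shares at least a given number of proteins with a functional category by chance. The sketch below uses the standard upper-tail formula; the parameter names and the example numbers are illustrative, not results from the talk.

```python
from math import comb


def hypergeometric_p_value(total, category_size, module_size, overlap):
    """P(at least `overlap` of the module's proteins fall in the category).

    total: proteins in the network, category_size: proteins in the
    functional category, module_size: proteins in the module.
    """
    upper = min(category_size, module_size)
    numerator = sum(
        comb(category_size, i) * comb(total - category_size, module_size - i)
        for i in range(overlap, upper + 1)
    )
    return numerator / comb(total, module_size)


# Hypothetical numbers: 10 proteins, a category of 4, a module of 3
# of which 2 belong to the category.
p = hypergeometric_p_value(total=10, category_size=4, module_size=3, overlap=2)
```

A smaller p-value means the module is more strongly enriched for the category, so a module is typically assigned the category with the lowest p-value.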




Hub Protein


Centrally located node in the modular structure of a protein interaction network (a structural hub)


Functionally essential protein




Previous Centrality Measurements


Closeness centrality


Betweenness centrality


Bridging centrality


Hub Protein Identification
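Of the previous measures listed above, closeness centrality is the simplest to illustrate. This sketch uses the standard unweighted definition (number of reachable nodes divided by the sum of shortest-path distances to them) on a hypothetical toy network; it is not the weighted closeness used later in the talk.

```python
from collections import deque


def closeness_centrality(adj, node):
    """(reachable nodes) / (sum of BFS distances from `node`)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Hypothetical toy network: a triangle A-B-C with a pendant node D.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
scores = {v: closeness_centrality(adj, v) for v in adj}
most_central = max(scores, key=scores.get)
```

Node C is adjacent to every other node and therefore scores highest, while the pendant node D scores lowest.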




Functional Influence










Influence factors: normalized weights, inverse of degree




Measurements


Single-path-based method: $O(|V| + |E|)$


All-path-based method: NP



Random-walk-based method: $O(|V|^3) \times \text{iteration} \approx O(|V|^4)$



Functional Influence Model

Improvement by a heuristic algorithm




Single-path-based path strength: [equation not preserved]



All-path-based path strength:


sums up the $k$-length path strength for all possible $k$

[equation not preserved]

uses a threshold on the maximum $k$


Path Strength
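The difference between the single-path and all-path variants above can be illustrated with a small exhaustive search. The path strength formula was not preserved, so this sketch assumes that a path's strength is the product of `weight / degree` over its edges, in line with the influence factors named earlier. It enumerates simple paths only and is meant for tiny examples; it does not reproduce the complexity figures quoted in the talk.

```python
def path_strengths(weights, source, target, max_k):
    """Return (strongest single path, sum over all simple paths of length <= max_k)."""
    best, total = 0.0, 0.0
    stack = [(source, 1.0, {source})]
    while stack:
        node, strength, visited = stack.pop()
        if node == target:
            best = max(best, strength)   # single-path variant keeps the strongest
            total += strength            # all-path variant sums every path
            continue
        if len(visited) - 1 == max_k:
            continue                     # threshold on the maximum path length k
        degree = len(weights[node])
        for nbr, w in weights[node].items():
            if nbr not in visited:
                # Assumed edge factor: normalized weight times inverse of degree.
                stack.append((nbr, strength * w / degree, visited | {nbr}))
    return best, total


# Hypothetical weighted network.
weights = {
    "S": {"A": 0.8, "B": 0.4},
    "A": {"S": 0.8, "B": 0.5},
    "B": {"S": 0.4, "A": 0.5, "C": 0.9},
    "C": {"B": 0.9},
}
single_k2, all_k2 = path_strengths(weights, "S", "B", max_k=2)
single_k1, all_k1 = path_strengths(weights, "S", "B", max_k=1)
```

Raising `max_k` lets the all-path value pick up the indirect route through A, while the single-path value stays at the strength of the direct edge.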




Network Conversion


Input: a protein interaction network / Output: a hierarchical tree structure




Algorithm


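Centrality (weighted closeness) of a node $a$: [equation not preserved]


Set of ancestor nodes $T(a)$ of $a$: [equation not preserved]


Parent node $p(a)$ of $a$: [equation not preserved]



Hub Confidence Measurement


Set of child nodes $D(a)$ of $a$: [equation not preserved]


Set of descendant nodes $L_a$ of $a$: [equation not preserved]


Hub confidence $H(a)$ of $a$: [equation not preserved]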


Network Conversion



Schematic View



Hub Confidence


How strongly a node plays a role as a structural hub


Does not fully depend on the hierarchical level in the tree structure




Top 10 Structural Hubs in the Yeast Protein Interaction Network


Not related to their degree


Each one has several different functions



Structural Hubs




Biological Essentiality


Evaluated by comparing with lethal proteins


Lethality has been determined by protein knock-out experiments




Result



Lethality




Problems


Complex and unreliable connectivity in protein interaction networks




Contributions


Reliable network generation by edge weighting


Hidden knowledge discovery, e.g., patterns or taxonomy


Collaboration with existing computational techniques




Future Work


Integration with multiple data sources


Comparative analysis across organisms




Conclusion



Questions?