Biological Networks - Department of Computing - Imperial College ...

341: Introduction to Bioinformatics

Dr. Nataša Pržulj

Department of Computing

Imperial College London

natasha@imperial.ac.uk


Winter 2011


Topics


Introduction to biology (cell, DNA, RNA, genes, proteins)


Sequencing and genomics (sequencing technology, sequence
alignment algorithms)


Functional genomics and microarray analysis (array technology,
statistics, clustering and classification)


Introduction to biological networks


Introduction to graph theory


Network properties


Network/node centralities


Network motifs


Network models


Network/node clustering


Network comparison/alignment


Software tools for network analysis


Interplay between topology and biology


Network Comparisons:

Properties of Large Networks



Large network comparison is computationally hard due to the NP-completeness of the underlying subgraph isomorphism problem:

Given two graphs G and H as input, determine whether G contains a subgraph that is isomorphic to H.
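To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not an algorithm from the lecture: it simply tries every injective mapping of H's nodes into G's nodes, which is why its running time explodes with network size. The function name `contains_subgraph` and the two toy graphs are illustrative assumptions.

```python
from itertools import permutations

def contains_subgraph(g_edges, g_nodes, h_edges, h_nodes):
    """Return True if G has a (not necessarily induced) subgraph isomorphic to H."""
    g_edge_set = {frozenset(e) for e in g_edges}
    h_nodes = list(h_nodes)
    # Try every injective assignment of H's nodes to G's nodes (exponential cost).
    for image in permutations(g_nodes, len(h_nodes)):
        mapping = dict(zip(h_nodes, image))
        # The mapping works if every edge of H lands on an edge of G.
        if all(frozenset((mapping[a], mapping[b])) in g_edge_set for a, b in h_edges):
            return True
    return False

# Hypothetical toy graphs: G is a 4-cycle, H is either a 3-node path or a triangle.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
path3 = [("a", "b"), ("b", "c")]
triangle = [("a", "b"), ("b", "c"), ("c", "a")]

print(contains_subgraph(square, range(4), path3, "abc"))
print(contains_subgraph(square, range(4), triangle, "abc"))
```

The 4-cycle contains a 3-node path but no triangle, so the two calls disagree. The number of candidate mappings grows as n!/(n-k)!, which is the practical reason heuristics are used instead.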




Thus, network comparisons rely on easily computable heuristics
(approximate solutions), called “network properties”




Network properties can roughly, and historically, be divided into two categories:

1. Global network properties: give an overall view of the network, but might not be detailed enough to capture complex topological characteristics of large networks.

2. Local network properties: more detailed network descriptors, which usually encompass a larger number of constraints, thus reducing the degrees of freedom in which the networks being compared can vary.

1. Global Network Properties

Readings: Chapter 3 of “Analysis of Biological Networks” by Junker and Schreiber



Global Network Properties:

1) Degree distribution
2) Average clustering coefficient
3) Clustering spectrum
4) Average diameter
5) Spectrum of shortest path lengths
6) Centralities

1) Degree Distribution

Definitions:

The degree of a node is the number of edges incident to the node.

Average degree of a network: the average of the degrees over all nodes in the network.

However, the average degree might not be representative, since the distribution of degrees might be skewed.

[Figure: a node x with deg(x) = 5]


Degree distribution:

Let P(k) be the percentage of nodes of degree k in the network. The degree distribution is the distribution of P(k) over all k.

P(k) can be understood as the probability that a node has degree k.
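The definitions of degree, average degree, and P(k) can be sketched in a few lines of Python. This is only an illustration: the helper names and the star-shaped toy network (a hub x with five leaves, so deg(x) = 5) are assumptions, and P(k) is returned as a fraction rather than a percentage.

```python
from collections import Counter

def degrees(edges):
    """Degree of every node that appears in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)

def degree_distribution(edges):
    """P(k): fraction of nodes that have degree k."""
    deg = degrees(edges)
    counts = Counter(deg.values())
    return {k: c / len(deg) for k, c in sorted(counts.items())}

# Hypothetical toy network: hub "x" joined to five leaves.
star = [("x", leaf) for leaf in range(1, 6)]
deg = degrees(star)
average_degree = sum(deg.values()) / len(deg)
P = degree_distribution(star)

print(deg["x"], average_degree, P)
```

In this toy network the average degree is about 1.67, although no node has a degree anywhere near that value: five nodes have degree 1 and one has degree 5. That is the sense in which the average can be unrepresentative of a skewed distribution.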


Example:

[Figure: degree distribution, log-log plot]


Here $P(k) \sim k^{-\gamma}$, where often $2 \le \gamma < 3$. This is a power-law, heavy-tailed distribution.

Networks with power-law degree distributions are called scale-free networks. In them, most of the nodes are of low degree, but there is a small number of highly linked nodes (nodes of high degree) called “hubs.”

Another example:

[Figure: a degree distribution for which the average degree is meaningful]

Here P(k) is a Poisson distribution.

However: degree distribution (and global properties in general) are weak predictors of network structure.

Illustration:

[Figure: two graphs, G and H]

G and H are of the same size (i.e., |G| = |H|: they have the same number of nodes and edges) and they have the same degree distribution, but G and H have very different topologies (i.e., graph structure).
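A small sketch of this point, using two hypothetical graphs that are not the ones in the slide's figure: a 6-cycle and two disjoint triangles. Both have 6 nodes, 6 edges, and every node of degree 2, yet one is connected and the other is not.

```python
from collections import deque

def adjacency(nodes, edges):
    """Adjacency sets for an undirected graph."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def degree_sequence(adj):
    return sorted(len(neighbors) for neighbors in adj.values())

def count_components(adj):
    """Number of connected components, found by repeated breadth-first search."""
    seen, components = set(), 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return components

# Assumed example graphs (same size, same degree distribution).
hexagon = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
two_triangles = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]
G = adjacency(range(6), hexagon)
H = adjacency(range(6), two_triangles)

print(degree_sequence(G) == degree_sequence(H))
print(count_components(G), count_components(H))
```

The degree sequences are identical, but G has one connected component and H has two, so the degree distribution alone cannot tell them apart.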


Research debates…

Assortative vs. disassortative mixing of degrees:
  - Do high-degree nodes interact with high-degree nodes?
  - Done by:
      - Pearson correlation coefficient between degrees of adjacent vertices
      - Average neighbor degree; then average over all nodes of degree k

Structural robustness and attack tolerance:
  - “Robust, yet fragile”

Scale-free degree distribution:
  - “Party” vs. “date” hubs
      - J.D. Han et al., Nature, 430:88-93, 2004
  - Bias in the data collection: sampling?
      - M. Stumpf et al., PNAS, 102:4221-4224, 2005
      - J. Han et al., Nature Biotechnology, 23:839-844, 2005

High-degree nodes:
  - Essential genes
      - H. Jeong et al., Nature, 411, 2001
  - Disease/cancer genes
      - Jonsson and Bates, Bioinformatics, 22(18), 2006
      - Goh et al., PNAS, 104(21), 2007
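The two ways of measuring degree mixing listed above (Pearson correlation between the degrees of adjacent vertices, and average neighbor degree averaged over nodes of degree k) can be sketched as follows. The function names and the star-shaped toy network are assumptions made for illustration; each undirected edge is counted in both directions when computing the correlation.

```python
from collections import defaultdict
from math import sqrt

def node_degrees(edges):
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def degree_correlation(edges):
    """Pearson correlation between the degrees at the two ends of an edge."""
    deg = node_degrees(edges)
    xs, ys = [], []
    for u, v in edges:          # use both orientations of each undirected edge
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")     # undefined when all degrees are equal
    return cov / sqrt(var_x * var_y)

def average_neighbor_degree_by_k(edges):
    """Average neighbor degree per node, then averaged over all nodes of degree k."""
    deg = node_degrees(edges)
    neighbor_sum = defaultdict(int)
    for u, v in edges:
        neighbor_sum[u] += deg[v]
        neighbor_sum[v] += deg[u]
    per_k = defaultdict(list)
    for node, k in deg.items():
        per_k[k].append(neighbor_sum[node] / k)
    return {k: sum(vals) / len(vals) for k, vals in sorted(per_k.items())}

# Hypothetical toy network: a hub "h" joined to four leaves.
star = [("h", leaf) for leaf in range(4)]
r = degree_correlation(star)
knn = average_neighbor_degree_by_k(star)
print(r, knn)
```

A negative correlation, as in this star, indicates disassortative mixing (high-degree nodes attach to low-degree nodes); a positive value would indicate assortative mixing.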


2) Average Clustering Coefficient

Definition: the clustering coefficient C_v of a node v:

$C_v = |E(N(v))| \,/\, (\text{max possible number of edges in } N(v))$

where N(v) is the neighborhood of v, i.e., all nodes adjacent to v, and E(N(v)) is the set of edges between nodes in N(v).

C_v can be viewed as the probability that two neighbors of v are connected. Thus $0 \le C_v \le 1$.

By definition, C_v = 0 for a vertex v of degree 0 or 1.


Example:

[Figure: a node v with four neighbors, labeled 1 to 4]

|N(v)| = 4, since there are 4 nodes in N(v), i.e., N(v) = {1, 2, 3, 4}.

|E(N(v))| = 3, since there are 3 edges between nodes in N(v).

The max possible number of edges between nodes in N(v) is choose(4,2) = 6.

Therefore C_v = 3/6 = 1/2.
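The worked example can be checked with a short Python sketch. The figure is not available, so the three edges among the neighbors are an assumption (1-2, 2-3, 3-4); any choice of three edges among the four neighbors gives the same C_v.

```python
from itertools import combinations

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def clustering_coefficient(adj, v):
    """C_v = (edges among neighbors of v) / (max possible edges among them)."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # by definition for nodes of degree 0 or 1
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

# v is adjacent to nodes 1..4; the edges among the neighbors are assumed.
edges = [("v", 1), ("v", 2), ("v", 3), ("v", 4), (1, 2), (2, 3), (3, 4)]
adj = build_adjacency(edges)
print(clustering_coefficient(adj, "v"))
```

Here k(k-1)/2 is choose(4,2) = 6 for k = 4, so the function returns 3/6 = 0.5 for v.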

Definition: the average clustering coefficient, C, of a network is the average of C_v over all the nodes v ∈ V.


3) Clustering Spectrum

Definition: the clustering spectrum, C(k), is the distribution of the average clustering coefficients of all nodes of degree k in the network, over all k.

Example:

[Figure: graph G with nodes A, B, C, and D marked]

C_v: clustering coefficient of node v

  C_A = 1/1 = 1
  C_B = 1/3 = 0.33
  C_C = 0
  C_D = 2/10 = 0.2

C: average clustering coefficient of the whole network
  = avg {C_v over all nodes v of G}

C(k): average clustering coefficient of all nodes of degree k

  E.g.: C(2) = (C_A + C_C)/2 = (1+0)/2 = 0.5
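The average clustering coefficient C and the clustering spectrum C(k) can be sketched as below. Since the graph G from the slide is not recoverable, the example uses a hypothetical toy network (a triangle a-b-c with one extra node d attached to c); the function names are likewise illustrative.

```python
from collections import defaultdict
from itertools import combinations

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def clustering_coefficient(adj, v):
    k = len(adj[v])
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

def average_clustering(adj):
    """C: average of C_v over all nodes."""
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

def clustering_spectrum(adj):
    """C(k): average C_v over all nodes of degree k, for each k."""
    by_degree = defaultdict(list)
    for v in adj:
        by_degree[len(adj[v])].append(clustering_coefficient(adj, v))
    return {k: sum(cs) / len(cs) for k, cs in sorted(by_degree.items())}

# Assumed toy network: triangle a-b-c plus a pendant node d attached to c.
adj = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
C = average_clustering(adj)
spectrum = clustering_spectrum(adj)
print(C, spectrum)
```

In this toy network C_a = C_b = 1, C_c = 1/3, and C_d = 0, so C = 7/12 and the spectrum has C(1) = 0, C(2) = 1, and C(3) = 1/3.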


=> Clustering spectrum

[Figure: an example clustering spectrum (not for G)]

Need to evaluate whether the value of C (or any other property) is statistically significant.



4) Average Diameter

Definition: the distance between two nodes is the smallest number of links that have to be traversed to get from one node to the other.

Definition: the shortest path is the path that achieves that distance.

Definition: the average network diameter is the average of shortest path lengths over all pairs of nodes in a network.



5) Spectrum of Shortest Path Lengths

Definition: let S(d) be the percentage of node pairs that are at distance d. The spectrum of shortest path lengths is the distribution of S(d) over d.

Example:

[Figure: graph G with nodes u and v marked]

Distance between a pair of nodes u and v:

  D_{u,v} = min {length of all paths between u and v}
          = min {3, 4, 3, 2} = 2 = dist(u, v)

Average diameter of the whole network:

  D = avg {D_{u,v} for all pairs of nodes {u, v} in G}

Spectrum of the shortest path lengths:

[Figure: an example spectrum of shortest path lengths (not for G)]
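Distances, the average diameter D, and the spectrum S(d) can all be obtained from breadth-first search in an unweighted network. The sketch below is one straightforward way to do it; the helper names and the 4-node path used as input are assumptions, S(d) is returned as a fraction rather than a percentage, and unreachable pairs are simply skipped.

```python
from collections import Counter, deque

def bfs_distances(adj, source):
    """Shortest-path length (number of links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def pairwise_distances(adj):
    """D_{u,v} for every unordered pair of connected nodes."""
    nodes = list(adj)
    out = {}
    for i, u in enumerate(nodes):
        dist = bfs_distances(adj, u)
        for v in nodes[i + 1:]:
            if v in dist:
                out[(u, v)] = dist[v]
    return out

def average_diameter(adj):
    d = pairwise_distances(adj)
    return sum(d.values()) / len(d)

def shortest_path_spectrum(adj):
    """S(d): fraction of node pairs at distance d."""
    d = pairwise_distances(adj)
    counts = Counter(d.values())
    return {length: c / len(d) for length, c in sorted(counts.items())}

# Assumed toy network: the path 0 - 1 - 2 - 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
D = average_diameter(path)
S = shortest_path_spectrum(path)
print(D, S)
```

For the path, three pairs are at distance 1, two at distance 2, and one at distance 3, so D = 10/6 and S = {1: 1/2, 2: 1/3, 3: 1/6}.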





6) Node Centralities

(Readings: Chapter 3 of “Analysis of Biological Networks” by Junker and Schreiber)

Rank nodes according to their “topological importance.”

Definition: centrality quantifies the topological importance of a node (edge) in a network.

There are many different types of centralities:
  - Degree centrality
  - Closeness centrality
  - Eccentricity centrality
  - Betweenness centrality
  - Subgraph centrality
  - Eigenvector centrality

Software tools: Visone (social nets) and CentiBiN (biological nets)



Definitions:

1. Degree centrality, C_d(v): nodes with a large number of neighbors (i.e., edges) have high centrality. Therefore, we have C_d(v) = deg(v).

   Example of a use of degree centrality: in PPI networks, nodes with high degree centrality are considered to be “biologically important.” We will learn later in the course what this means.

2. Closeness centrality, C_c(v): nodes with short paths to all other nodes in the network have high closeness centrality.

   C_c(v) = [equation lost in extraction]



3. Betweenness centrality, C_b(v): nodes (or edges) which occur in many of the shortest paths have high betweenness centrality.

   C_b(v) = [equation lost in extraction]

   The above summation means that there is a sum on the top and on the bottom of the fraction.

   σ_st(v) = the number of shortest paths from s to t that pass through v

   σ_st = the number of all shortest paths from s to t (they may or may not pass through node v)



4. Eccentricity centrality, C_e(v): nodes with short paths to any other node have high eccentricity centrality.

   The eccentricity of a node v is defined as

   $ecc(v) = \max_{u \in V} dist(v, u)$

   So it is the maximum shortest path length from node v to any other node u in V.

   Eccentricity centrality of a node v:

   $C_e(v) = 1 / ecc(v)$

   Thus, central nodes have higher C_e since they have lower ecc.
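A minimal sketch of eccentricity and eccentricity centrality, assuming a connected network and the common definition C_e(v) = 1/ecc(v), which is consistent with the statement that central nodes have higher C_e because they have lower ecc. The function names and the 5-node path are illustrative assumptions.

```python
from collections import deque

def bfs_distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def eccentricity(adj, v):
    """Largest shortest-path length from v to any other node (connected graph assumed)."""
    return max(bfs_distances(adj, v).values())

def eccentricity_centrality(adj, v):
    # Assumed definition: the reciprocal of the eccentricity.
    return 1 / eccentricity(adj, v)

# Assumed toy network: the path 0 - 1 - 2 - 3 - 4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
for v in path:
    print(v, eccentricity(path, v), eccentricity_centrality(path, v))
```

The middle node 2 has eccentricity 2 and the end nodes have eccentricity 4, so node 2 receives the highest C_e.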


There exist many other definitions of node centralities.


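Example:

[Figure: example network with nodes A to J]

Nodes ranked by centrality, from highest to lowest:

Rank         Degree     Closeness   Betweenness
1 (highest)  D          F, G        H
2            F, G       D, H        F, G
3            A, B       A, B        I
4            C, E, H    C, E        D
5            I          I           A, B
6 (lowest)   J          J           C, E, J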
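The rankings in this example coincide with those of the well-known Krackhardt kite graph, so the sketch below assumes that the missing figure shows that network, with nodes lettered A to J. Closeness is taken here as 1 / (sum of distances from v) and betweenness as the standard sum over pairs of σ_st(v)/σ_st, computed with Brandes' algorithm; both are common textbook choices and may differ in normalization from the formulas on the slides.

```python
from collections import deque

# Assumed edge list: the Krackhardt kite graph with nodes lettered A..J.
EDGES = ["AB", "AC", "AD", "AF", "BD", "BE", "BG", "CD", "CF",
         "DE", "DF", "DG", "EG", "FG", "FH", "GH", "HI", "IJ"]

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def closeness(adj):
    out = {}
    for s in adj:
        dist, queue = {s: 0}, deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        out[s] = 1 / sum(dist.values())
    return out

def betweenness(adj):
    """Brandes' algorithm for unweighted, undirected graphs."""
    cb = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest s-v paths
        sigma[s] = 1
        dist, queue = {s: 0}, deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):       # accumulate dependencies back towards s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    return {v: c / 2 for v, c in cb.items()}  # each unordered pair was counted twice

def ranking(scores):
    """Groups of tied nodes, from highest score to lowest."""
    groups = {}
    for node, score in scores.items():
        groups.setdefault(round(score, 9), []).append(node)
    return [sorted(groups[s]) for s in sorted(groups, reverse=True)]

adj = build_adjacency(EDGES)
degree_rank = ranking({v: len(adj[v]) for v in adj})
closeness_rank = ranking(closeness(adj))
betweenness_rank = ranking(betweenness(adj))
print(degree_rank, closeness_rank, betweenness_rank, sep="\n")
```

Under these assumptions the lowest betweenness group comes out as C, E, J: each of these nodes lies on no shortest path between other nodes.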

You need to know how to compute these centralities (and all other network properties) by hand on small networks.

For large real-world networks, you could use software, e.g., CentiBiN:

http://centibin.ipk-gatersleben.de/

Network Properties

2. Local Network Properties

(Chapter 5 of the course textbook “Analysis of Biological Networks” by Junker and Schreiber)

They encompass a larger number of constraints, thus reducing the degrees of freedom in which networks being compared can vary.

  - How do we show that two networks are different?
  - How do we show that they are the same?
  - How do we quantify the level of similarity?


1) Network motifs
2) Graphlets

   Two network comparison measures based on graphlets:

   2.1) Relative Graphlet Frequency Distance between two networks
   2.2) Graphlet Degree Distribution Agreement between two networks

1) Network Motifs (Uri Alon’s group, 2002-2004)

Definition: a network motif is a small over-represented partial subgraph of a real network.

Here, over-represented means that it is over-represented when compared to networks coming from a random graph model.

Problem: what is expected at random, i.e., which network “null model” to use to identify motifs?

Example of a random graph model: Erdős-Rényi (ER) random graphs

Definition:
  - A graph on n nodes (for some positive integer n)
  - Edges are added between pairs of nodes uniformly at random with the same probability p

ER graphs usually have a small number of dense (in terms of number of edges) subgraphs:
  - There will be no regions in the network that have a large density of edges. Why?

Example:

[Figure: an ER random graph]

If motifs are identified by comparing the data with ER model networks, every dense subgraph would come up as a motif, because dense subgraphs do not exist in our ER model networks.
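The ER definition translates almost directly into code. The sketch below generates an ER graph and compares the observed numbers of edges and triangles with their expectations; the parameter values (n = 60, p = 0.1) and function names are arbitrary illustrative choices. It also hints at the answer to "Why?": because every edge appears independently with the same probability p, a fixed set of k nodes is fully connected with probability p^choose(k,2), which shrinks very quickly as k grows.

```python
import random
from itertools import combinations
from math import comb

def erdos_renyi(n, p, seed=None):
    """Each of the choose(n,2) possible edges is included independently with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def count_triangles(n, edges):
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return sum(1 for a, b, c in combinations(range(n), 3)
               if b in adj[a] and c in adj[a] and c in adj[b])

n, p = 60, 0.1  # illustrative values
edges = erdos_renyi(n, p, seed=42)
print("edges:", len(edges), "expected:", p * comb(n, 2))
print("triangles:", count_triangles(n, edges), "expected:", p ** 3 * comb(n, 3))

# Probability that a given set of k nodes forms a clique.
for k in (3, 4, 5):
    print(k, p ** comb(k, 2))
```

The exact counts depend on the random seed, so no specific output is claimed here; only the expectations are determined by n and p.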



1) Network motifs (Uri Alon’s group, ’02-’04)

Small subgraphs that are overrepresented in a network when compared to randomized networks.

Network motifs:
  - Reflect the underlying evolutionary processes that generated the network
  - Carry functional information
  - Define superfamilies of networks
      - Z_i is the statistical significance of subgraph i; SP_i is a vector of numbers in 0-1

But:
  - Functionally important but not statistically significant patterns could be missed
  - The choice of the appropriate null model is crucial, especially across “families”
  - Random graphs with the same in- and out-degree distribution as the data might not be the best network null model
  - Motifs are partial subgraphs, while we use induced ones to understand network structure
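To illustrate what "statistical significance of subgraph i" can mean in practice, the sketch below computes a z-score, a common choice assumed here: Z = (count in the real network - mean count in random networks) / standard deviation of the random counts. For simplicity it counts undirected triangles and uses random graphs with the same numbers of nodes and edges as the null model. Both simplifications are assumptions of this sketch, not the method of the cited work, and the slides themselves caution that the choice of null model is crucial.

```python
import random
from itertools import combinations
from statistics import mean, pstdev

def count_triangles(n, edges):
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return sum(1 for a, b, c in combinations(range(n), 3)
               if b in adj[a] and c in adj[a] and c in adj[b])

def random_graph_same_size(n, m, rng):
    """Null model assumed here: m edges chosen uniformly among all node pairs."""
    return rng.sample(list(combinations(range(n), 2)), m)

def triangle_z_score(n, edges, samples=200, seed=0):
    rng = random.Random(seed)
    real = count_triangles(n, edges)
    null = [count_triangles(n, random_graph_same_size(n, len(edges), rng))
            for _ in range(samples)]
    spread = pstdev(null)
    if spread == 0:
        return None  # z-score undefined if the null counts never vary
    return (real - mean(null)) / spread

# Hypothetical "real" network: two 4-node cliques joined by a single edge.
clique_a = list(combinations(range(0, 4), 2))
clique_b = list(combinations(range(4, 8), 2))
edges = clique_a + clique_b + [(3, 4)]

print(count_triangles(8, edges), triangle_z_score(8, edges))
```

A clearly positive z-score means the pattern occurs more often than in the null model. With a different null model the same pattern could look unremarkable.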


Example: feed-forward loop

Shen-Orr, Milo, Mangan, and Alon, “Network motifs in the transcriptional regulation network of Escherichia coli,” Nature Genetics, 2002

http://www.weizmann.ac.il/mcb/UriAlon/

Also, see Pajek, MAVisto, and FANMOD









2) Graphlets (Pržulj group, ’04-’10)

Different from network motifs:
  - Induced subgraphs
  - Of any frequency (don’t need to be over-represented)

N. Pržulj, D. G. Corneil, and I. Jurisica, “Modeling Interactome: Scale Free or Geometric?,” Bioinformatics, vol. 20, num. 18, pg. 3508-3515, 2004.


2.1) Relative Graphlet Frequency (RGF) distance between networks G and H

2.2) Graphlet Degree Distributions

Generalize node degree.
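As a small illustration of how graphlet degrees generalize node degree, the sketch below counts, for every node, how many times it touches each of the four positions (orbits) in the graphlets on 2 and 3 nodes: an edge (orbit 0, the ordinary degree), the end of an induced 3-node path (orbit 1), the middle of an induced 3-node path (orbit 2), and a triangle (orbit 3). The full method uses graphlets with up to 5 nodes; restricting to 3 nodes, the function name, and the toy network are simplifications assumed here.

```python
from itertools import combinations

def graphlet_degrees_3(adj):
    """Per-node counts for orbits 0..3 of the 2- and 3-node graphlets."""
    gdv = {v: [len(adj[v]), 0, 0, 0] for v in adj}  # orbit 0 is the degree
    for v in adj:
        for a, b in combinations(list(adj[v]), 2):
            if b in adj[a]:
                gdv[v][3] += 1   # v, a, b form a triangle
            else:
                gdv[v][2] += 1   # a - v - b is an induced path, v in the middle
                gdv[a][1] += 1   # a and b are the ends of that path
                gdv[b][1] += 1
    return gdv

# Assumed toy network: triangle a-b-c plus a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
for node, vector in sorted(graphlet_degrees_3(adj).items()):
    print(node, vector)
```

Each induced path is recorded once, at its unique middle node, and each triangle is counted once per member node. Only induced subgraphs are counted, in line with the distinction from motifs made above.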












N. Pržulj, “Biological Network Comparison Using Graphlet Degree Distribution,” ECCB, Bioinformatics, vol. 23, pg. e177-e183, 2007.












Network structure vs. biological function & disease

Graphlet Degree (GD) vectors, or “node signatures”

Similarity measure between “node signature” vectors

Signature similarity measure between nodes u and v:

[Figure: SMD1, PMA1, YBR095C; labeled 40%]

[Figure: SMD1, SMB1, RPO26; labeled 90%*]

*Statistically significant threshold at ~85%

Later we will see how to use this and other techniques to link network structure with biological function.

T. Milenković and N. Pržulj, “Uncovering Biological Network Function via Graphlet Degree Signatures,” Cancer Informatics, vol. 6, pg. 257-273, 2008 (Highly Visible).












Generalize the degree distribution of a network

The degree distribution measures the number of nodes “touching” k edges, for each value of k.

[Equation lost in extraction; it ends with a division by sqrt(2), to make the value lie between 0 and 1]

This is called the Graphlet Degree Distribution (GDD) Agreement between networks G and H.

N. Pržulj, “Biological Network Comparison Using Graphlet Degree Distribution,” Bioinformatics, vol. 23, pg. e177-e183, 2007.

Software that implements many of these network properties and compares networks with respect to them:

GraphCrunch
http://bio-nets.doc.ic.ac.uk/graphcrunch/
http://bio-nets.doc.ic.ac.uk/graphcrunch2/
