# Presented by: Tal Saiag


Oct 23, 2013


Seminar in Algorithmic Challenges in Analyzing Big Data in Biology and Medicine, with Prof. Ron Shamir @TAU

Basic Terminology

Introduction

JointCluster: A simultaneous clustering algorithm

Results

Discussion

Conclusion

Cell: the building block of life; contains a nucleus.

Chromosome: genetic material; forms the genome.

DNA

Gene: a stretch of DNA.

Proteins: multi-functional workers.

Gene expression: from gene to protein.

Transcription (DNA to RNA), followed by translation (RNA to protein).

Transcription factor

DNA microarrays / chips

Measure the expression of genes

Condition-specific

Genome-wide datasets provide different views of the biology of a cell.

Physical interactions (protein-protein) and regulatory interactions (protein-DNA) maintain and regulate the cell's processes.

Expression of molecules (proteins or transcripts of genes) provides a snapshot of the cell's state.

Researchers have exploited the complementarity of the two kinds of data.

Integrating the physical and expression datasets.

Developing an efficient solution for the combined analysis of multiple networks.

Goal: find common clusters of genes supported by all of the networks of interest.

The exact problem is computationally intractable for large networks.

The algorithm comes with theoretical guarantees (it reasonably approximates the optimal clustering).

A cut refers to a partition of the nodes of a graph into two sets.

A cut is called sparse-enough in a graph if the ratio of edges crossing the cut in the graph to the edges incident at the smaller side of the cut is smaller than a threshold specific to the graph.

Inter-cluster edges: edges with endpoints in different clusters.

Connectedness of a cluster in a graph: the cost of a set of edges is the ratio of their weight to the total edge weight in the graph.

Approximate the sparsest cut in each input graph using a spectral method.

Choose among them any cut that is sparse-enough in the corresponding graph (the graph that yielded the cut).

Recurse on the two node sets of the chosen cut,

until well-connected node sets with no sparse-enough cuts are obtained.
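The recursion outlined above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the spectral approximation is swapped for a brute-force search over all cuts (feasible only for tiny graphs), sparsity is measured as crossing weight divided by the weight incident at the smaller side, and the names `make_weights`, `cut_ratio`, `sparsest_cut`, `joint_cluster` and the two toy graphs are hypothetical.

```python
from itertools import combinations, product

def make_weights(edges):
    """Symmetric lookup: w[(u, v)] == w[(v, u)] for every undirected edge."""
    w = {}
    for u, v, weight in edges:
        w[(u, v)] = w[(v, u)] = weight
    return w

def a(w, S, T):
    """Total weight over ordered pairs (u, v) with u in S and v in T."""
    return sum(w.get((u, v), 0.0) for u, v in product(S, T))

def cut_ratio(w, V, S, C):
    """Crossing weight of the cut (S, C - S) over the weight incident at its smaller side."""
    T = C - S
    smaller_side = min(a(w, S, V), a(w, T, V))
    return a(w, S, T) / smaller_side if smaller_side > 0 else 1.0

def sparsest_cut(w, V, C):
    """Brute-force stand-in for the spectral step: try every cut of C."""
    nodes = sorted(C)
    best_ratio, best_side = float("inf"), None
    for size in range(1, len(nodes) // 2 + 1):
        for side in combinations(nodes, size):
            ratio = cut_ratio(w, V, frozenset(side), C)
            if ratio < best_ratio:
                best_ratio, best_side = ratio, frozenset(side)
    return best_ratio, best_side

def joint_cluster(graphs, thresholds, V, C=None):
    """Recurse on any cut that is sparse-enough in the graph that yielded it."""
    C = frozenset(V) if C is None else C
    if len(C) >= 2:
        for w, threshold in zip(graphs, thresholds):
            ratio, side = sparsest_cut(w, V, C)
            if ratio < threshold:
                return (joint_cluster(graphs, thresholds, V, side)
                        + joint_cluster(graphs, thresholds, V, C - side))
    return [set(C)]  # well connected: no sparse-enough cut in any graph

# Hypothetical input: two triangles joined by one bridge edge, in two graphs.
triangles = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
g1 = make_weights([(u, v, 1.0) for u, v in triangles] + [(3, 4, 1.0)])
g2 = make_weights([(u, v, 2.0) for u, v in triangles] + [(3, 4, 1.0)])
V = {1, 2, 3, 4, 5, 6}
clusters = joint_cluster([g1, g2], [0.3, 0.3], V)
```

On this toy input the bridge is cut once and each triangle is then kept whole, since no cut inside a triangle falls below the threshold in either graph.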

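Graph $G = (V, w)$, with $w(u, v) \ge 0$ for any node pair $(u, v) \in V \times V$.

Total weight of any edge set $Y$: $a(Y) = \sum_{(u, v) \in Y} w(u, v)$.

For any node sets $S, T \subseteq V$: $a(S, T) = \sum_{u \in S,\, v \in T} w(u, v)$.

Define $a(S) = a(S, V)$.

Total edge weight in the graph: $a(V)/2$.

Singletons: $a(u) = a(\{u\})$.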
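As a small worked illustration of this notation, the sketch below evaluates $a(S, T)$, $a(V)$ and a singleton $a(u)$ on a toy graph. The helper names (`make_weights`, `a`) and the four-node graph are assumptions made for the example; weights are stored symmetrically so that summing over ordered pairs counts every undirected edge twice, which is why the total edge weight is $a(V)/2$.

```python
from itertools import product

def make_weights(edges):
    """Symmetric lookup: w[(u, v)] == w[(v, u)] for every undirected edge."""
    w = {}
    for u, v, weight in edges:
        w[(u, v)] = w[(v, u)] = weight
    return w

def a(w, S, T):
    """a(S, T): sum of w(u, v) over all ordered pairs with u in S and v in T."""
    return sum(w.get((u, v), 0.0) for u, v in product(S, T))

# Hypothetical toy graph: a triangle a-b-c plus a pendant edge c-d of weight 2.
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0), ("c", "d", 2.0)]
w = make_weights(edges)
V = {"a", "b", "c", "d"}

a_V = a(w, V, V)              # every undirected edge is counted twice
total_edge_weight = a_V / 2   # equals the plain sum of the edge weights
a_c = a(w, {"c"}, V)          # singleton: the weighted degree of node c
```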



The conductance of a cut $(S, T = C \setminus S)$ in a node set $C$: $\dfrac{a(S, T)}{\min(a(S), a(T))}$.

The inter-cluster edges: $X = \{(u, v) : u \text{ and } v \text{ belong to different clusters}\}$.

An $(\alpha, \varepsilon)$-clustering of $G$ is a partition of its nodes into clusters such that:

The conductance of the clustering is at least $\alpha$, and

The total weight of the inter-cluster edges $X$ is at most an $\varepsilon$ fraction of the total edge weight in the graph; i.e., $a(X) \le \frac{\varepsilon}{2}\, a(V)$.
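The two quantities in this definition can be computed directly on a small example. The sketch below is illustrative only: it evaluates the conductance of a single cut and the inter-cluster fraction $2\,a(X)/a(V)$ for one clustering, and it does not search over all cuts inside each cluster, which a full check of the conductance condition would require. The function names and the toy graph are assumptions.

```python
from itertools import product

def make_weights(edges):
    """Symmetric lookup: w[(u, v)] == w[(v, u)] for every undirected edge."""
    w = {}
    for u, v, weight in edges:
        w[(u, v)] = w[(v, u)] = weight
    return w

def a(w, S, T):
    """a(S, T): sum of w(u, v) over ordered pairs with u in S and v in T."""
    return sum(w.get((u, v), 0.0) for u, v in product(S, T))

def conductance(w, V, S, C):
    """a(S, T) / min(a(S), a(T)) for the cut (S, T) of C, where T is C minus S."""
    T = C - S
    return a(w, S, T) / min(a(w, S, V), a(w, T, V))

def intercluster_fraction(w, V, clusters):
    """2 a(X) / a(V): share of the total edge weight that runs between clusters."""
    label = {u: k for k, cluster in enumerate(clusters) for u in cluster}
    # w stores both directions of each edge, so halve the sum to count edges once.
    a_X = sum(wt for (u, v), wt in w.items() if label[u] != label[v]) / 2
    return 2 * a_X / a(w, V, V)

# Hypothetical toy graph: two unit-weight triangles joined by the bridge 3-4.
w = make_weights([(1, 2, 1.0), (2, 3, 1.0), (1, 3, 1.0),
                  (4, 5, 1.0), (5, 6, 1.0), (4, 6, 1.0), (3, 4, 1.0)])
V = {1, 2, 3, 4, 5, 6}
left, right = {1, 2, 3}, {4, 5, 6}

phi = conductance(w, V, left, V)                  # one crossing edge over weight 7
eps = intercluster_fraction(w, V, [left, right])  # one of seven edges is inter-cluster
```

Here both values come out to 1/7, so the two-triangle clustering satisfies the inter-cluster condition for any $\varepsilon \ge 1/7$.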

Since finding the sparsest cut in a graph is an NP-hard problem, an approximation algorithm for the problem is used.

Efficient spectral techniques.

Spectral Algorithm:

Find the top $k$ right singular vectors $v_1, v_2, \ldots, v_k$ of the input matrix $A$ (using SVD).

Let $C$ be the matrix whose $j$th column is given by $A v_j$.

Place row $i$ in cluster $j$ if $C_{ij}$ is the largest entry in the $i$th row of $C$.
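The three steps of the spectral algorithm map closely onto a few NumPy calls. The sketch below is a minimal illustration, assuming NumPy is available; the function name `spectral_assign` and the block-structured matrix are hypothetical. One step is an addition that the slide does not mention: singular vectors are only determined up to sign, so each column of $C$ is oriented to have a non-negative sum before the row-wise comparison.

```python
import numpy as np

def spectral_assign(A, k):
    """Assign each row of A to one of k clusters using the top-k right singular vectors."""
    # numpy returns the right singular vectors as the rows of Vt,
    # ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    C = A @ Vt[:k].T  # column j of C is A v_j
    # Added assumption: fix the arbitrary sign of each singular vector so that
    # the corresponding column of C has a non-negative sum.
    signs = np.where(C.sum(axis=0) < 0, -1.0, 1.0)
    # Row i goes to the cluster j holding the largest entry of row i.
    return np.argmax(C * signs, axis=1)

# Hypothetical input: two blocks of different strength, so the top two
# singular values (4 and 2) are distinct.
A = np.array([[2.0, 2.0, 0.0, 0.0],
              [2.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
labels = spectral_assign(A, k=2)
```

With this input the first two rows share one label and the last two rows share the other.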

Consider $n$ graphs $G_i = (V, w_i)$, $i = 1, \ldots, n$.

An $(\alpha, \varepsilon)$ simultaneous clustering:

The conductance of the clustering is at least $\alpha_i$ in graph $G_i$ for all $i$, and

The total weight of the inter-cluster edges $X$ is at most an $\varepsilon$ fraction of the total edge weight in all graphs; i.e., $\sum_i a_i(X) \le \frac{\varepsilon}{2} \sum_i a_i(V)$.

Inter-cluster edge cost: $\dfrac{2 \sum_i a_i(X)}{\sum_i a_i(V)}$.

A cut in $G_i$ is sparse-enough if the conductance of the cut is at most $\alpha_i$.
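The inter-cluster edge cost pools the crossing weight and the total weight over all graphs before taking the ratio. A short sketch, with hypothetical helper names and two hypothetical graphs on the same four nodes, makes the arithmetic concrete.

```python
def make_weights(edges):
    """Symmetric lookup: w[(u, v)] == w[(v, u)] for every undirected edge."""
    w = {}
    for u, v, weight in edges:
        w[(u, v)] = w[(v, u)] = weight
    return w

def a_V(w):
    """a(V): every undirected edge contributes twice (once per direction)."""
    return sum(w.values())

def a_X(w, label):
    """a(X): inter-cluster edge weight, each undirected edge counted once."""
    return sum(wt for (u, v), wt in w.items() if label[u] != label[v]) / 2

def intercluster_edge_cost(graphs, label):
    """2 * sum_i a_i(X) / sum_i a_i(V) for one clustering shared by all graphs."""
    return 2 * sum(a_X(w, label) for w in graphs) / sum(a_V(w) for w in graphs)

# Hypothetical input: a path 1-2-3-4 in the first graph, two disjoint
# edges in the second, and the shared clustering {1, 2} | {3, 4}.
g1 = make_weights([(1, 2, 1.0), (2, 3, 1.0), (3, 4, 1.0)])
g2 = make_weights([(1, 2, 2.0), (3, 4, 2.0)])
label = {1: 0, 2: 0, 3: 1, 4: 1}

cost = intercluster_edge_cost([g1, g2], label)
```

Only the edge 2-3 of the first graph crosses the clustering, so the cost is $2 \cdot 1 / (6 + 8) = 1/7$.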

The mixture graph $G_i^{\sigma}$ for graph $G_i$ at scale $\sigma$ has the weight function:

$w_i^{\sigma}(u, v) = w_i(u, v) + 2^{-\sigma} \sum_{j \ne i} w_j(u, v)$

The heuristic finds sparsest cuts in mixture graphs.

The heuristic starts with the sum graph to control edges lost in all graphs, and transitions through a series of mixture graphs that approach the individual graphs to refine the clusters.
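The transition from the sum graph towards an individual graph can be illustrated with a small sketch. It assumes the scale enters as a factor $2^{-\sigma}$ on the other graphs' weights, which is one reading consistent with the behaviour described above (scale 0 gives the sum graph, larger scales approach $G_i$); the function name `mixture_weights` and the toy weights are hypothetical, and each edge is stored in one direction only for brevity.

```python
def mixture_weights(graphs, i, sigma):
    """Weights of the mixture graph for graphs[i] at scale sigma.

    Assumed form: w_i(u, v) + 2**(-sigma) * sum of w_j(u, v) over j != i.
    """
    pairs = set()
    for w in graphs:
        pairs.update(w)  # collect every node pair that has weight in some graph
    scale = 2.0 ** (-sigma)
    return {
        pair: graphs[i].get(pair, 0.0)
        + scale * sum(w.get(pair, 0.0) for j, w in enumerate(graphs) if j != i)
        for pair in pairs
    }

# Hypothetical input: two graphs on nodes a, b, c.
g1 = {("a", "b"): 1.0}
g2 = {("a", "b"): 3.0, ("b", "c"): 2.0}

sum_graph = mixture_weights([g1, g2], i=0, sigma=0)  # plain sum of both graphs
finer = mixture_weights([g1, g2], i=0, sigma=2)      # other graph down-weighted by 1/4
```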

Combined with a cut selection heuristic:

Choose the cut that is sparse-enough in the largest number of input graphs.
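A minimal sketch of this selection rule follows, assuming a list of candidate cuts is already available (for example one per input graph) and that "sparse-enough" means conductance at most the graph's threshold. The names `choose_cut` and `conductance`, the three path graphs, and the thresholds are hypothetical.

```python
from itertools import product

def make_weights(edges):
    """Symmetric lookup: w[(u, v)] == w[(v, u)] for every undirected edge."""
    w = {}
    for u, v, weight in edges:
        w[(u, v)] = w[(v, u)] = weight
    return w

def a(w, S, T):
    """Sum of w(u, v) over ordered pairs with u in S and v in T."""
    return sum(w.get((u, v), 0.0) for u, v in product(S, T))

def conductance(w, V, S, C):
    """Conductance of the cut (S, C - S); taken as 1.0 if one side has no weight."""
    T = C - S
    smaller_side = min(a(w, S, V), a(w, T, V))
    return a(w, S, T) / smaller_side if smaller_side > 0 else 1.0

def choose_cut(graphs, alphas, V, C, candidates):
    """Return the candidate that is sparse-enough in the most graphs, with its count."""
    best, best_count = None, 0
    for S in candidates:
        count = sum(conductance(w, V, S, C) <= alpha
                    for w, alpha in zip(graphs, alphas))
        if count > best_count:
            best, best_count = S, count
    return best, best_count

# Hypothetical input: the path 1-2-3-4 with different edge weights per graph.
path = [(1, 2), (2, 3), (3, 4)]
g1 = make_weights([(u, v, wt) for (u, v), wt in zip(path, [5.0, 1.0, 5.0])])
g2 = make_weights([(u, v, wt) for (u, v), wt in zip(path, [3.0, 1.0, 3.0])])
g3 = make_weights([(u, v, wt) for (u, v), wt in zip(path, [1.0, 1.0, 1.0])])
V = frozenset({1, 2, 3, 4})
candidates = [frozenset({1}), frozenset({1, 2})]

best, count = choose_cut([g1, g2, g3], [0.2, 0.2, 0.2], V, V, candidates)
```

The middle cut has conductance 1/11 and 1/7 in the first two graphs and 1/3 in the third, so it is selected with a count of two, while the singleton cut is sparse-enough in none.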

The modularity score of a cluster in a graph is the fraction of edges contained within the cluster minus the fraction expected by chance.

The partition of $V$ that respects the clustering tree and optimizes the min-modularity score can be found by dynamic programming.




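$\mathrm{OPT}(r) = \arg\max \{\, \min_i \mathrm{Modularity}_i(\{r\}),\ \min_i \mathrm{Modularity}_i(\mathrm{OPT}(r_1) \cup \mathrm{OPT}(r_2)) \,\}$, for a node $r$ of the clustering tree with children $r_1$ and $r_2$.

Clusters are ordered by their min-modularity scores.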
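The tree-parsing step can be sketched as a bottom-up recursion that, at every tree node, compares keeping the node as one cluster against splitting it into the best partitions of its children. The sketch makes two assumptions that are not taken from the slides: the chance term uses the usual squared-degree-share form, and a partition is scored by summing the min-modularity scores of its clusters (which makes the recursion exact for that score). All names and the toy tree are hypothetical.

```python
from itertools import product

def make_weights(edges):
    """Symmetric lookup: w[(u, v)] == w[(v, u)] for every undirected edge."""
    w = {}
    for u, v, weight in edges:
        w[(u, v)] = w[(v, u)] = weight
    return w

def a(w, S, T):
    """Sum of w(u, v) over ordered pairs with u in S and v in T."""
    return sum(w.get((u, v), 0.0) for u, v in product(S, T))

def modularity(w, V, cluster):
    """Within-cluster share of edge weight minus the share expected by chance."""
    total = a(w, V, V)
    return a(w, cluster, cluster) / total - (a(w, cluster, V) / total) ** 2

def min_modularity(graphs, V, cluster):
    """Smallest modularity score of the cluster over all graphs."""
    return min(modularity(w, V, cluster) for w in graphs)

def best_partition(tree, graphs, V):
    """tree = (nodes, children). Returns (score, clusters) for the subtree."""
    nodes, children = tree
    whole = min_modularity(graphs, V, nodes)
    if not children:
        return whole, [set(nodes)]
    parts = [best_partition(child, graphs, V) for child in children]
    split_score = sum(score for score, _ in parts)
    if split_score > whole:
        return split_score, [c for _, clusters in parts for c in clusters]
    return whole, [set(nodes)]

# Hypothetical input: two triangles joined by a bridge, in two graphs,
# and a clustering tree whose root splits into the two triangles.
triangles = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
g1 = make_weights([(u, v, 1.0) for u, v in triangles] + [(3, 4, 1.0)])
g2 = make_weights([(u, v, 2.0) for u, v in triangles] + [(3, 4, 1.0)])
V = frozenset({1, 2, 3, 4, 5, 6})
tree = (V, [(frozenset({1, 2, 3}), []), (frozenset({4, 5, 6}), [])])

score, clusters = best_partition(tree, [g1, g2], V)
```

The root kept whole scores 0 in both graphs, while each triangle scores 5/28 in the weaker graph, so the split is preferred with a total of 5/14.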

We desire an unsupervised method for learning the conductance threshold $\alpha_i$ for each network of interest $G_i$.

Algorithm for each graph $G_i$:

Cluster only $G_i$ using JointCluster, setting $\alpha_i$, without loss of generality, to its maximum possible value of 1.

Set $\alpha_i$ to the minimum conductance threshold that would result in the same set of clusters.

Goal: automatically choose a threshold that is both sufficiently low and sufficiently high.


Alternative algorithms:

Tree: Choose one of the input graphs $G_i$ as a reference, cluster this single graph using an efficient spectral clustering method to obtain a clustering tree, and parse this tree into clusters using the min-modularity score computed from all graphs.

Coassociation: Cluster each graph separately using a spectral method, combine the resulting clusters from different graphs into a coassociation graph, and cluster this graph using the same spectral method.





An intra-cluster pair is a pair of elements that belong to a single cluster.

Jaccard Index = $\dfrac{\#\ \text{element pairs that are intra-cluster w.r.t. both clusterings}}{\#\ \text{element pairs that are intra-cluster w.r.t. either of the clusterings}}$
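The pair-counting form of the Jaccard index is short enough to state as code. In this sketch the function names and the two example clusterings are hypothetical, and returning 1.0 when neither clustering has any intra-cluster pair is a convention chosen here, not something the slides specify.

```python
from itertools import combinations

def intra_pairs(clusters):
    """All unordered element pairs that fall inside a single cluster."""
    return {frozenset(pair)
            for cluster in clusters
            for pair in combinations(sorted(cluster), 2)}

def jaccard_index(clustering_a, clustering_b):
    """Pairs intra-cluster in both clusterings over pairs intra-cluster in either."""
    pairs_a, pairs_b = intra_pairs(clustering_a), intra_pairs(clustering_b)
    either = pairs_a | pairs_b
    # Convention assumed here: two clusterings with no intra-cluster pairs agree fully.
    return len(pairs_a & pairs_b) / len(either) if either else 1.0

# Hypothetical clusterings of the elements 1..5.
first = [{1, 2, 3}, {4, 5}]
second = [{1, 2}, {3, 4, 5}]
score = jaccard_index(first, second)
```

The pairs {1, 2} and {4, 5} are intra-cluster in both clusterings, out of six pairs that are intra-cluster in at least one, giving 2/6.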

Two yeast strains grown under two conditions, where glucose or ethanol was the predominant carbon source.

Coexpression networks using all 4,482 profiled genes as nodes.

Weight of an edge: the absolute value of the Pearson correlation coefficient between the expression profiles of the two genes.
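The edge-weight rule can be illustrated on a handful of made-up expression profiles. The gene names and values below are hypothetical, the helper names are assumptions, and the sketch does not handle constant profiles (whose correlation is undefined).

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    norm_x = sqrt(sum((xi - mean_x) ** 2 for xi in x))
    norm_y = sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return cov / (norm_x * norm_y)

def coexpression_network(profiles):
    """Edge weight for each gene pair: absolute Pearson correlation of their profiles."""
    return {(g, h): abs(pearson(profiles[g], profiles[h]))
            for g, h in combinations(sorted(profiles), 2)}

# Hypothetical expression profiles over four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],    # perfectly anti-correlated with geneA
    "geneD": [1.0, -1.0, 1.0, -1.0],  # weakly related to geneA
}
weights = coexpression_network(profiles)
```

Taking the absolute value means that strongly anti-correlated genes, such as geneA and geneC here, receive the same high edge weight as strongly correlated ones.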

Physical gene-protein interactions (from various interaction databases): a total of 41,660 non-redundant interactions.

GO Process: Genes in each reference set in this class are annotated to the same GO Biological Process term.

TF (Transcription Factor) Perturbations: Genes in each set have altered expression when a TF is deleted or overexpressed.

Compendium of Perturbations: Genes in each set have altered expression under deletions of specific genes, or chemical perturbations.

TF Binding Sites: Genes in a set have binding sites of the same TF in their upstream genomic regions, with sites predicted using ChIP binding data.

eQTL Hotspots: Certain genomic regions exhibit a significant excess of linkages of expression traits to genotypic variations.


Sensitivity: the fraction of reference sets that are enriched for genes belonging to some cluster output by the method. [coverage]

Specificity: the fraction of clusters that are enriched for genes belonging to some reference set. [accuracy]
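Both measures reduce to counting over a cluster-by-reference-set table of enrichment calls. The sketch below assumes a one-sided hypergeometric test with a fixed, uncorrected p-value cutoff as the enrichment criterion; the slides do not state which test or cutoff was used, so the test, the cutoff of 0.05, the function names and the toy gene sets are all assumptions.

```python
from math import comb

def enrichment_pvalue(cluster, reference, n_genes):
    """Hypergeometric P(overlap >= observed) for a cluster against a reference set."""
    k = len(cluster & reference)
    K, n = len(reference), len(cluster)
    tail = sum(comb(K, x) * comb(n_genes - K, n - x)
               for x in range(k, min(K, n) + 1))
    return tail / comb(n_genes, n)

def sensitivity_specificity(clusters, references, n_genes, p_cutoff=0.05):
    """Fractions of reference sets and of clusters that take part in an enrichment."""
    enriched = [[enrichment_pvalue(c, r, n_genes) <= p_cutoff for r in references]
                for c in clusters]
    # Sensitivity: reference sets enriched in at least one cluster.
    hit_refs = sum(any(row[j] for row in enriched) for j in range(len(references)))
    # Specificity: clusters enriched for at least one reference set.
    hit_clusters = sum(any(row) for row in enriched)
    return hit_refs / len(references), hit_clusters / len(clusters)

# Hypothetical input: 20 genes labelled 0..19, two clusters, two reference sets.
clusters = [{0, 1, 2, 3, 4}, {10, 11, 12, 13, 14}]
references = [{0, 1, 2, 3, 5}, {15, 16, 17, 18, 19}]
sensitivity, specificity = sensitivity_specificity(clusters, references, n_genes=20)
```

Only the first cluster and the first reference set overlap significantly (four shared genes out of five), so one of two reference sets and one of two clusters count as enriched.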

Comparing JointCluster against methods that integrate only a single coexpression network with a physical network.

Combined <glucose+ethanol> coexpression network and the physical network.

Comparing on fair terms for all algorithms:

Setting the minimum cluster size parameter in Matisse to 10.

Size limit of 100 genes for JointCluster.

Co-clustering did not have a parameter to directly limit cluster size.

Heterogeneous large-scale datasets are accumulating at a rapid pace.

Efforts to integrate them are intensifying.

JointCluster provides a versatile approach to integrating any number of heterogeneous datasets.

Natural progression from clustering of single to multiple datasets.

Testing the JointCluster algorithm on simulated datasets.

Testing JointCluster on yeast empirical datasets.

More flexible than two-network clustering methods.

Consistent with known biology, and extends our knowledge.

JointCluster can handle multiple heterogeneous networks.

Enables better coverage of genes, especially when knowledge of physical interactions is less complete.

Unsupervised and exploratory approach to data integration.

The challenge: integrating multiple datasets in order to study different aspects of biological systems.

Proposed simultaneous clustering of multiple networks.

Efficient solution that permits certain theoretical guarantees.

Effective scaling heuristic.

Flexibility to handle multiple heterogeneous networks.

Results of JointCluster:

More robust, and can handle high false positive rates.

More consistently enriched for various reference classes.

Yields better coverage.

Agrees with the known biology of yeast.


Bibliography:

(2010), PLoS Computational Biology. Simultaneous Clustering of Multiple Gene Expression and Physical Interaction Datasets.

Supplementary Text for “Simultaneous clustering of multiple gene expression and physical interaction datasets”.

CPP Source code

Kannan R, Vempala S, Vetta A (2000), Proceedings Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp 367-377. On clusterings -

Shi J, Malik J (2000), IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 22: 888-905. Normalized cuts and image segmentation.

Andersen R, Lang KJ (2008), Proceedings Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp 651-660. An algorithm for improving graph partitions.
