Development of Bioinformatics Software


Oct 1, 2013


Aim 2c. Graph Algorithms

Introduction.

The primary goals of this research section are to harness advances in algorithmic tools and computational technologies, and to employ them on computational bottlenecks relevant to other components of this project. The analysis of the massive amount and multidimensionality of data that this temporal gene expression analysis project will produce presents a formidable problem to the average biologist. One such bottleneck, the well-known clique problem, is directly applicable to this biological problem of identifying related genetic networks. In the biological datasets, cliques are suggestive of groups of genes that are often affected in a similar manner by multiple conditions. In this section of the project, graph theoretical approaches will be used to develop methods for the identification of regulatory networks of genes and key controllers of these networks that are involved in modulated gene expression as well as cellular processes during cerebellar development. This undirected approach will identify genes that appear nodal to larger sets of molecules. Our preliminary work in this area is reported in [1].

Motivation.

Current high-throughput molecular assays generate immense numbers of phenotypic values. Billions of individual hypotheses can be tested from a single BXD RI transcriptome profiling experiment. QTL mapping, however, tends to be highly focused on small sets of traits and genes. Many public users of our data resources approach the data with specific questions about particular gene-gene and/or gene-phenotype relationships [2]. These high-dimensional data sets are best understood when the correlated phenotypes are determined and analyzed simultaneously.

Data reduction via automated extraction of co-regulated gene sets from transcriptome QTL data is a challenge. Given the need to analyze efficiently tens of thousands of genes and traits, it is essential to develop tools to extract and characterize large aggregates of genes, QTLs, and highly variable traits.

There are advantages to placing our work in a graph-theoretic framework. This representation is known to be appropriate for probing and determining the structure of biological networks, including the extraction of evolutionarily conserved modules of co-expressed genes. See, for example, [3, 4]. In our effort to identify sets of putatively co-regulated genes, we take advantage of novel methods to solve clique, a classic graph-theoretic problem. Here a gene is denoted by a vertex, and a co-expression value is represented by the weight placed on an edge joining a pair of vertices. Clique is widely known for its application in a variety of combinatorial settings, a great number of which are relevant to computational molecular biology. See, for example, [5]. A considerable amount of effort has been devoted to solving clique efficiently. An excellent survey can be found in [6].
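The vertex-and-weighted-edge representation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not fix a co-expression measure, so absolute Pearson correlation is assumed here, and the names `pearson` and `coexpression_graph`, the `threshold` argument, and the toy profiles are all hypothetical rather than taken from the project's software.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_graph(profiles, threshold):
    """Return (weights, adjacency) for a dict mapping gene -> profile.

    weights:   |r| for every gene pair, i.e. the weighted graph.
    adjacency: unweighted graph keeping only edges with |r| >= threshold.
    Using |r| as the weight is an assumption made for this sketch.
    """
    weights = {}
    adjacency = {gene: set() for gene in profiles}
    for u, v in combinations(sorted(profiles), 2):
        w = abs(pearson(profiles[u], profiles[v]))
        weights[(u, v)] = w
        if w >= threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return weights, adjacency


# Toy, made-up expression profiles (not data from the project).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 2.0, 4.0],
}
weights, adjacency = coexpression_graph(profiles, threshold=0.85)
print(adjacency)
```

Thresholding turns the weighted graph into an ordinary undirected graph, which is the form of input that clique expects.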

In the context of microarray analysis, our approach can be viewed as a form of clustering. A wealth of clustering approaches has been proposed. See [7-11], to list just a few. Here the usual goal is to partition vertices into disjoint subsets, so that the genes that correspond to the vertices within each subset display some measure of homogeneity. An advantage clique holds over most traditional clustering methods is that cliques need not be disjoint. A vertex can reside in more than one (maximum or maximal) clique, just as a gene product can be involved in more than one regulatory network. There are recent clustering techniques, for example those employing factor analysis [12], that do not require exclusive cluster membership for single genes. Unfortunately, these tend to produce biologically uninterpretable factors without the incorporation of prior biological information [13]. Clique makes no such demand. Another advantage of clique is the purity of the categories it generates. There is considerable interest in solving the dense k-subgraph problem [14]. Here the focus is on a cluster’s edge density, also referred to as clustering coefficient, curvature, and even cliquishness [15, 16]. In this respect, clique is the “gold standard.” A cluster’s edge density is maximized with clique by definition.
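To make the edge-density remark concrete, here is a minimal sketch, assuming density is measured as the fraction of vertex pairs inside a cluster that are joined by an edge. The function name `edge_density` and the toy graph are hypothetical.

```python
from itertools import combinations


def edge_density(adjacency, cluster):
    """Fraction of vertex pairs in `cluster` that are joined by an edge."""
    members = list(cluster)
    pairs = len(members) * (len(members) - 1) // 2
    if pairs == 0:
        return 1.0  # convention chosen here for clusters with < 2 vertices
    present = sum(1 for u, v in combinations(members, 2) if v in adjacency[u])
    return present / pairs


# Toy graph: a, b, c form a triangle; d is attached to c only.
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(edge_density(adjacency, {"a", "b", "c"}))       # a clique: density 1.0
print(edge_density(adjacency, {"a", "b", "c", "d"}))  # 4 of the 6 pairs are edges
```

A cluster reaches density 1.0 exactly when it is a clique, which is the sense in which clique maximizes edge density by definition.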


Clique formalities. The inputs to clique are an undirected graph $G$ with $n$ vertices, and a parameter $k \le n$. The question asked is whether $G$ contains a clique of size $k$, that is, a subgraph isomorphic to $K_k$, the complete graph on $k$ vertices. The importance of $K_k$ lies in the fact that each and every pair of its vertices is joined by an edge. Subgraph isomorphism, clique in particular, is NP-complete. From this it follows that there is no known algorithm for deciding clique that runs in time polynomial in the size of the input. One could solve clique by generating and checking all $\binom{n}{k}$ candidate solutions. But this brute force approach requires $O(n^k)$ time, and is thus prohibitively slow.
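The brute force approach described above can be written directly. The sketch below is illustrative only; the function name `has_clique_brute_force` and the five-vertex toy graph are hypothetical, and the graph is assumed to be stored as a dict mapping each vertex to its set of neighbours.

```python
from itertools import combinations
from math import comb


def has_clique_brute_force(adjacency, k):
    """Decide clique by testing every k-subset of the vertices."""
    for candidate in combinations(adjacency, k):
        if all(v in adjacency[u] for u, v in combinations(candidate, 2)):
            return True
    return False


# Toy graph: vertices 1-4 are pairwise adjacent; vertex 5 touches only 4.
adjacency = {
    1: {2, 3, 4},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {1, 2, 3, 5},
    5: {4},
}
print(has_clique_brute_force(adjacency, 4))  # True: {1, 2, 3, 4}
print(has_clique_brute_force(adjacency, 5))  # False

# The number of candidates grows very quickly: even n = 100, k = 10
# already gives more than 10**13 subsets to check.
print(comb(100, 10))
```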

Clique has application in a variety of settings relevant to computational molecular biology. We focus on clique here mainly because it is by far the most challenging step in our toolchain for the analysis of microarray data (see the accompanying figure). The outputs of our toolchain, namely sets of putatively co-regulated genes, are in turn inputs to other research components in this proposal.



Figure 1. The clique-centric toolkit, and its use in microarray analysis.


Our methods are employed as illustrated in Figure 1. We shall concentrate our discussion on the classic maximum clique problem. Of course we also must handle the related problem of generating all maximal cliques once a suitable threshold has been chosen, which is itself often a function of maximum clique size. There are a variety of other issues dealing with pre- and post-processing. Although we do not explicitly deal with them in the present paper, they are for the most part quite easily handled and are dwarfed by the computational complexity of the fundamental clique problem at the heart of our method.
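The related task of generating all maximal cliques can be illustrated with the textbook Bron-Kerbosch procedure with pivoting. This is a standard technique swapped in for illustration; the text does not say which enumeration method the toolchain uses. The name `maximal_cliques`, the size cutoff, and the toy graph are hypothetical.

```python
def maximal_cliques(adjacency):
    """Yield every maximal clique of the graph (Bron-Kerbosch with pivoting)."""

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            yield frozenset(clique)  # nothing can extend it: maximal
            return
        # Pivot on the vertex with the most neighbours among the candidates;
        # only non-neighbours of the pivot need to be tried at this level.
        pivot = max(candidates | excluded,
                    key=lambda v: len(adjacency[v] & candidates))
        for v in list(candidates - adjacency[pivot]):
            yield from expand(clique | {v},
                              candidates & adjacency[v],
                              excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    yield from expand(set(), set(adjacency), set())


# Toy graph: vertices 1-4 are pairwise adjacent; vertex 5 touches only 4.
adjacency = {
    1: {2, 3, 4},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {1, 2, 3, 5},
    5: {4},
}
all_cliques = list(maximal_cliques(adjacency))
maximum_size = max(len(c) for c in all_cliques)
# Keep only the large maximal cliques, with a cutoff tied to the maximum size.
large_cliques = [c for c in all_cliques if len(c) >= maximum_size - 1]
print(all_cliques, maximum_size, large_cliques)
```

Because a vertex may appear in several of the returned cliques, the output reflects the non-disjoint membership that distinguishes clique from partition-based clustering.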

Fixed-Parameter Tractability. The origins of fixed parameter tractability can be traced at least as far back as the work done to show, via the Graph Minor Theorem, that a variety of parameterized problems are tractable when the relevant input parameter is fixed. See, for example, [17, 18]. Formally, a problem is FPT if it has an algorithm that runs in $O(f(k)\,n^c)$ time, where $n$ is the problem size, $k$ is the input parameter, and $c$ is a constant independent of both $n$ and $k$ [19]. Unfortunately, clique is not FPT unless the W hierarchy collapses. (The W hierarchy, whose lowest level is FPT, can be viewed as a fixed-parameter analog of the polynomial hierarchy, whose lowest level is P.) Thus we focus instead on clique’s complementary dual, the vertex cover problem. Consider $\bar{G}$, the complement of $G$. ($\bar{G}$ has the same vertex set as $G$, but edges present in $G$ are absent in $\bar{G}$ and vice versa.) As with clique, the inputs to vertex cover are an undirected graph $G$ with $n$ vertices, and a parameter $k \le n$. The question now asked is whether $G$ contains a set $C$ of $k$ vertices that covers every edge in $G$, where an edge is said to be covered if either or both of its endpoints are in $C$. Like clique, vertex cover is NP-complete. Unlike clique, however, vertex cover is also FPT. The crucial observation here is this: a vertex cover of size $k$ in $\bar{G}$ turns out to be exactly the complement of a clique of size $n - k$ in $G$. Thus, we search for a minimum vertex cover in $\bar{G}$, thereby finding the desired maximum clique in $G$. Currently, the fastest known vertex cover algorithm runs in $O(1.2852^k + kn)$ time [18]. Contrast this with $O(n^k)$. The requisite exponential growth (assuming $P \ne NP$) is therefore reduced to a mere additive term.
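The duality between clique and vertex cover can be sketched as follows. The search used here is the simplest bounded search tree, which branches on the two endpoints of an uncovered edge and therefore explores at most $2^k$ leaves; it is far weaker than the $O(1.2852^k + kn)$ algorithm cited above and is shown only to illustrate how the parameter bounds the search. The names `complement`, `vertex_cover`, and `maximum_clique`, and the toy graph, are hypothetical.

```python
def complement(adjacency):
    """Complement graph: same vertices, an edge exactly where G has none."""
    vertices = set(adjacency)
    return {v: vertices - adjacency[v] - {v} for v in adjacency}


def vertex_cover(adjacency, k):
    """Return a vertex cover of size <= k, or None if none exists.

    Bounded search tree: pick any remaining edge (u, v); one endpoint
    must be in the cover, so branch on both choices with budget k - 1.
    """
    edge = next(((u, v) for u in adjacency for v in adjacency[u]), None)
    if edge is None:
        return set()   # no edges left: the empty set covers everything
    if k == 0:
        return None    # edges remain but the budget is spent
    for chosen in edge:
        remaining = {w: adjacency[w] - {chosen}
                     for w in adjacency if w != chosen}
        cover = vertex_cover(remaining, k - 1)
        if cover is not None:
            return cover | {chosen}
    return None


def maximum_clique(adjacency):
    """Maximum clique of G via a minimum vertex cover of its complement."""
    co_graph = complement(adjacency)
    for k in range(len(adjacency) + 1):  # first feasible k is the minimum
        cover = vertex_cover(co_graph, k)
        if cover is not None:
            return set(adjacency) - cover


# Toy graph: vertices 1-4 are pairwise adjacent; vertex 5 touches only 4.
adjacency = {
    1: {2, 3, 4},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {1, 2, 3, 5},
    5: {4},
}
print(maximum_clique(adjacency))  # {1, 2, 3, 4}: cover {5} in the complement
```

Here $n = 5$ and the minimum cover of the complement has size $k = 1$, so the clique has size $n - k = 4$.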


Kernelization, Branching, Parallelization and Load Balancing. The initial goal is to reduce an arbitrary input instance down to a relatively small computational kernel, and then to decompose it so that an efficient, systematic search can be conducted. Attaining a kernel whose size is quadratic in $k$ is relatively straightforward. Ensuring a kernel whose size is linear in $k$ has until recently required much more powerful and considerably slower methods that rely on linear programming relaxation [20, 21]. We have recently introduced and analyzed a new technique, termed crown reduction, which produces linear-sized crowns much faster. See [22, 23]. The problem next becomes one of exploring the kernel efficiently. A branching process is carried out using a binary search tree. Internal nodes represent choices; leaves denote candidate solutions. Subtrees spawned off at each level can be explored in parallel. The best results have generally been obtained with minimal intervention, in the extreme case launching secure shells (SSHs) [24]. To maintain scalability as datasets grow in size and as more machines are brought on line, some form of dynamic load balancing is generally required. We have implemented such a scheme using sockets and process-independent forking. Results on 32-64 processors in the context of motif discovery are reported in [25]. Large-scale testing using immense genomic and proteomic datasets is reported in [26].
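As an illustration of why a kernel quadratic in $k$ is straightforward to obtain, the sketch below applies the classical high-degree rule, usually attributed to Buss: a vertex with more neighbours than the remaining budget must belong to every sufficiently small cover. This is not crown reduction, and it is not claimed to be the kernelization used in our codes. The name `high_degree_kernel` and the toy graphs are hypothetical.

```python
def high_degree_kernel(adjacency, k):
    """Reduce a vertex cover instance with the high-degree rule.

    Returns (kernel, budget, forced), where `forced` holds vertices that
    must be in any cover of size <= k and `budget` is what remains of k,
    or None when no vertex cover of size <= k can exist.
    """
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}
    forced = set()
    budget = k
    reduced = True
    while reduced:
        reduced = False
        for v in list(graph):
            if len(graph[v]) > budget:
                if budget == 0:
                    return None  # an edge remains but nothing may be added
                # Leaving v out would force all of its neighbours in,
                # which exceeds the budget, so v itself is forced.
                for u in graph.pop(v):
                    graph[u].discard(v)
                forced.add(v)
                budget -= 1
                reduced = True
    # Isolated vertices cover nothing and can be dropped.
    kernel = {v: nbrs for v, nbrs in graph.items() if nbrs}
    edges = sum(len(nbrs) for nbrs in kernel.values()) // 2
    if edges > budget * budget:
        return None  # max degree <= budget, so a cover handles <= budget**2 edges
    return kernel, budget, forced


# Toy instance: a star with centre "c" and five leaves, plus the edge x-y.
star = {
    "c": {"l1", "l2", "l3", "l4", "l5"},
    "l1": {"c"}, "l2": {"c"}, "l3": {"c"}, "l4": {"c"}, "l5": {"c"},
    "x": {"y"}, "y": {"x"},
}
print(high_degree_kernel(star, k=2))

# A complete graph on four vertices has no cover of size 1.
k4 = {v: {1, 2, 3, 4} - {v} for v in (1, 2, 3, 4)}
print(high_degree_kernel(k4, k=1))
```

After the rule stops firing, every remaining vertex has degree at most the budget, so a yes-instance retains at most budget squared edges. The branching search then only has to explore this kernel.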

Sample Results. We are now able to solve real, non-synthetic instances of clique on graphs whose vertices number in the thousands. (Just imagine a straightforward $O(n^k)$ algorithm on problems of that size!) A good example is provided with Mus musculus neurogenetic microarray data from Dr. Williams’ lab. Here we face a graph with 12,422 vertices (probe-sets). With expression values normalized to [0,1] and the threshold set at 0.5, the clique we returned (via vertex cover) denoted a set of 369 genes that appear experimentally to be co-regulated. This took a few days to solve even with our best current methods. Yet solving it at all was probably unthinkable just a short time ago. After iterating across several threshold choices, a value of 0.85 was selected for detailed study. For this graph, $G$, the maximum clique size is 17. Because it is difficult to visualize $G$, we employ a clique intersection graph, $C_G$, as follows. Each maximal clique of size 15 or more in $G$ is represented by a vertex in $C_G$. An edge connects a pair of vertices in $C_G$ if and only if the intersection of the corresponding cliques in $G$ contains at least 13 members. $C_G$ is depicted in the accompanying figure, with vertices representing cliques of size 15 (in green), cliques of size 16 (in black) and cliques of size 17 (in red). One rather surprising result is that the gene found most often across large maximal cliques is Veli3 (aka Lin7c). This appears not to be due to some so-called “housekeeping” function, but instead because the relatively unstudied Veli3 is in fact central to neurological function [27, 28].





Figure 2. A clique intersection graph for a large microarray dataset.
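The construction of the clique intersection graph, and the count of how often each gene appears across large maximal cliques, can be sketched as below. The function names `clique_intersection_graph` and `membership_counts`, the gene labels, and the small cutoffs are hypothetical; the study described above used a minimum clique size of 15 and a minimum overlap of 13.

```python
from collections import Counter
from itertools import combinations


def clique_intersection_graph(cliques, min_size, min_overlap):
    """Build the clique intersection graph from maximal cliques of G.

    Its vertices are the cliques with at least `min_size` members; two
    are adjacent iff they share at least `min_overlap` members.
    """
    # dict.fromkeys drops repeated cliques while keeping input order.
    nodes = list(dict.fromkeys(
        frozenset(c) for c in cliques if len(c) >= min_size))
    graph = {c: set() for c in nodes}
    for a, b in combinations(nodes, 2):
        if len(a & b) >= min_overlap:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def membership_counts(cliques):
    """Count how many of the given cliques each gene belongs to."""
    return Counter(gene for clique in cliques for gene in clique)


# Toy, made-up maximal cliques (not results from the project).
cliques = [
    {"g1", "g2", "g3", "g4"},
    {"g1", "g2", "g3", "g5"},
    {"g1", "g6", "g7"},
    {"g8", "g9"},
]
c_g = clique_intersection_graph(cliques, min_size=3, min_overlap=3)
counts = membership_counts(c_g)  # iterate over the retained cliques only
print(len(c_g), counts.most_common(1))
```

In this toy input the first two cliques share three genes and are therefore adjacent, while `g1` is the gene that appears in the most retained cliques.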

Provisions for the Research Community at Large. We operate in an open source mode. Our software is freely available to others in the research community. Many of our codes are installed now in ClustalXP, a high-performance, parallel version of the well-known ClustalW package. In these, vertex cover is used to resolve conflict graphs encountered in phylogenetic applications. See [29]. Other codes we have released are from our papers accepted at the 2003 and 2004 CAMDA competitions. These are designed to employ algorithms for clique, dominating set and other combinatorial problems for use in disease prediction and screening. See [30, 31].


References


[1] N. E. Baldwin, E. J. Chesler, S. Kirov, M. A. Langston, J. R. Snoddy, R. W. Williams, and B. Zhang, Computational, Integrative and Comparative Methods for the Elucidation of Genetic Co-Expression Networks, Journal of Biomedicine and Biotechnology, 2004, accepted for publication.

[2] L. L. E. J. Chesler, J. Wang, R. W. Williams, and K. F. Manly, WebQTL: rapid exploratory analysis of gene expression and genetic networks for brain and behavior, Nature Neuroscience, vol. 7, 2004, 485-486.

[3] U. Alon, Biological networks: the tinkerer as an engineer, Science, vol. 301, 2003, 1866-1867.

[4] A.-L. Barabási and Z. N. Oltvai, Network biology: Understanding the cell’s functional organization, Nature Reviews Genetics, vol. 5, 2004, 101-113.

[5] J. C. Setubal and J. Meidanis, Introduction to Computational Molecular Biology (Boston: PWS Publishing Company, 1997).

[6] I. Bomze, M. Budinich, P. Pardalos, and M. Pelillo, The maximum clique problem, in Handbook of Combinatorial Optimization, vol. 4, D.-Z. Du and P. M. Pardalos, Eds. (Kluwer Academic Publishers, 1999).

[7] P. Hansen and B. Jaumard, Cluster analysis and mathematical programming, Mathematical Programming, vol. 79, 1997, 191-215.



[8] A. Bellaachia, D. Portnoy, Y. Chen, and A. G. Elkahloun, E-CAST: A data mining algorithm for gene expression data, Proceedings, Workshop on Data Mining in Bioinformatics, 2002, 49-54.

[9] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini, Tissue classification with gene expression profiles, Journal of Computational Biology, 2000, 54-64.

[10] A. Ben-Dor, R. Shamir, and Z. Yakhini, Clustering gene expression patterns, Journal of Computational Biology, vol. 6, 1999, 281-297.

[11] E. Hartuv, A. Schmitt, J. Lange, S. Meier-Ewert, H. Lehrachs, and R. Shamir, An algorithm for clustering cDNAs for gene expression analysis, Proceedings, RECOMB, 1999, 188-197.

[12] O. Alter, P. O. Brown, and D. Botstein, Singular value decomposition for genome-wide expression data processing and modeling, Proceedings of the National Academy of Sciences, 2000, 10101-10106.

[13] M. Girolami and R. Breitling, Biologically valid linear factor models of gene expression, Bioinformatics, 2004, to appear.

[14] U. Feige, D. Peleg, and G. Kortsarz, The dense k-subgraph problem, Algorithmica, vol. 29, 2001, 410-421.

[15] D. J. Watts and S. H. Strogatz, Collective dynamics of ‘small-world’ networks, Nature, vol. 393, 1998, 440-442.

[16] J. Rougemont and P. Hingamp, DNA microarray data and contextual analysis of correlation graphs, BMC Bioinformatics, vol. 4, 2003.

[17] M. R. Fellows and M. A. Langston, Nonconstructive Tools for Proving Polynomial-Time Decidability, Journal of the ACM, vol. 35, 1988, 727-739.

[18] M. R. Fellows and M. A. Langston, On Search, Decision and the Efficiency of Polynomial-Time Algorithms, Journal of Computer and Systems Sciences, vol. 49, 1994, 769-779.

[19] R. G. Downey and M. R. Fellows, Parameterized complexity (New York: Springer, 1999).

[20] S. Khuller, The Vertex Cover Problem, ACM SIGACT News, vol. 33, June, 2002, 31-33.

[21] G. L. Nemhauser and L. E. Trotter, Vertex Packings: Structural Properties and Algorithms, Mathematical Programming, vol. 8, 1975, 232-248.

[22] F. N. Abu-Khzam, R. L. Collins, M. R. Fellows, M. A. Langston, W. H. Suters, and C. T. Symons, Kernelization Algorithms for the Vertex Cover Problem: Theory and Experiments, Proceedings, Workshop on Algorithm Engineering and Experiments (ALENEX), 2004.

[23] F. N. Abu-Khzam, M. A. Langston, and W. H. Suters, Effective Vertex Cover Kernelization: A Tale of Two Algorithms, Proceedings, ACS/IEEE International Conference on Computer Systems and Applications, 2005, accepted for publication.

[24] F. N. Abu-Khzam, M. A. Langston, and P. Shanbhag, Scalable Parallel Algorithms for Difficult Combinatorial Problems: A Case Study in Optimization, Proceedings, International Conference on Parallel and Distributed Computing and Systems, 2003.

[25] N. E. Baldwin, R. L. Collins, M. A. Langston, M. R. Leuze, C. T. Symons, and B. H. Voy, High Performance Computational Tools for Motif Discovery, Proceedings, IEEE International Workshop on High Performance Computational Biology (HiCOMB), 2004.

[26] F. N. Abu-Khzam, M. A. Langston, P. Shanbhag, and C. T. Symons, Scalable Parallel Algorithms for FPT Problems, University of Tennessee CS Technical Report, http://www.cs.utk.edu/~library/TechReports/2004/ut-cs-04-524.pdf, 2004.



[27] G. A. C. Becamel, N. Galeotti, E. Demey, P. Jouin, C. Ullmer, A. Dumuis, J. Bockaert, and a. P. Marin, Synaptic multiprotein complexes associated with 5-HT(2C) receptors: a proteomic approach, EMBO Journal, vol. 21, 2002, 2332-2342.

[28] M. O. S. Butz, and T. C. Sudhof, A tripartite protein complex with the potential to couple synaptic vesicle exocytosis to cell adhesion in brain, Cell, vol. 94, 1998, 773-782.

[29] F. N. Abu-Khzam, F. Cheetham, F. Dehne, M. A. Langston, S. Pitre, A. Rau-Chaplin, P. Shanbhag, and P. J. Taillon, ClustalXP, http://ClustalXP.cgmlab.org/.

[30] M. A. Langston, L. Lan, X. Peng, N. E. Baldwin, C. T. Symons, B. Zhang, and J. R. Snoddy, A Combinatorial Approach to the Analysis of Differential Gene Expression Data: The Use of Graph Algorithms for Disease Prediction and Screening, in Methods of Microarray Data Analysis IV, Papers from CAMDA '03, K. F. Johnson and S. M. Lin, Eds. (Boston: Kluwer Academic Publishers, 2004), to appear.

[31] X. Peng, M. A. Langston, A. Saxton, N. E. Baldwin, and J. R. Snoddy, Detecting Network Motifs in Gene Co-expression Networks, Proceedings, International Conference on the Critical Assessment of Microarray Data Analysis, 2004.