Prediction of Protein Interactions, Complexes, and Function


School of Computing

Research Report 2009/2010







COMPUTATIONAL BIOLOGY

Present-day biomedical researchers are confronted by vast amounts of data from genome sequencing; proteomics; microscopy; high-throughput analytical techniques for DNA, RNA, and proteins; and a host of other new experimental technologies. Coupled with advances in computing power, this flow of information enables scientists to computationally model and analyze biological systems in novel ways.


Accordingly, the Computational Biology Lab in SOC works towards fundamental advances in knowledge discovery, database management, combinatorial algorithms, and modeling and simulation, as well as in the applications of these technologies to problems in biology and medicine. Some of our research projects and activities are described below.



SIMULATION AND ANALYSIS OF PROTEIN MOTION


Protein motion and conformational flexibility play a critical role in vital biological functions, such as immune protection and enzymatic catalysis. An example is the “flap” motion of HIV protease, a major inhibitory drug target for AIDS therapy. The flaps, located near the reactive site of HIV protease, must open to allow a ligand to bind and then close to form direct contacts with the ligand. Proteins rely on such motion to perform their functions, and it therefore provides an important link between structure and function, a central relationship in molecular biology.


We envision that future protein databases will contain not only static structures of proteins, as the Protein Data Bank (PDB) does currently, but also the motion of proteins. Users can submit queries on protein motions to study the relationship between structure and function, just as they submit queries on sequences and static structures today. Such information will greatly enhance our ability to understand and predict ligand-protein and protein-protein interactions during processes such as pharmaceutical drug design.

Towards this goal, we are developing geometric and probabilistic computation tools to explore, analyze, and model protein motion and conformational flexibility.


Due to noise in data, determining salient conformational changes accurately and efficiently is a challenging problem. We developed an algorithm and related software called pFlexAna for detection of protein conformational changes from experimental data obtained through, e.g., X-ray crystallography. A key element of pFlexAna is a statistical test that determines the similarity of two protein structures in the presence of noise. Using data from the Protein Data Bank and the Macromolecular Movements Database, we tested the algorithm on proteins that exhibit a range of different conformational changes. Results show that our algorithm can reliably detect salient conformational changes, including well-known examples such as hinge and shear.




Molecular dynamics simulation is a well-established method for studying protein motion at the atomic scale. However, it is computationally intensive and generates massive amounts of data, the sheer size of which often becomes an obstacle to biological insights. We proposed using Markov models with hidden states to construct simplified models of protein motion at long timescales, as many important kinetic and dynamic properties of proteins ultimately depend on such motions. The Markovian states in such models represent potentially overlapping probability distributions over a protein’s conformation space. Our method produced a 3-state model that was as good as, and yet simpler than, a widely accepted 6-state model for predicting the long-timescale motions of alanine dipeptide. We also used the constructed Markov models to estimate important kinetic and dynamic quantities for protein folding, in particular, mean first-passage time.
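As an illustration of how a kinetic quantity such as mean first-passage time can be read off a Markov model, the sketch below solves the standard linear system t_i = 1 + sum_j P_ij t_j over the non-target states, with t fixed to 0 at the target. It assumes a plain discrete-time Markov chain with fully observed states, which is simpler than the hidden-state models described above. The function name and the 3-state transition matrix are hypothetical and are not the alanine dipeptide model.

```python
def mean_first_passage_times(P, target):
    """Expected number of steps to first reach `target` from each state.

    P is a row-stochastic transition matrix given as a list of lists.
    Solves (I - Q) t = 1, where Q is P restricted to the non-target states.
    """
    n = len(P)
    states = [i for i in range(n) if i != target]
    m = len(states)
    # Augmented matrix [I - Q | 1]
    A = [
        [(1.0 if r == c else 0.0) - P[states[r]][states[c]] for c in range(m)] + [1.0]
        for r in range(m)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular system; the target may be unreachable")
        A[col], A[pivot] = A[pivot], A[col]
        pv = A[col][col]
        A[col] = [x / pv for x in A[col]]
        for r in range(m):
            if r != col and A[r][col] != 0.0:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    t = [0.0] * n
    for r, s in enumerate(states):
        t[s] = A[r][m]
    return t


# Hypothetical 3-state chain (illustrative numbers only)
P_EXAMPLE = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.1, 0.9],
]
MFPT_TO_STATE_2 = mean_first_passage_times(P_EXAMPLE, target=2)
```

For this toy chain the expected number of steps to reach state 2 is 30 from state 0 and 20 from state 1; multiplying by the model's time step would convert steps into physical time.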


Representative publications:

1. A. Nigham, D. Hsu. Protein Conformational Flexibility Analysis with Noisy Data. Proceedings of 11th Annual International Conference on Research in Computational Molecular Biology (RECOMB), pages 396-411, San Francisco, CA, April 2007.

2. T.-H. Chiang, M. S. Apaydin, D. L. Brutlag, D. Hsu, and J.-C. Latombe. Using Stochastic Roadmap Simulation to Predict Experimental Quantities in Protein Folding Kinetics: Folding Rates and Phi-Value. Journal of Computational Biology, 14(5):578-593, June 2007.

3. A. Nigham, L. Tucker-Kellogg, I. Mihalek, C. Verma, D. Hsu. pFlexAna: Detecting conformational changes in remotely related proteins. Nucleic Acids Research, 36(Web Server issue):W246-W251, July 2008.

4. A. Nigham and D. Hsu. Protein conformational flexibility analysis with noisy data. Journal of Computational Biology, 15(7):813-828, September 2008.

5. T.-H. Chiang, D. Hsu, J.-C. Latombe. Markov dynamic models for long-timescale protein motion. Proceedings of 18th International Conference on Intelligent Systems for Molecular Biology (ISMB), Boston, Mass, July 2010. In press.


PREDICTION OF PROTEIN INTERACTIONS, COMPLEXES, AND FUNCTION


Progress in high-throughput experimental techniques in the past decade has resulted in a rapid accumulation of protein-protein interaction (PPI) data. However, interaction data obtained by popular high-throughput assays such as yeast two-hybrid experiments contain as much as 50% false positives and false negatives. Furthermore, the PPI networks resulting from these assays are still essentially an in vitro scaffold. Further progress in computational analysis techniques and experimental methods is needed to reliably deduce in vivo protein interactions, to distinguish between permanent and transient interactions, to distinguish direct protein binding from membership in the same protein complex, and to distinguish protein complexes from functional modules.


We aim to advance computational techniques for: (a) assessing the reliability of PPIs reported in high-throughput assays; (b) identifying false-negative PPIs; (c) deducing new protein complexes from high-throughput assays; and (d) inferring protein function for previously unannotated proteins.


To deal with noise in PPI networks produced by high-throughput assays, we developed the “iterated CD distance” method for assessing the reliability of a PPI, given a PPI network derived from high-throughput protein interaction experiments. The method estimated the likelihood of the interaction of a pair of proteins by applying expectation maximization on their shared neighbourhood in the PPI network. The new PPIs predicted by this method were superior to those made by other PPI-network-based methods (e.g., 30-50% better correlated with localization coherence). Moreover, we showed that protein complex prediction methods would benefit significantly from having their input PPI networks cleansed by this method.
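The shared-neighbourhood idea can be illustrated with the basic Czekanowski-Dice (CD) distance between two proteins. The sketch below computes only this non-iterated distance on a toy network, under the assumption that each protein counts as a member of its own neighbourhood. The iteration and reweighting steps of the method described above are not reproduced, and the function names and toy edges are hypothetical.

```python
def build_neighbourhoods(edges):
    """Map each protein to the set of its interaction partners plus itself."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, {a}).add(b)
        nbrs.setdefault(b, {b}).add(a)
    return nbrs


def cd_distance(nbrs, u, v):
    """Czekanowski-Dice distance: 0 for identical neighbourhoods, 1 for disjoint ones."""
    nu, nv = nbrs[u], nbrs[v]
    return len(nu ^ nv) / (len(nu | nv) + len(nu & nv))


# Toy PPI network (illustrative only)
TOY_EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("C", "E")]
TOY_NBRS = build_neighbourhoods(TOY_EDGES)
```

A small distance means that two proteins share most of their partners, which is taken as evidence that a reported interaction between them is reliable; a distance near 1 flags a pair with almost no common neighbourhood.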


In order to build on the numerous previous works on PPI prediction, we further proposed a probabilistic framework to integrate multiple types of information to predict PPIs. A variety of existing PPI prediction techniques are integrated, including domain-domain interactions, interaction motifs, paralogous interactions, and protein function similarity. We also used closed itemset mining to identify domain combination pairs associated with interacting proteins to derive complex interaction rules that are not covered by the existing approaches. Evaluation demonstrated that integrating multiple predictions from the different approaches using our framework significantly outperformed any individual prediction. We participated in the DREAM2 protein-protein subnetwork prediction challenge, and our entry outperformed those of other participants by a clear margin.


The large number of PPIs produced by high-throughput assays makes it possible to predict protein complexes from PPI networks. We developed an algorithm called CMC (clustering based on maximal cliques) to discover protein complexes from PPI networks weighted by our iterated CD distance. CMC first generates all the maximal cliques from the PPI networks, and then removes or merges highly overlapped clusters based on their interconnectivity. Tests showed that CMC is an effective approach to protein complex prediction from protein interaction networks, achieving higher precision while retaining recall at levels comparable to previous methods.
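The two stages described above can be sketched as follows. Maximal cliques are enumerated with the standard Bron-Kerbosch algorithm with pivoting. The merge step is a deliberately simplified stand-in: it folds a smaller cluster into a larger one when they share at least a given fraction of the smaller cluster's members, and it ignores edge weights and the interconnectivity score that CMC uses to choose between removing and merging. All function names, the threshold, and the toy network are hypothetical.

```python
def build_adjacency(edges):
    """Undirected adjacency map: node -> set of neighbours."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques (Bron-Kerbosch with pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the node with the most neighbours among the candidates
        pivot = max(p | x, key=lambda node: len(adj[node] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def merge_overlapping(cliques, overlap_threshold=0.5):
    """Greedy simplification: merge a cluster into an earlier, larger one
    when they share at least `overlap_threshold` of the smaller cluster."""
    clusters = []
    for c in sorted(cliques, key=lambda c: (-len(c), sorted(c))):
        for i, kept in enumerate(clusters):
            if len(c & kept) / len(c) >= overlap_threshold:
                clusters[i] = kept | c
                break
        else:
            clusters.append(set(c))
    return clusters


# Toy network: two overlapping triangles plus one separate interacting pair
TOY_ADJ = build_adjacency(
    [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("E", "F")]
)
TOY_CLUSTERS = merge_overlapping(maximal_cliques(TOY_ADJ))
```

On the toy network the cliques {A, B, C} and {B, C, D} share two of three members and are merged into one predicted complex, while {E, F} stays separate.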


Besides cleaning PPI networks and predicting protein complexes, we also tried to discover motifs of protein interaction sites. We developed a fast heuristic algorithm to predict interaction motif pairs from a set of protein sequences based on a PPI network. Existing algorithms could take days to process a set of 5000 protein sequences with about 20,000 interactions. Our algorithm was able to locate the correct motif pair in such a dataset (the yeast interactome) in 45 minutes. Moreover, we derived a lower-bound result for the p-value of a motif pair in order for it to be distinguishable from random motif pairs.


Lastly, we developed a scalable and efficient function prediction framework, called IWA (Integrated Weighted Averaging), that integrates diverse information using simple weighting strategies. The simplicity of the approach makes it possible to make predictions based on on-the-fly information fusion. In addition to its great efficiency, IWA performs exceptionally well against existing approaches. In the presence of cross-genome information, IWA makes even better protein function predictions.


Representative publications

1. H. N. Chua, W.-K. Sung, L. Wong. An efficient strategy for extensive integration of diverse biological data for protein function prediction. Bioinformatics, 23(24):3364-3373, December 2007.

2. H. N. Chua, L. Wong. Increasing the Reliability of Protein Interactomes. Drug Discovery Today, 13(15/16):652-658, August 2008.

3. H. N. Chua, H. Willy, G. Liu, X. Li, L. Wong, S.-K. Ng. A Probabilistic Graph-Theoretic Approach to Integrate Multiple Predictions for the Protein-Protein Subnetwork Prediction Challenge. Annals of New York Academy of Sciences, 1158:224-233, March 2009.

4. G. Liu, L. Wong, H. N. Chua. Complex Discovery from Weighted PPI Networks. Bioinformatics, 25(15):1891-1897, August 2009.

5. H. C. M. Leung, M. H. Siu, S. M. Yiu, F. Y. L. Chin, W. K. Sung. Clustering-Based Approach for Predicting Motif Pairs from Protein Interaction Data. Journal of Bioinformatics and Computational Biology, 7(4):701-716, August 2009.





IDENTIFYING REGULATORY SIGNALS IN GENOMES


Interactions between macromolecules play many essential roles (e.g., metabolic reactions and signal transduction) and occur in many combinations, such as protein-protein, protein-DNA, and protein-RNA. Protein interactions with DNA and RNA are the primary mechanisms for controlling gene expression. What is needed is a recognition code that maps from the protein sequence to a pattern that describes the family of DNA binding sites, the functional elements. Identification of functional elements in the human genome is fundamental to our understanding of cell functions: how these codes orchestrate the complex network of gene transcription, the transcriptome, and interactions in distinct locations.


We aim to: (a) develop methods for accurate identification of transcription factor binding sites and other regulatory sites; and (b) develop methods for inferring the interactions of transcription factors and other functional elements.


The performance of current de novo motif finders is far from satisfactory, and no motif finder performs consistently well over all datasets. We made several key observations on how to utilize the results from individual motif finders and designed a novel ensemble method, MotifVoter, to predict the motifs and binding sites. In terms of sensitivity and precision, MotifVoter significantly outperformed stand-alone motif finders and other ensemble methods on Tompa’s benchmark, Escherichia coli, and ChIP-chip datasets.


We also developed an algorithm called LocalMotif to detect nucleotide motifs that are localized with respect to a given biological landmark. In LocalMotif, a novel score function was combined with other scoring measures including Z-score and relative entropy to detect the motif. The approach successfully discovered biologically relevant motifs and their intervals of localization in scenarios where the motifs could not be discovered by general motif finding tools. It was useful for discovering multiple co-localized motifs in a set of regulatory sequences, such as those identified by ChIP-seq.


ChIP-seq is becoming the main approach to the genome-wide study of protein-DNA interactions and histone modifications. However, it is difficult to distinguish weak ChIP signals from background noise. We proposed a linear signal-noise model, in which a noise rate was introduced to represent the fraction of noise in a ChIP library. We then developed an iterative algorithm to estimate the noise rate using a control library, and derived a library-swapping strategy for false discovery rate estimation. These approaches were integrated in a general-purpose framework, named CCAT (Control-based ChIP-seq Analysis Tool), for the significance analysis of ChIP-seq. CCAT predicted significantly more ChIP-enriched sites than the previous methods.


Among different epigenetic modifications, the differential histone modification sites (DHMSs) are of great interest for studying the dynamic nature of epigenetic and gene expression regulation among cell types, stages, and environmental responses. We proposed an approach called ChIPDiff for the genome-wide comparison of histone modification sites identified by ChIP-seq. ChIPDiff used a hidden Markov model (HMM) to infer the states of histone modification changes at each genomic location. We evaluated the performance of ChIPDiff by comparing the H3K27me3 modification sites between mouse embryonic stem cells (ESC) and neural progenitor cells (NPC). We demonstrated that the H3K27me3 DHMSs identified were of high sensitivity, specificity, and technical reproducibility. ChIPDiff was further applied to uncover the differential H3K4me3 and H3K36me3 sites between different cell states. Interesting biological discoveries were achieved from such comparisons in our study.


Lastly, we developed a computational model of promoters of human histone genes, and used the model to identify regions across the human genome having a similar structure to promoters of histone genes. Such regions were predicted as potential genomic regulatory regions of genes that might be co-regulated with histone genes. Our study represented one of the most comprehensive computational analyses conducted thus far on a genome-wide scale of promoters of human histone genes. The study suggested a number of other human genes that share a high similarity of promoter structure with the histone genes and thus are highly likely to be co-regulated, and consequently co-expressed, with the histone genes. We also found that there are a large number of intergenic regions across the genome with structures similar to promoters of histone genes. These regions may be promoters of yet unidentified genes, or may represent remote control regions that participate in regulation of histone and histone-coregulated gene transcription initiation.

Representative publications

1. H. Xu, C. L. Wei, F. Lin, W. K. Sung. An HMM approach to genome-wide identification of differential histone modification sites from ChIP-seq data. Bioinformatics, 24(20):2344-2349, October 2008.

2. E. Wijaya, S. M. Yiu, N. T. Son, R. Kanagasabai, W. K. Sung. MotifVoter: A novel ensemble method for fine-grained integration of generic motif finders. Bioinformatics, 24(20):2288-2295, October 2008.

3. W. H. Lee, V. Narang, H. Xu, F. Lin, K. C. Chin, W. K. Sung. DREAM2 Challenge: Integrated Multi-Array Supervised Learning Algorithm for BCL6 Transcriptional Targets Prediction. Annals of New York Academy of Sciences, 1158:196-204, March 2009.

4. G. Li, M. J. Fullwood, F. H. Mulawadi, S. Velkov, H. Xu, V. B. Vega, P. N. Ariyaratne, Y. Bin Mohamed, H.-S. Ooi, C. Tennakoon, C.-L. Wei, Y. Ruan, W.-K. Sung. ChIA-PET tool for comprehensive chromatin interaction analysis with paired-end tag sequencing. Genome Biology, 11(2):R22, February 2010.

5. V. Narang, A. Mittal, W. K. Sung. Localized Motif Discovery in Gene Regulatory Sequences. Bioinformatics, 26(9):1152-1159, May 2010.

6. H. Xu, L. Handoko, X. Wei, C. Ye, J. Sheng, C. L. Wei, F. Lin, W. K. Sung. A Signal-Noise Model for Significance Analysis of ChIP-seq with Negative Control. Bioinformatics, 26(9):1199-1204, May 2010.



MANAGEMENT AND ANALYSIS OF GENE EXPRESSION DATA


The development of microarray technology has made possible the simultaneous monitoring of the expression of thousands of genes. This development offers great opportunities in advancing the diagnosis of diseases, the treatment of diseases, and the understanding of gene functions. However, existing works fall short on several issues: they provide little information on the interplay between selected genes; the collection of pathways that can be used, evaluated, and ranked against the observed expression data is limited; and a comprehensive set of rules for reasoning about relevant molecular events has not been compiled and formalized.


Our main goals are to: (a) develop techniques to extract known pathways from multiple sources; (b) develop techniques to derive gene regulatory networks from gene expression data; (c) develop technologies for the design of microarrays; and (d) develop tools for optimization of disease treatment based on gene expression profiles.



In an era of increasingly complex biological datasets, one of the key steps in gene functional analysis comes from clustering genes based on co-expression. Biclustering algorithms can identify gene clusters with local co-expressed patterns, which are more likely to define genes functioning together than those found by global clustering methods. However, these algorithms are not effective in uncovering gene regulatory networks because the mined biclusters lack genes that may be critical in the function but may not be co-expressed with the clustered genes. We introduced a biclustering method called SKeleton Biclustering (SKB), which builds high-quality biclusters from microarray data, creates relationships among the biclustered genes based on Gene Ontology annotations, and identifies genes that are missing in the biclusters. SKB thus defines inter-bicluster and intra-bicluster functional relationships. The delineation of functional relationships and incorporation of such missing genes may help biologists to discover biological processes that are important in a given study and provide clues for how the processes may be functioning together.



A second issue in the clustering of gene expression data is the large number of genes: as the number of dimensions in a dataset increases, distance measures used by many clustering algorithms become increasingly meaningless. We proposed to use a sliding-window approach to partition the dimensions in gene expression data to preserve significant clusters. We call this model the nCluster model. The sliding-window approach generates more bins than the grid-based approach, and thus incurs a higher mining cost. We developed a deterministic algorithm, called MaxnCluster, to mine nClusters efficiently. Our experiments showed that (a) the nCluster model could indeed preserve clusters that were shattered by the grid-based approach on synthetic datasets; (b) the nCluster model produced more significant clusters than the grid-based approach on two real gene expression datasets; and (c) MaxnCluster was efficient in mining maximal nClusters.
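The difference between grid-based and sliding-window partitioning can be seen on a single dimension. The sketch below is a one-dimensional illustration only, assuming a bin is any maximal group of values whose range is at most a chosen width. It is not the nCluster model or the MaxnCluster algorithm, and the function names and sample values are hypothetical.

```python
def grid_bins(values, width):
    """Fixed grid: each value falls into exactly one bin [k*width, (k+1)*width)."""
    bins = {}
    for v in values:
        bins.setdefault(int(v // width), []).append(v)
    return [sorted(b) for _, b in sorted(bins.items())]


def sliding_window_bins(values, width):
    """All maximal groups of values whose range (max - min) is at most `width`."""
    vals = sorted(values)
    groups = []
    end = 0
    prev_end = 0
    for start in range(len(vals)):
        while end < len(vals) and vals[end] - vals[start] <= width:
            end += 1
        # A window is maximal only if it reaches further right than every
        # window that started earlier; otherwise it is contained in one.
        if end > prev_end:
            groups.append(vals[start:end])
            prev_end = end
    return groups


# Illustrative expression levels along one dimension
SAMPLE_VALUES = [1, 5, 9, 13, 17]
GRID = grid_bins(SAMPLE_VALUES, 10)
WINDOWS = sliding_window_bins(SAMPLE_VALUES, 10)
```

Here the grid splits the close values 9 and 13 into different bins, while the sliding window keeps them together in overlapping bins, at the cost of producing three bins instead of two.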


Another issue is that existing 3D clustering algorithms on gene x sample x time expression data do not consider the time lags between correlated gene expression patterns. Besides, they either ignore the correlation on time subseries, or disregard the continuity of the time series, or only validate pure shifting or pure scaling coherent patterns instead of the general shifting-and-scaling patterns. We proposed an efficient algorithm, LagMiner, to identify time-lagged co-regulated gene clusters. The algorithm could identify interesting clusters satisfying the constraints of regulation, coherence, minimum gene number, minimum sample subspace size, and minimum time period length. Experiments showed that LagMiner was effective, scalable, and parameter-robust.


Representative publications

1. X. Han, W.-K. Sung, L. Feng. Identifying Differentially Expressed Genes in Time-Course Microarray Experiment without Replicate. Journal of Bioinformatics and Computational Biology, 5(2A):281-296, April 2007.

2. D. Soh, D. Dong, Y. Guo, L. Wong. Enabling More Sophisticated Gene Expression Analysis for Understanding Diseases and Optimizing Treatments. ACM SIGKDD Explorations, 9(1):3-14, June 2007.

3. J. Chen, L. Ji, W. Hsu, K.-L. Tan, S. Y. Rhee. Exploiting Domain Knowledge to Improve Biological Significance of Biclusters with Key Missing Genes. Proceedings of 25th International Conference on Data Engineering (ICDE 2009), pages 1219-1222, Shanghai, China, March 2009.

4. X. Xu, Y. Lu, K.-L. Tan, A. K. H. Tung. Finding Time-Lagged 3D Clusters. Proceedings of 25th International Conference on Data Engineering (ICDE 2009), pages 445-456, Shanghai, China, March 2009.

5. G. Liu, K. Sim, J. Li, L. Wong. Efficient Mining of Distance-Based Subspace Clusters. Statistical Analysis and Data Mining, 2(5):427-444, December 2009.



ANALYZING MASS SPECTRA FOR IDENTIFYING PROTEINS


Proteomics is useful for understanding the expression of proteins in cells at different levels, at different time points, and in different forms. Such an understanding is critical to drug discovery and medical advances. Mass spectrometers are the predominant tool to accomplish some of the primary goals of proteomics, for example, identification of proteins, determination of expression levels of proteins, and determination of post-translational modifications, sites, and types. Due to limitations in mass spectrometers and associated software tools, most of the MS/MS data generated by mass spectrometers are rejected because they are not interpretable by currently available software. Furthermore, the remaining data usually contain many false positives. In addition, the sensitivity and precision of mass spectrometers vary greatly.




Our goals are to: (a) develop efficient, accurate, and robust algorithms for protein identification from tandem mass spectra; and (b) develop efficient, accurate, and robust algorithms for alignment of noisy mass spectra.

Tandem mass spectrometry is one of the most commonly used techniques in peptide sequencing. We studied general issues in de novo sequencing, and proposed a general preprocessing scheme that performed binning, pseudo peak introduction, and noise removal. We also studied the anti-symmetry problem and assumptions related to it, and proposed a more realistic way to handle it. We integrated our findings on preprocessing and the anti-symmetry problem with some current models for peptide sequencing, resulting in improved accuracies for de novo peptide sequencing.


Another category of approaches for peptide identification consists of database search algorithms. They return peptide sequences that match the parent mass of the experimental spectrum via some scoring functions. They generally do not perform well for peptides with post-translational modifications (PTMs). We improved on our previous algorithm PepSOM by incorporating a de novo algorithm to generate multi-charge strong tags and introducing better scoring functions to score candidate peptides based on these tags. Experiments on spectra with simulated and real PTMs confirmed that our algorithm was fast and accurate for identifying PTMs.
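The parent-mass filtering step of a database search, and the reason a PTM defeats it, can be sketched as follows. The code computes monoisotopic peptide masses from standard residue masses and keeps database peptides whose mass matches the observed parent mass within a tolerance, optionally after adding a modification mass shift. It is a minimal illustration and not PepSOM: it does no tag generation or scoring, and the function names, tolerance, and three-peptide database are hypothetical.

```python
# Standard monoisotopic residue masses in daltons
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565         # one water is added for the peptide termini
PHOSPHORYLATION = 79.96633  # mass shift of a single phosphorylation


def peptide_mass(sequence):
    """Monoisotopic mass of an unmodified, neutral peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def candidates_by_parent_mass(database, parent_mass, tolerance=0.01, shifts=(0.0,)):
    """Return (peptide, shift) pairs whose mass plus shift matches the parent mass."""
    hits = []
    for peptide in database:
        base = peptide_mass(peptide)
        for shift in shifts:
            if abs(base + shift - parent_mass) <= tolerance:
                hits.append((peptide, shift))
    return hits


# Hypothetical three-peptide database
TOY_DB = ["GASP", "PEPTIDE", "SAGP"]
UNMODIFIED_HITS = candidates_by_parent_mass(TOY_DB, 330.154)
MISSED = candidates_by_parent_mass(TOY_DB, 410.120)
PTM_HITS = candidates_by_parent_mass(TOY_DB, 410.120, shifts=(0.0, PHOSPHORYLATION))
```

GASP and SAGP have the same composition and therefore the same parent mass, which is why a scoring function over the fragment peaks is still needed after the mass filter. A phosphorylated peptide is missed entirely unless the search allows for the corresponding mass shift.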

Representative publications

1. K. Ning, H. K. Ng, H. W. Leong. An Accurate and Efficient Algorithm for Peptide and PTM Identification by Tandem Mass Spectrometry. Proceedings of 18th International Conference on Genome Informatics (GIW), pages 119-130, Singapore, 3-5 December 2007.

2. K. Ning, N. Ye, H. W. Leong. On preprocessing and anti-symmetry in de novo peptide sequencing: Improving efficiency and accuracy. Journal of Bioinformatics and Computational Biology, 6(3):467-492, June 2008.




BIOINFORMATICS FOR VIROLOGY RESEARCH



Viruses have the potential to spread rapidly and infect a large proportion of the human population. They constitute one of the most serious health threats to the human population. High-throughput sequencing technologies have brought about an explosion in the number of complete genomes of viruses. Yet despite this wealth of viral genome sequence data, these data have been largely used for evolution and epidemiology studies with little direct clinical application. So it is timely to use these data to develop technologies and bioinformatics tools that have a greater impact on clinical decision making.


We aim to address important issues in (a) detection of viruses, (b) resequencing of viruses, and (c)
recombination breakpoint analysis of viruses.


Random-tagged primers are a novel technique for amplification of unknown virus sequences. However, blindly using such primers without checking their suitability on the target genome does not guarantee genome-wide amplification of the virus in the patient sample. We developed a model and an associated algorithm (LOMA) to predict the amplification efficiency of random-tagged primers. LOMA was much more sensitive and efficient than previous methods in generating random-tagged primers; the resulting coverage by LOMA approached 90%.




We further developed a statistics-based pathogen detection algorithm to be used with DNA microarrays. This algorithm (PDA) analyzed the distribution of probe signal intensities relative to in silico r-signatures (recognition signature probe sets) based on a novel weighted Kullback-Leibler divergence that was more sensitive to the tail of the distribution. This algorithm was able to detect and identify the presence of multiple viruses. In our validation experiments, it even correctly detected viruses in the test samples that were initially missed by gold-standard RT-PCR assays.


Resequencing microarrays have been used to obtain whole-genome primary sequences for orthopoxviruses, biothreat viruses, etc. The reported studies mainly use platform-accompanying software that employs probabilistic base-calling algorithms. However, these methods are susceptible to hybridization noise caused by factors such as poor probe quality, poor amplification, or mutations. We developed a novel resequencing microarray that is able to accommodate mutation hotspots and an algorithm to resolve multiple mutations in probe sequence in the microarray. The algorithm (EvolSTAR) exploits the characteristic profile of mutations from wild-type sequence calls. Validation experiments showed it performed far better than popular competing methods.


Recombination detection is important before inferring phylogenetic relationships. This will lead to a better understanding of pathogen evolution, more accurate genotyping, and advancements in vaccine development. We developed an algorithm (RB-Finder) to detect recombination breakpoints. It used the novel idea of longest common subsequences (i.e., contigs) assembled from (k,m)-mers as the unit of similarity measurement. This definition of contigs was not the usual definition, but it was the right one to use in this context as it enforced an even distribution of mismatches in the contigs. This simple but intuitive idea overcame the inaccuracy of earlier methods that used base-by-base comparison. In particular, it was able to distinguish regions of high mutation rates from recombination breakpoints, which previous methods were unable to do.


Furthermore, we achieved strong practical impact in these works. For instance, almost all H1N1 Singapore strains have been sequenced using our approach. Presently, Mexico is also trying our approach.

Representative publications

1. C. Wong, C. W. H. Lee, W. Y. Leong, S. Soh, C. Kartasasmita, E. Simoes, M. Hibberd, W.-K. Sung, L. Miller. Optimization and clinical validation of a pathogen detection microarray. Genome Biology, 8(5):R93, 2007.

2. W. H. Lee, W. K. Sung. RB-Finder: An Improved Distance-Based Sliding Window Method to Detect Recombination Breakpoints. Journal of Computational Biology, 15(7):881-898, September 2008.

3. W. H. Lee, C. W. Wong, W. Y. Leong, L. D. Miller, W. K. Sung. LOMA: A fast method to generate efficient tagged-random primers despite amplification bias of random PCR on pathogens. BMC Bioinformatics, 9:368, September 2008.

4. W. H. Lee, C. W. Koh, Y. S. Chan, P. P. K. Aw, K. H. Loh, B. L. Han, P. L. Thien, G. Y. W. Nai, M. Hibberd, C. Wong, W.-K. Sung. Large-Scale Evolutionary Surveillance of the 2009 H1N1 Influenza A Virus Using Resequencing Arrays. Nucleic Acids Research, accepted.

5. V. J. Lee, J. Yap, A. R. Cook, M. I. Chen, J. Tay, B. H. Tan, J. P. Loh, S. W. Chew, W. H. Koh, R. Lin, L. Cui, W. H. Lee, W. K. Sung, C. W. Wong, M. L. Hibberd, W. L. Kang, B. Seet, P. A. Tambyah. Oseltamivir ring prophylaxis for containment of Influenza A (H1N1-2009) outbreaks. New England Journal of Medicine, accepted.





SEQUENCE INDEXING AND DATABASE SEARCH


One of the important daily tasks of biologists is performing homology search. Currently, heuristic methods like BLAST are used for this task, whose running time is linear in the size of the database (for instance, the human genome is of size 3 billion bases). As the size of typical sequence databases is increasing rapidly, the performance of these tools is going to deteriorate. Thus it is important to have solutions to improve the performance of database search.


Our aim is to improve performance by utilizing sequence indexing techniques. We consider two directions. The first direction is to create a compressed index. The second one is to rely on disk-based indexing.


For compressed indexing, we devoted our effort to the compressed suffix array, which is a compressed index of biological sequences. We have developed a number of approximate string matching algorithms that use the compressed suffix array to speed up database search. Our solution has shown advantages in some specialized homology search problems in the biology domain. For disk-based indexing, we have developed the CPS-tree, which is a compact partitioned suffix tree on disk. For exact pattern searching, the performance of the CPS-tree is very good and the pattern searching time is independent of the genome length.
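To make the indexing idea concrete, the following is a minimal sketch of exact pattern search over a plain (uncompressed) suffix array. It is not the compressed suffix array or the CPS-tree described above; the function names and the toy sequence are assumptions chosen for illustration, and the naive construction is only suitable for short inputs. Once the index is built, a query needs about log n comparisons of at most m characters each, where m is the pattern length, instead of a scan over the whole sequence.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted suffix order."""
    # Naive construction by sorting suffixes; acceptable only for short sequences.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of every exact occurrence of pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


genome = "ACGTACGTGACG"          # hypothetical toy sequence
sa = build_suffix_array(genome)
hits = find_occurrences(genome, sa, "ACG")
print(hits)
```

All suffixes that begin with the pattern are adjacent in the suffix array, so the two binary searches delimit exactly the block of matches.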

Representative Publications

1. S.-S. Wong, W.-K. Sung, L. Wong. CPS-tree: A Compact Partitioned Suffix Tree for Disk-Based Indexing on Large Genome Sequences. Proceedings of 23rd IEEE International Conference on Data Engineering, pages 1350-1354, Istanbul, Turkey, April 2007.

2. W.-K. Hon, T.-W. Lam, K. Sadakane, W.-K. Sung. A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays. Algorithmica, 48(1):23-36, June 2007.

3. T. W. Lam, W. K. Sung, S. L. Tam, C. K. Wong, S. M. Yiu. Compressed indexing and local alignment of DNA. Bioinformatics, 24:791-797, March 2008.

4. T.-W. Lam, W.-K. Sung, S.-S. Wong. Improved Approximate String Matching Using Compressed Suffix Data Structures. Algorithmica, 51(3):298-314, July 2008.

5. W.-K. Hon, K. Sadakane, W.-K. Sung. Breaking a Time-and-Space Barrier in Constructing Full-Text Indices. SIAM J. Comput., 38(6):2162-2178, February 2009.



TAG SNP SELECTION AND ASSOCIATION STUDY


The human genome harbors millions of common single nucleotide polymorphisms (SNPs) and other types of genetic variations. Studying these variations is important for understanding how they correlate with human diseases and with the body's responses to prescribed drugs. The discovery of such genetic factors contributing to variations in drug response, efficiency, and toxicity has come to be known as pharmacogenomics.


There are a number of bioinformatics problems related to SNPs. We target two particular problems: (a) the tag SNP selection problem and (b) association study. Since there are millions of SNPs, it is difficult to investigate all of them. The tag SNP selection problem is the selection of a subset of informative SNPs to represent the major variations among the individuals. Given a set of SNPs for a set of individuals, an association study is the finding of a combination of SNPs which has the potential to cause a genetic disease.


Tag-SNP selection algorithms based on the r² LD statistic have gained popularity because r² is directly related to statistical power to detect disease associations. Most existing r²-based algorithms use pairwise LD. We proposed an efficient algorithm called FastTagger to calculate multi-marker tagging rules and select tag SNPs based on multi-marker LD. FastTagger used several techniques to reduce running time and memory consumption. Results showed that FastTagger was several times faster than existing multi-marker-based tag SNP selection algorithms, and it consumed
much less memory at the same time. As a result, FastTagger could work on chromosomes containing more than 100k SNPs using length-3 tagging rules. FastTagger also produced smaller sets of tag SNPs than existing multi-marker-based algorithms. The generated tagging rules could also be used for genotype imputation at >96% accuracy when r² ≥ 0.9.
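As a point of reference for the pairwise approach mentioned above, the sketch below computes the r² statistic between two biallelic SNPs from phased haplotypes and then picks tag SNPs greedily. This is the simpler pairwise scheme, not FastTagger's multi-marker rules; the function names, the 0/1 allele coding, the greedy strategy, and the four toy haplotypes are all assumptions made for illustration.

```python
def r_squared(haplotypes, i, j):
    """Pairwise r^2 LD between biallelic SNPs i and j (alleles coded 0/1)."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    d = p_ij - p_i * p_j  # the LD coefficient D
    return d * d / denom


def greedy_tag_snps(haplotypes, threshold=0.9):
    """Repeatedly pick the SNP that tags the most still-untagged SNPs."""
    m = len(haplotypes[0])
    # covers[i] = SNPs that SNP i can represent at the given r^2 threshold.
    covers = {
        i: {j for j in range(m)
            if j == i or r_squared(haplotypes, i, j) >= threshold}
        for i in range(m)
    }
    untagged = set(range(m))
    tags = []
    while untagged:
        # Ties are broken in favour of the smallest SNP index.
        best = max(range(m), key=lambda i: (len(covers[i] & untagged), -i))
        tags.append(best)
        untagged -= covers[best]
    return tags


# Hypothetical phased haplotypes: SNPs 0, 1, 2 are perfectly correlated,
# SNP 3 is independent of them.
haplotypes = [
    (0, 0, 1, 0),
    (0, 0, 1, 1),
    (1, 1, 0, 0),
    (1, 1, 0, 1),
]
tags = greedy_tag_snps(haplotypes)
print(tags)
```

In this toy input one tag suffices for the three correlated SNPs, and the independent SNP has to be selected on its own.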


We also developed a method for association study. Unlike most previous work, which performed single-SNP association, our method considered the association of variable-size haplotypes. Through regularized regression analysis, we tackled the problem of multiple degrees of freedom in haplotype tests. Our method could handle a large number of haplotypes in association analyses more efficiently and effectively than currently available approaches.

Representative Publications

1. Y. Li, W.-K. Sung, J. J. Liu. Association Mapping via Regularized Regression Analysis of Single-Nucleotide Polymorphism Haplotypes in Variable-Sized Sliding Windows. American Journal of Human Genetics, 80(4):705-715, April 2007.

2. G. Liu, Y. Wang, L. Wong. FastTagger: An Efficient Algorithm for Genome-wide Tag SNP Selection Using Multi-marker Linkage Disequilibrium. BMC Bioinformatics, 11:66, February 2010.



RECONSTRUCTING PHYLOGENETIC TREE, NETWORK, AND GENE CLUSTERS


A phylogenetic tree, also known as a cladogram or a dendrogram, is a tree describing several life forms and their relations. A phylogenetic network is a generalization of a phylogenetic tree. In addition to the ancestor-descendant mutational relationship, phylogenetic networks capture evolutionary events such as horizontal gene transfer and hybridization. Constructing a phylogenetic tree or network is helpful for understanding the history of life and for analyzing rapidly mutating viruses like HIV. The phylogenetic tree is also an important tool in comparative genomics. By comparing multiple species, it helps to predict protein structure, gene expression patterns, etc. Hence, it is important to have efficient and accurate tools for reconstructing phylogeny.

Our aim is to develop methods for constructing phylogenetic trees and networks and for detecting gene clusters.


The identification of conserved gene clusters is an important step towards understanding genome evolution and predicting the function of genes. The gene team is a model for conserved gene clusters that takes into account the position of genes on a genome. Existing algorithms for finding gene teams require the user to specify the maximum distance between adjacent genes in a team. However, determining suitable values for this parameter, δ, is non-trivial. Instead of trying to determine a single best value, we proposed constructing the gene team tree (GTT), which is a compact representation of all gene teams for every possible value of δ. Our algorithm for computing the GTT extended existing gene team mining algorithms without increasing their time complexity. We computed the GTT for E. coli and B. subtilis. We also described how to compute the GTT for multi-chromosomal genomes and illustrated this using the GTT for the human and mouse genomes.
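The role of δ can be illustrated with a small sketch that finds gene teams for one fixed δ by repeated splitting: a candidate set is cut wherever two neighbouring genes are more than δ apart in either genome, until no further cut is possible. This is a simple baseline, not the GTT algorithm itself. It assumes two single-chromosome genomes in which every gene occurs at most once, and the function names and positions are hypothetical.

```python
def split_by_gaps(genes, pos, delta):
    """Split genes into runs whose neighbouring positions differ by at most delta."""
    ordered = sorted(genes, key=lambda g: pos[g])
    runs, current = [], [ordered[0]]
    for prev, gene in zip(ordered, ordered[1:]):
        if pos[gene] - pos[prev] > delta:
            runs.append(current)
            current = []
        current.append(gene)
    runs.append(current)
    return runs


def gene_teams(pos_a, pos_b, delta):
    """Return the maximal gene sets that stay delta-connected in both genomes."""
    pending = [set(pos_a) & set(pos_b)]  # only genes present in both genomes
    teams = []
    while pending:
        genes = pending.pop()
        if not genes:
            continue
        runs = split_by_gaps(genes, pos_a, delta)
        if len(runs) == 1:
            runs = split_by_gaps(genes, pos_b, delta)
        if len(runs) == 1:
            teams.append(sorted(genes))  # cannot be split in either genome
        else:
            pending.extend(set(run) for run in runs)
    return sorted(teams)


# Hypothetical gene positions on two genomes.
pos_a = {"a": 1, "b": 2, "c": 3, "d": 8, "e": 9}
pos_b = {"a": 5, "b": 1, "c": 6, "d": 2, "e": 3}

for delta in (1, 2, 5):
    print(delta, gene_teams(pos_a, pos_b, delta))
```

Running the loop for increasing δ shows teams merging into larger ones, which is the nesting that a tree over all values of δ can represent compactly.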


Moreover, we proposed a novel pairwise gene cluster model that combined the notion of bidirectional best hits with the r-window model introduced in 2003 by Durand and Sankoff. The bidirectional best hit (BBH) constraint removed the need to specify the minimum number of shared genes in the r-window model and improved the relevance of the results. We designed a subquadratic time algorithm to compute the set of BBH r-window gene clusters efficiently. We applied our cluster model to the comparative analysis of E. coli K-12 and B. subtilis. An analysis of the most significant BBH r-window gene clusters showed that they were known operons.
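The bidirectional best hit notion on its own can be sketched at the level of individual genes as follows. The sketch covers only that gene-level notion: it does not implement the r-window model or the subquadratic algorithm, and the score table, gene names, and tie-breaking rule (first entry seen wins) are assumptions for illustration.

```python
def bidirectional_best_hits(scores):
    """scores maps (gene_in_A, gene_in_B) to a similarity score (higher is better)."""
    best_for_a, best_for_b = {}, {}
    for (a, b), score in scores.items():
        # Track the best-scoring partner of each gene, in both directions.
        if a not in best_for_a or score > scores[(a, best_for_a[a])]:
            best_for_a[a] = b
        if b not in best_for_b or score > scores[(best_for_b[b], b)]:
            best_for_b[b] = a
    # Keep only the pairs in which each gene is the other's best hit.
    return sorted((a, b) for a, b in best_for_a.items() if best_for_b[b] == a)


# Hypothetical similarity scores between genes of genome A and genome B.
scores = {
    ("a1", "b1"): 90,
    ("a1", "b2"): 40,
    ("a2", "b1"): 70,
    ("a2", "b2"): 60,
    ("a3", "b2"): 80,
}
pairs = bidirectional_best_hits(scores)
print(pairs)
```

Here gene a2 prefers b1, but b1 prefers a1, so a2 does not take part in any bidirectional best hit.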


We also developed accurate and efficient methods for constructing phylogenetic trees and networks. For phylogenetic networks, we gave the first polynomial time algorithm for combining a dense set of triplets into a galled phylogenetic network. Further, we generalized the method to reconstruct a
network by combining a set of phylogenetic trees. For phylogenetic trees, we studied the supertree problem. We developed a fixed-parameter polynomial time algorithm to construct a maximum agreement supertree of a set of phylogenetic trees.


Representative Publications

1. N. B. Nguyen, C. T. Nguyen, W.-K. Sung. Fast Algorithms for Computing the Tripartition-Based Distance between Phylogenetic Networks. Journal of Combinatorial Optimization, 13(3):223-242, April 2007.

2. C. T. Nguyen, N. B. Nguyen, W.-K. Sung, L. Zhang. Reconstructing Recombination Network from Sequence Data: The Small Parsimony Problem. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4(3):394-402, July-Sept 2007.

3. V. T. Hoang, W. K. Sung. Fixed Parameter Polynomial Time Algorithms for Maximum Agreement and Compatible Supertrees. Proceedings of 25th Annual Symposium on Theoretical Aspects of Computer Science (STACS), pages 361-372, Bordeaux, France, 21-23 February 2008.

4. M. Zhang, H. W. Leong. Gene Team Tree: A Hierarchical Representation of Gene Teams of All Gap Lengths. Journal of Computational Biology, 16(10):1383-1398, October 2009.

5. M. Zhang, H. W. Leong. Bidirectional Best Hit r-Window Gene Clusters. BMC Bioinformatics, 11(Suppl 1):S63, January 2010.



MODELING, SIMULATION AND ANALYSIS OF BIOLOGICAL PATHWAYS


Mathematical modeling is being increasingly recognized as a useful tool in the biomedical sciences. Models can uncover new phenomena to explore, identify key factors of a system, link different levels of detail, enable a formalization of intuitive understanding, screen unpromising hypotheses, predict variables inaccessible to measurement, and expand the range of questions that can meaningfully be asked.


We study models of bio-pathways consisting of large networks of bio-chemical reactions whose kinetics are described by ordinary differential equations (ODEs). Our goal is to develop computational techniques with which the dynamics defined by such models can be studied in a scalable and efficient manner. In particular, we have been developing techniques for tackling basic analysis problems such as parameter estimation and sensitivity analysis.
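As a small illustration of ODE-based kinetics and sensitivity analysis, the sketch below integrates a two-species toy model, the reversible reaction A <-> B with mass-action kinetics, and estimates how the final amount of B responds to the forward rate constant. The model, the rate values, the Runge-Kutta integrator, and the finite-difference scheme are all assumptions chosen for brevity; real pathway models are far larger and the lab's own techniques are not reproduced here.

```python
import math


def derivatives(state, k_forward, k_reverse):
    """Mass-action kinetics for the reversible reaction A <-> B."""
    a, b = state
    flux = k_forward * a - k_reverse * b
    return (-flux, flux)


def simulate(k_forward, k_reverse, state=(1.0, 0.0), t_end=5.0, steps=500):
    """Integrate the ODEs with the classical fourth-order Runge-Kutta method."""
    h = t_end / steps

    def shifted(base, slope, scale):
        return tuple(x + scale * d for x, d in zip(base, slope))

    for _ in range(steps):
        s1 = derivatives(state, k_forward, k_reverse)
        s2 = derivatives(shifted(state, s1, h / 2), k_forward, k_reverse)
        s3 = derivatives(shifted(state, s2, h / 2), k_forward, k_reverse)
        s4 = derivatives(shifted(state, s3, h), k_forward, k_reverse)
        state = tuple(
            x + h / 6 * (d1 + 2 * d2 + 2 * d3 + d4)
            for x, d1, d2, d3, d4 in zip(state, s1, s2, s3, s4)
        )
    return state


def sensitivity_to_k_forward(k_forward, k_reverse, rel_step=1e-4):
    """Central finite-difference estimate of d B(t_end) / d k_forward."""
    dk = k_forward * rel_step
    b_up = simulate(k_forward + dk, k_reverse)[1]
    b_down = simulate(k_forward - dk, k_reverse)[1]
    return (b_up - b_down) / (2 * dk)


def exact_b(k_forward, k_reverse, t=5.0):
    """Closed-form B(t) for this toy model, starting from A = 1 and B = 0."""
    total = k_forward + k_reverse
    return k_forward / total * (1 - math.exp(-total * t))


a_end, b_end = simulate(1.0, 0.5)
sens = sensitivity_to_k_forward(1.0, 0.5)
print(b_end, exact_b(1.0, 0.5), sens)
```

This toy model happens to have a closed-form solution, which makes it possible to check the integrator; larger pathway models generally do not, which is why numerical simulation is needed.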


Parameter estimation of large bio-pathway models is an important and difficult problem. To reduce the prohibitive computational cost, one approach is to decompose a large model into components and estimate their parameters separately. However, the decomposed components often share common parts that may have conflicting parameter estimates. We proposed to use belief propagation to reconcile these independent estimates in a principled manner and computed new estimates that were globally consistent and fitted the data well. An important advantage of our approach in practice was that it naturally handled incomplete or noisy data. Preliminary results based on synthetic data showed good performance.


Systems of ordinary differential equations (ODEs) are often used to model the dynamics of complex biological pathways. However, these ODE systems will generally not admit closed-form solutions. Instead, one has to resort to a large number of numerically generated trajectories to study the dynamics. This motivated us to construct a discrete state model as a probabilistic approximation of the ODE dynamics by discretizing the value space and the time
domain. We then sampled a representative set of trajectories and exploited the discretization and the structure of the signaling pathway to encode these trajectories compactly as a dynamic Bayesian network. As a result, many interesting pathway properties could be analyzed efficiently through standard Bayesian inference techniques. We tested our method on a model of the EGF-NGF signaling pathway and the results were very promising. This method is being applied, in collaboration with biologists, to study the innate immune response system in a comprehensive manner as well as DNA damage-repair pathways.
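The discretization idea can be sketched on a one-variable toy system. The code below samples trajectories of dx/dt = -k x with randomly drawn initial values and rate constants, bins the values at a few time points, estimates per-step transition tables by counting, and then pushes a distribution through those tables. It is only an illustration under strong assumptions: a single variable with no pathway structure, a closed-form trajectory in place of numerical integration, and hypothetical bin edges, sampling ranges, and function names. A real dynamic Bayesian network for a pathway would have one node per species per time point, with parents given by the reaction structure.

```python
import math
import random


def discretize(value, edges):
    """Return the index of the interval (delimited by edges) containing value."""
    for index, upper in enumerate(edges):
        if value < upper:
            return index
    return len(edges)


def sample_trajectory(rng, time_points):
    """Draw x0 and k at random, then evaluate x(t) = x0 * exp(-k * t)."""
    x0 = rng.uniform(0.5, 1.0)
    k = rng.uniform(0.2, 1.0)
    return [x0 * math.exp(-k * t) for t in time_points]


def learn_transition_tables(binned, n_bins):
    """Estimate P(bin at step s+1 | bin at step s) for every step by counting."""
    tables = []
    for step in range(len(binned[0]) - 1):
        counts = [[0] * n_bins for _ in range(n_bins)]
        for bins in binned:
            counts[bins[step]][bins[step + 1]] += 1
        table = []
        for row in counts:
            total = sum(row)
            # Bins never visited at this step get an all-zero row.
            table.append([c / total for c in row] if total else [0.0] * n_bins)
        tables.append(table)
    return tables


def propagate(initial, tables):
    """Push a distribution over bins forward through all transition tables."""
    dist = list(initial)
    for table in tables:
        dist = [sum(dist[i] * table[i][j] for i in range(len(dist)))
                for j in range(len(dist))]
    return dist


rng = random.Random(0)
time_points = [0.0, 1.0, 2.0, 3.0, 4.0]
edges = [0.25, 0.5, 0.75]
n_bins = len(edges) + 1
binned = [[discretize(x, edges) for x in sample_trajectory(rng, time_points)]
          for _ in range(2000)]
tables = learn_transition_tables(binned, n_bins)
initial = [sum(1 for bins in binned if bins[0] == b) / len(binned)
           for b in range(n_bins)]
final = propagate(initial, tables)
print(final)
```

Once the tables are learned, a question such as "how likely is the value to be in the lowest interval at the last time point" is answered from the tables alone, without generating further trajectories.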


Lastly, pathway model construction is often an inherently incremental process, with new pathway players and interactions continuously being discovered and additional experimental data being generated. We have therefore also focused on the problem of performing model parameter estimation incrementally by integrating new experimental data into an existing model. We used a probabilistic graphical model known as the factor graph to represent pathway parameter estimates. When new data arrived, the parameter estimates were refined efficiently by applying belief propagation to the factor graph. A key advantage of our approach was that the factor graph model contained enough information about the old data, so that only the new data were needed to refine the parameter estimates, without requiring explicit access to the old data. To test this approach, we applied it to the Akt-MAPK pathways. The results showed that our new approach could obtain parameter estimates that fitted the data well and refined them incrementally when new data arrived.
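The principle that a stored belief can stand in for old data can be shown with a much simpler device than a factor graph: a discretized belief over a single rate constant, updated by Bayes' rule one batch at a time. The sketch below is not the factor graph or belief propagation method described above. The decay model x(t) = exp(-k t), the Gaussian noise level, the parameter grid, and the synthetic observations are all assumptions made for illustration.

```python
import math


def model(k, t):
    """Toy pathway read-out: exponential decay with rate constant k."""
    return math.exp(-k * t)


def update(belief, grid, data, sigma=0.05):
    """Multiply the belief by the Gaussian likelihood of new (t, x) pairs, then renormalize."""
    weights = []
    for prob, k in zip(belief, grid):
        log_lik = sum(-0.5 * ((x - model(k, t)) / sigma) ** 2 for t, x in data)
        weights.append(prob * math.exp(log_lik))
    total = sum(weights)
    return [w / total for w in weights]


# Candidate values for k and a uniform initial belief over them.
grid = [0.1 + 0.05 * i for i in range(29)]
prior = [1.0 / len(grid)] * len(grid)

# Synthetic observations generated from k = 0.7 with small fixed offsets.
old_data = [(0.5, model(0.7, 0.5) + 0.01),
            (1.0, model(0.7, 1.0) - 0.01),
            (1.5, model(0.7, 1.5) + 0.005)]
new_data = [(2.0, model(0.7, 2.0) - 0.005),
            (3.0, model(0.7, 3.0) + 0.01)]

# Incremental route: the belief after the old data is all that is kept.
belief = update(prior, grid, old_data)
belief = update(belief, grid, new_data)

# Batch route for comparison: revisit every observation at once.
batch = update(prior, grid, old_data + new_data)

best_k = grid[max(range(len(grid)), key=lambda i: belief[i])]
print(best_k)
```

The incremental and batch beliefs agree up to rounding, so the second update never needed to see the old observations again.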


Representative Publications

1. G. Koh, L. Tucker-Kellogg, D. Hsu, P. S. Thiagarajan. Globally consistent pathway parameter estimates through belief propagation. Proceedings of 7th Workshop on Algorithms in Bioinformatics (WABI), pages 420-430, Philadelphia, September 2007.

2. B. Liu, P. S. Thiagarajan, D. Hsu. Probabilistic approximations of signaling pathways dynamics. Proc. 7th Conference on Computational Methods in Systems Biology (CMSB), pages 251-265, Bologna, Italy, August 2009.

3. G. Koh, D. Hsu, P. S. Thiagarajan. Incremental Signaling Pathway Modeling by Data Integration. Proceedings of 14th Annual International Conference on Research in Computational Molecular Biology (RECOMB), Lisbon, Portugal, April 2010. In press.






The faculty members involved in computational biology research are:

David HSU
Wynne HSU
LEE Mong LI
LEONG Hon Wai
SUNG Wing-Kin
TAN Kian Lee
P. S. Thiagarajan
Anthony TUNG
WONG Limsoon