TITLES, AUTHORS, ABSTRACTS FOR ISMB02




Paper01, Presentation02


Mining viral protease data to extract cleavage knowledge


Ajit Narayanan, Xikun Wu and Z. Rong Yang


Motivation: The motivation is to identify, through machine learning techniques,
specific patterns in HIV and HCV viral polyprotein amino acid residues where
viral protease cleaves the polyprotein as it leaves the ribosome. An
understanding of viral protease specificity may help the development of future
viral drugs involving protease inhibitors by identifying specific features
of protease activity for further experimental investigation. While viral
sequence information is growing at a fast rate, there is still comparatively
little understanding of how viral polyproteins are cut into their functional
unit lengths. The aim of the work reported here is to investigate whether it is
possible to generalise from known cleavage sites to unknown cleavage sites for
two specific viruses, HIV and HCV. An understanding of proteolytic activity
for specific viruses will contribute to our understanding of viral protease
function in general, thereby leading to a greater understanding of protease
families and their substrate characteristics.

Results: Our results show that artificial neural networks and symbolic learning
techniques (See5) capture some fundamental and new substrate attributes, but
neural networks outperform their symbolic counterpart.

Availability: Publicly available software was used (Stuttgart Neural Network
Simulator, tuebingen.de/SNNS/, and See5, http://www.rulequest.com). The
datasets used (HIV, HCV) for See5 are available at:
http://www.dcs.ex.ac.uk/~anarayan/bioinf/ismbdatasets/

Contact: a.narayanan@ex.ac.uk, z.r.yang@ex.ac.uk

Paper02, Presentation03


The metric space of proteins: comparative study of clustering algorithms


Ori Sasson, Nati Linial, Michal Linial


Motivation: A large fraction of biological research concentrates on individuals
and on small families of proteins. One of the current major challenges
in bioinformatics is to extend our knowledge also to very large sets of proteins.
Several major projects have tackled this problem. Such undertakings usually
start with a process that clusters all known proteins or large subsets of this
space. Some work in this area is carried out automatically, while other attempts
incorporate expert advice and annotation.

Results: We propose a novel technique that automatically clusters protein
sequences. We consider all proteins in SWISSPROT, and carry out an
all-against-all BLAST similarity test among them. With this similarity measure
in hand we proceed to perform a continuous bottom-up clustering process by applying
alternative rules for merging
clusters. The outcome of this clustering process
is a classification of the input proteins into a hierarchy of clusters of
varying degrees of granularity. Here we compare the clusters that result from
alternative merging rules, and validate the results against InterPro.

Our preliminary results show that clusters that are consistent with several
rather than a single merging rule tend to comply with InterPro annotation. This
is an affirmation of the view that the protein space consists of families that
differ markedly in their evolutionary conservation.
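The merging-rule comparison described above can be illustrated with a minimal bottom-up clustering sketch; the merge rules, toy distances and function names below are illustrative assumptions, not the authors' ProtoNet implementation.

```python
# Sketch only: bottom-up clustering with interchangeable merge rules, in the
# spirit of the abstract. The toy distances stand in for (transformed)
# BLAST dissimilarities; this is not the authors' code.

def agglomerate(dist, merge_rule):
    """dist maps frozenset({i, j}) -> dissimilarity between elements i and j.
    merge_rule is 'single' (min) or 'average' (mean). Returns merge history."""
    def d(a, b):
        vals = [dist[frozenset([i, j])] for i in a for j in b]
        return min(vals) if merge_rule == 'single' else sum(vals) / len(vals)

    clusters = {frozenset([i]) for pair in dist for i in pair}
    history = []
    while len(clusters) > 1:
        a, b = min(((a, b) for a in clusters for b in clusters if a != b),
                   key=lambda p: d(*p))
        clusters = (clusters - {a, b}) | {a | b}
        history.append((sorted(a), sorted(b), d(a, b)))
    return history

# Toy example: elements 0/1 and 2/3 form two tight pairs.
toy = {frozenset([0, 1]): 1.0, frozenset([0, 2]): 10.0, frozenset([0, 3]): 10.0,
       frozenset([1, 2]): 10.0, frozenset([1, 3]): 10.0, frozenset([2, 3]): 2.0}
history = agglomerate(toy, 'single')
```

Running the same toy input under 'single' versus 'average' is the kind of rule-by-rule comparison the abstract validates against InterPro.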

Availability: The outcome of these investigations can be viewed in an
interactive Web site at http://www.protonet.cs.huji.ac.il.

Supplementary information: Biological examples for comparing the performance of
the different algorithms used for classification are presented in

Contact: ori@cs.huji.ac.il

Paper03, Presentation04



Nicholas Steffen, Scott Murphy, Lorenzo Tolleri, Wesley Hatfield, Richard Lathrop


Motivation: Direct recognition, or direct readout, of DNA bases by a
DNA-binding protein involves amino acids that interact directly with features specific to
each base. Experimental evidence also shows that in many cases the protein
achieves partial sequence specificity by indirect recognition, i.e., by
recognizing structural properties of the DNA. (1) Could threading a DNA
sequence onto a crystal structure of bound DNA help explain the indirect
recognition component of sequence specificity? (2) Might the resulting
pure-structure computational motif manifest itself in familiar sequence
computational motifs?

Results: The starting structure motif was a crystal structure of DNA bound to
the integration host factor protein (IHF) of E. coli. IHF is known to
exhibit both direct and indirect recognition of its binding sites. (1)
Threading DNA sequences onto the crystal structure showed statistically
significant partial separation of 60 IHF binding sites from random and
intragenic sequences and was positively correlated with binding affinity. (2)
The crystal structure was shown to be equivalent to a linear Markov network, and
so, to a joint probability distribution over sequences, computable in linear
time. It was transformed algorithmically into several common pure-sequence
representations, including (a) small sets of short exact strings, (b) weight
matrices, (c) consensus regular patterns, (d) multiple sequence alignments, and
(e) phylogenetic trees. In all cases the pure-sequence motifs retained
statistically significant partial separation of the IHF binding sites from
random and intragenic sequences. Most exhibited positive correlation with
binding affinity. The multiple alignment showed some conserved columns, and the
phylogenetic tree partially mixed low-energy sequences with IHF binding sites
but separated high-energy sequences. The conclusion is that deformation energy
explains part of indirect recognition, which explains part of IHF sequence
specific binding.
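The equivalence noted above, that the structure-derived model acts as a linear Markov network whose joint probability over sequences is computable in linear time, can be illustrated with a generic first-order Markov chain; the parameters below are arbitrary stand-ins, not the IHF-derived network.

```python
import math

# Illustrative sketch: a first-order (linear) Markov model assigns a joint
# probability to a DNA sequence in time linear in its length. These numbers
# are arbitrary, not the paper's deformation-energy-derived parameters.

init = {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}                # P(first base)
trans = {x: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25} for x in 'CGT'}
trans['A'] = {'A': 0.4, 'C': 0.1, 'G': 0.1, 'T': 0.4}          # P(next | prev = A)

def log_prob(seq):
    """Log joint probability of seq under the chain, one term per position."""
    lp = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(trans[prev][cur])
    return lp
```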

Availability: Code and data on request.

Contact: Nick Steffen for code and Lorenzo Tolleri for data. nsteffen@uci.edu ,

Paper04, Presentation05


Beyond tandem repeats: complex pattern structures and distant regions of similarity


Amy M. Hauth, Deborah A. Joseph


Motivation: Tandem repeats (TRs) are associated with human disease, play a role
in evolution and are important in regulatory processes. Despite their
importance, locating and characterizing these patterns within anonymous DNA
sequences remains a challenge. In part, the difficulty is due to imperfect
conservation of patterns and complex pattern structures. We study recognition
algorithms for two complex pattern structures: variable length tandem repeats
(VLTRs) and multi-period tandem repeats (MPTRs).

Results: We extend previous algorithmic research to a class of regular tandem
repeats (RegTRs). We formally define RegTRs, as well as, two important
subclasses: VLTRs and MPTRs. We present algorithms for identifi
cation of TRs in
these classes. Furthermore, our algorithms identify degenerate VLTRs and MPTRs:
repeats containing substitutions, insertions and deletions. To illustrate our
work, we present results of our analysis for two difficult regions in cattle and
human which reflect practical occurrences of these subclasses in GenBank
sequence data. In addition, we show the applicability of our algorithmic
techniques for identifying Alu sequences, gene clusters and other distant
regions of similarity. We illustrate this with an example from yeast chromosome

Availability: Algorithms can be accessed at http://www.cs.wisc.edu/areas/theory.

Contact: Amy M. Hauth kryder@cs.wisc.edu (608 2164) or Deborah A. Joseph
joseph@cs.wisc.edu (608 8022), FAX: 608

Paper05, Presentation06


The POPPs: Clustering and Searching Using Peptide Probability Profiles


Michael J. Wise


The POPPs is a suite of inter-related software tools which allow the user to
discover what is statistically "unusual" in the composition of an unknown
protein, or to automatically cluster proteins into families based on peptide
composition. Finally, the user can search for related proteins based on peptide
composition. Statistically based peptide composition provides a view of
proteins that is, to some extent, orthogonal to that provided by sequence. In a
test study, the POPP suite is able to regroup into their families sets of
approximately 100 randomised Pfam protein domains. The POPPs suite is used to
explore the diverse set of late embryogenesis abundant (LEA) proteins.

Availability: Contact the author.

Contact: mw263@cam.ac.uk

Paper06, Presentation08


A sequence profile based HMM for predicting and discriminating beta barrel
membrane proteins


Pier Luigi Martelli, Piero Fariselli, Anders Krogh, Rita Casadio


Motivation: Membrane proteins are an abundant and functionally relevant subset
of proteins that putatively include from about 15 up to 30% of the proteome of
organisms fully
sequenced. These estimates are mainly computed on the basis of
sequence comparison and membrane protein prediction. It is therefore urgent to
develop methods capable of selecting membrane proteins, especially in the case
of outer membrane proteins, barely taken into consideration when proteome-wide
analysis is performed. This will also help protein annotation when no homologous
sequence is found in the database. Outer membrane proteins solved so far at
atomic resolution interact with the external membrane of bacteria with a
characteristic β-barrel structure comprising different even numbers of
β-strands (β-barrel membrane proteins). In this they differ from the membrane
proteins of the cytoplasmic membrane endowed with alpha-helix bundles
(all-alpha membrane proteins) and need specialised predictors.

Results: We develop an HMM which can predict the topology of β-barrel membrane
proteins, using evolutionary information as input. The model is cyclic with 6
types of states: two for the β-strand transmembrane core, one for the β-strand
cap on either side of the membrane, one for the inner loop, one for the outer
loop and one for the globular domain state in the middle of each loop. The
development of a specific input for HMM based on multiple sequence alignment is
novel. The accuracy per residue of the model is 82% when a jack-knife procedure
is adopted. With a model optimisation method using a dynamic programming
algorithm, seven topological models out of the twelve proteins included in the
testing set are also correctly predicted. When used as a discriminator, the
model is rather selective. At a fixed probability value, it retains 84% of a
redundant set comprising 145 sequences of well-annotated outer membrane
proteins. Concomitantly, it correctly rejects 90% of a set of globular proteins
including about 1200 chains with low sequence identity (< 30%) and 90% of a set
of all-alpha membrane proteins, including 188 chains.
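The evaluation step of such a model can be sketched with the standard HMM forward algorithm; the two-state toy parameters below are assumptions for illustration, not the authors' six-state-type, profile-input β-barrel model.

```python
import numpy as np

# A generic forward-algorithm sketch for scoring a sequence under an HMM.
# Toy two-state parameters; the paper's model has six state types and a
# novel multiple-alignment (profile) input, neither of which is shown here.

A = np.array([[0.7, 0.3],      # transitions between states
              [0.4, 0.6]])
E = np.array([[0.9, 0.1],      # emissions: rows = states, cols = symbols
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])      # initial state distribution

def forward(obs):
    """P(obs) summed over all state paths, in O(len(obs) * n_states^2)."""
    alpha = pi * E[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * E[:, o]
    return float(alpha.sum())
```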

Availability: The program will be available on request from the authors.

Contact: gigi@lipid.biocomp.unibo.it, www.biocomp.unibo.it

Paper07, Presentation09


Fully automated ab initio protein structure prediction using I-SITES, HMMSTR and ROSETTA


Christopher Bystroff, Yu Shao


Motivation: The Monte Carlo fragment insertion method for protein tertiary
structure prediction (ROSETTA) of Baker and others, has been merged with the
I-SITES library of sequence-structure motifs and the HMMSTR model for local
structure in proteins, to form a new public server for the ab initio prediction
of protein structure. The server performs several tasks in addition to tertiary
structure prediction, including a database search, amino acid profile generation,
fragment structure prediction, and backbone angle and secondary structure
prediction. Meeting reasonable service goals required improvements in
efficiency, in particular for the ROSETTA algorithm.

Results: The new server was used for blind predictions of 40 protein sequences
as part of the CASP4 blind structure prediction experiment. The results for 31
of those predictions are presented here. 61% of the residues overall were found
in topologically correct predictions, which are defined as fragments of 30
residues or more with a root-mean-square deviation in superimposed alpha
carbons of less than 6Å. HMMSTR 3-state secondary structure predictions were 73% correct
overall. Tertiary structure predictions did not improve the accuracy of
secondary structure prediction.

Availability: The server is accessible through the web at
http://isites.bio.rpi.edu/hmmstr/index.html. Programs are available upon
request for academics. Licensing agreements are available for commercial

Supplementary information: http://isites.bio.rpi.edu,

Contacts: bystrc@rpi.edu, shaoy@rpi.edu

Paper08, Presentation10


Prediction of Contact Maps by GIOHMMs and Recurrent Neural Networks Using
Lateral Propagation From All Four Cardinal Corners


Gianluca Pollastri, Pierre Baldi


Motivation: Accurate prediction of protein contact maps is an important step in
computational structural proteomics. Because contact maps provide a translation
computational structural proteomics. Because contact maps provide a translation
and rotation invariant topological representation of a protein, they can be used
as a fundamental intermediary step in protein structure prediction.

Results: We develop a new set of flexible machine learning architectures for the
prediction of contact maps, as well as other information processing and pattern
recognition tasks. The architectures can be viewed as recu
rrent neural network
parameterizations of a class of Bayesian networks we call generalized input
output HMMs. For the specific case of contact maps, contextual information is
propagated laterally through four hidden planes, one for each cardinal corner.
e show that these architectures can be trained from examples and yield contact
map predictors that outperform previously reported methods. While several
extensions and improvements are in progress, the current version can accurately
predict 60.5% of contacts at a distance cutoff of 8Å and 45% of distant contacts
at 10Å, for proteins of length up to 300.
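The contact-map representation the paper predicts can be sketched directly by thresholding pairwise C-alpha distances at the 8 Å cutoff stated above; the coordinates below are toy values, not a real protein.

```python
import numpy as np

# Sketch: a contact map is a thresholded C-alpha distance matrix; here the
# 8 Å cutoff mentioned in the abstract. Toy coordinates for illustration.

def contact_map(coords, cutoff=8.0):
    """coords: (N, 3) array of C-alpha positions -> (N, N) boolean contact map."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist < cutoff

ca = np.array([[0.0, 0.0, 0.0],     # three residues on a line:
               [3.8, 0.0, 0.0],     # residues 0 and 1 are in contact,
               [20.0, 0.0, 0.0]])   # residue 2 is far from both
cm = contact_map(ca)
```

The resulting map is symmetric and invariant to translation and rotation of the coordinates, which is the representational property the abstract highlights.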

Availability and Contact: The contact map predictor will be made available at
PRED/ as part of an existing suite of proteomics predictors.

Email: {gpollast,pfbaldi}@ics.uci.edu

Paper09, Presentation11


Rate4Site: An Algorithmic Tool for the Identification of Functional Regions on
Proteins by Surface Mapping of Evolutionary Determinants within Their Homologues


Tal Pupko, Rachel Bell, Itay Mayrose, Fabian Glaser, Nir Ben-Tal


Motivation: A number of proteins of known three-dimensional (3D) structure
exist, with yet unknown function. In light of the recent progress in structure
with yet unknown function. In light of the recent progress in structure
determination methodology, this number is likely to increase rapidly. A novel
method is presented here: "Rate4Site", which maps the rate of evolution among
homologous proteins onto the molecular surface of one of the homologues whose
structure is known. Functionally important regions correspond to surface
patches of slowly evolving residues.

Results: Rate4Site estimates the rate of evolution of amino acid sites using the
maximum likelihood (ML) principle. The ML estimate of the rates considers the
topology and branch lengths of the phylogenetic tree, as well as the underlying
stochastic process. To demonstrate its potency, we study the Src SH2 domain.
Like previously established methods, Rate4Site detected the SH2
peptide-binding groove. Interestingly, it also detected inter-domain
interactions between the SH2 domain and the rest of the Src protein that other
methods failed to detect.

Availability: Rate4Site can be downloaded at: http://ashtoret.tau.ac.il/

Contact: tal@ism.ac.jp; rebell@ashtoret.tau.ac.il; fabian@ashtoret

Supplementary Information: multiple sequence alignment of homologous domains
from the SH2 protein family, the corresponding phylogenetic tree and additional
examples are available at http://ashtoret.tau.ac.il/~rebell

Paper10, Presentation12


Inferring sub-cellular localization through automated lexical analysis


Rajesh Nair, Burkhard Rost


Motivation: The SWISS-PROT sequence database contains keywords of functional
annotations for many proteins. In contrast, information about sub-cellular
localization is only available for few proteins. Experts can often infer
localization from keywords describing protein function. We developed LOCkey, a
fully automated method for lexical analysis of SWISS-PROT keywords that assigns
sub-cellular localization. With the rapid growth in sequence data, the
biochemical characterisation of sequences has been falling behind. Our method
may be a useful tool for supplementing functional information already
available.

Results: The method reached a level of more than 82% accuracy in a full
cross-validation test. Due to a lack of functional annotations, we could infer
localization for less than half of all proteins in SWISS-PROT. We applied
LOCkey to annotate five entirely sequenced proteomes, namely Saccharomyces
cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster
(fly), Arabidopsis thaliana (plant) and a subset of all human proteins. LOCkey
found about 8000 new annotations of sub-cellular localization for these
eukaryotes.

Availability: Annotations of localization for eukaryotes at:

Contact: rost@columbia.edu

Paper11, Presentation14


Support vector regression applied to the determination of the developmental age
of a Drosophila embryo from its segmentation gene expression patterns


Ekaterina Myasnikova, Anastassia Samsonova, John Reinitz, Maria Samsonova


Motivation: In this paper we address the problem of determining the
developmental age of an embryo from its segmentation gene expression patterns in

Results: By applying support vector regression we have developed a fast method
for automated staging of an embryo on the basis of its gene exp
ression pattern.
Support vector regression is a statistical method for creating regression
functions of arbitrary type from a set of training data. The training set is
composed of embryos for which the precise developmental age was determined by
the degree of membrane invagination. Testing the quality of
regression on the training set showed good prediction accuracy. The optimal
regression function was then used for the prediction of the
gene-expression-based age of embryos in which the precise age has not been
measured by membrane morphology. Moreover, we show that the same accuracy of prediction can be
achieved when the dimensionality of the feature vector was reduced by applying
factor analysis. The data reduction allowed us to avoid over-fitting and to
increase the efficiency of the algorithm.

Availability: This software may be obtained from the authors.

Contact: samson@fn.csa.ru

Paper12, Presentation15


Variance stabilization applied to microarray data calibration and to the
quantification of differential expression


Wolfgang Huber, Anja von Heydebreck, Holger Sueltmann, Annemarie Poustka, Martin Vingron


We introduce a statistical model for microarray gene expression data that
comprises data calibration, the quantification of differential expression, and
the quantification of measurement error. In particular, we derive a
transformation $h$ for intensity measurements, and a difference statistic
$\Delta h$ whose variance is approximately constant along the whole intensity
range. This
forms a basis for statistical inference from microarray data, and provides a
rational data pre-processing strategy for multivariate analyses. For the
transformation $h$, the parametric form h(x) = arsinh(a+bx) is derived from a
model of the variance-versus-mean dependence for microarray intensity data,
using the method of variance-stabilizing transformations. For large
intensities, h coincides with the logarithmic transformation, and $\Delta h$
with the log-ratio. The parameters of h together with those of the calibration
between experiments are
estimated with a robust variant of maximum-likelihood estimation. We
demonstrate our approach on data sets from different experimental platforms,
including two-color cDNA arrays and a series of Affymetrix oligonucleotide
arrays.
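The stated large-intensity behaviour of h can be checked numerically; a and b below are arbitrary demonstration values, not fitted calibration parameters.

```python
import numpy as np

# Numerical check of the large-intensity behaviour of h(x) = arsinh(a + b*x)
# described in the abstract: differences of h approach log-ratios.
# a and b are arbitrary demo values, not estimated calibration parameters.

a, b = 1.0, 0.1

def h(x):
    return np.arcsinh(a + b * np.asarray(x, dtype=float))

x1, x2 = 1e4, 1e6
delta_h = h(x2) - h(x1)          # the Delta-h difference statistic
log_ratio = np.log(x2 / x1)      # plain log-ratio of the two intensities
```

Unlike the log transform, h stays finite and well behaved near zero intensity, which is the motivation for using it across the whole range.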

Availability: Software is freely available for academic use as an R package at

Contact: w.huber@dkfz.de

Paper13, Presentation16


Binary Tree-Structured Vector Quantization Approach to Clustering and
Visualizing Microarray Data


M. Sultan, D. Wigle, C.A. Cumbaa, M. Maziarz, Janice Glasgow, M.S. Tsao and
Igor Jurisica


Motivation: With the increasing number of gene expression databases, the need
for more powerful analysis and visualization tools is growing. Many techniques
have successfully been applied to unravel latent similarities among genes and/or
experiments. Most of the current systems for microarray data analysis use
statistical methods, hierarchical clustering, self-organizing maps, support
vector machines, or k-means clustering to organize genes or experiments into
"meaningful" groups. Without prior explicit bias almost all of these clustering
methods applied to gene expression data not only produce different results, but
may also produce clusters with little or no biological relevance. Of these
methods, agglomerative hierarchical clustering has been the most widely applied,
although many limitations have been identified.

Results: Starting with a systematic comparison of the underlying theories
behind clustering approaches, we have devised a technique that combines
tree-structured vector quantization and partitive k-means clustering (BTSVQ).
This hybrid technique has revealed clinically relevant clusters in three large
publicly available data sets. In contrast to existing systems, our approach is less
sensitive to data preprocessing and data normalization. In addition, the
clustering results produced by the technique have strong similarities to those
of self-organizing maps (SOMs). We discuss the advantages and the mathematical
reasoning behind our approach.

Availability: The BTSVQ system is implemented in Matlab R12 using the SOM
toolbox for visualization and preprocessing. BTSVQ is available for
non-commercial use (http://www.uhnres.utoronto.ca/ta3/BTSVQ).

Contact: ij@uhnres.utoronto.ca

Paper14, Presentation17


A Variance-Stabilizing Transformation for Gene-Expression Microarray Data


Blythe Durbin, Johanna Hardin, Douglas Hawkins, David Rocke


Motivation: Standard statistical techniques often assume that data are normally
distributed, with constant variance not depending on the mean of the data. Data
that violate these assumptions can often be brought in line with the
assumptions by application of a transformation. Gene-expression microarray
data have a complicated error structure, with a variance that changes with the
mean in a non-linear fashion. Log transformations, which are often applied to
microarray data, can inflate the variance of observations near background.

Results: We introduce a transformation that stabilizes the variance of
microarray data across the full range of expression. Simulation studies also
suggest that this transformation approximately symmetrizes microarray data.

Contact: bpdurbin@wald.ucdavis.edu

Paper15, Presentation18


Linking Gene Expression Data with Patient Survival Times Using Partial Least Squares


Peter Park, Lu Tian, Isaac Kohane


There is an increasing need to link the large amount of genotypic data, gathered
using microarrays for example, with various phenotypic data from patients. The
classification problem in which gene expression data serve as predictors and a
class label phenotype as the binary outcome variable has been examined
extensively, but there has been less emphasis in dealing with other types of
phenotypic data. In particular, patient survival times with censoring are often
not used directly as a response variable due to the complications that arise
from censoring.

We show that the issues involving censored data can be circumvented by
reformulating the problem as a standard Poisson regression problem. The
procedure for solving the transformed problem is a combination of two approaches:
partial least squares, a regression technique that is especially effective when
there is severe collinearity due to a large number of predictors, and
generalized linear regression, which extends standard linear regression to deal
with various types of response variables. The linear combinations of the
original variables identified by the method are highly correlated with the
patient survival times and at the same time account for the variability in the
covariates. The algorithm is fast, as it does not involve any matrix
decompositions in the iterations. We apply our method to data sets from lung
decompositions in the iterations. We apply our method to data sets from lung
carcinoma and diffuse large B-cell lymphoma studies to verify its effectiveness.
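The generalized-linear-model half of the approach can be sketched as a plain Poisson regression fitted by iteratively reweighted least squares; the paper's method additionally replaces the least-squares step with partial least squares to cope with many collinear expression predictors, which this toy sketch with one predictor does not attempt.

```python
import numpy as np

# Sketch of the GLM component only: Poisson regression via iteratively
# reweighted least squares (IRLS) on toy data. The PLS substitution that the
# paper makes for high-dimensional, collinear predictors is omitted here.

def poisson_irls(X, y, n_iter=25):
    """Fit E[y] = exp(X @ beta); X should include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)                 # current fitted means
        z = X @ beta + (y - mu) / mu          # working response
        XtW = (X * mu[:, None]).T             # Poisson working weights = mu
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
X = np.column_stack([np.ones(200), x])
y = rng.poisson(np.exp(0.5 + 1.5 * x))        # true intercept 0.5, slope 1.5
beta = poisson_irls(X, y)
```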

Contact: peter_park@harvard.edu

Paper16, Presentation


Microarray Synthesis through Multiple-Use PCR Primer Design


Rohan Fernandes, Steven Skiena


A substantial percentage of the expense in constructing full-genome spotted
microarrays comes from the cost of synthesizing the PCR primers to amplify the
desired DNA. We propose a computationally based method to substantially reduce
this cost. Historically, PCR primers are designed so that each primer occurs
uniquely in the genome. This condition is unnecessarily strong for selective
amplification, since only the primer pair associated with each amplification
need be unique. We demonstrate that careful design in a genome amplification
project permits us to save the cost of several thousand primers over
conventional approaches.

Contact: {skiena, rohan}@cs.sunysb.edu

Paper17, Presentation21


Discovering Statistically Significant Biclusters in Gene Expression Data


Amos Tanay, Roded Sharan, Ron Shamir


In gene expression data, a bicluster is a subset of the genes exhibiting
consistent patterns over a subset of the conditions. We propose a new method to
detect significant biclusters in large expression datasets. Our approach is
graph-theoretic coupled with statistical modeling of the data. Under plausible
assumptions, our algorithm is polynomial and is guaranteed to find the most
significant biclusters. We tested our method on a collection of yeast
expression profiles and on a human cancer dataset. Cross-validation results
show high specificity in assigning function to genes based on their biclusters,
and we are able to annotate in this way 196 uncharacterized yeast genes. We also
demonstrate how the biclusters lead to detecting new concrete biological
associations. In cancer data we are able to detect and relate finer tissue
types than was previously possible. We also show that the method outperforms
the biclustering algorithm of Cheng and Church (2000).

Contact: {amos,roded,rshamir}@tau.ac.il.

Availability: www.cs.tau.ac.il/~rshamir/biclust.html.

Paper18, Presentation22


Clustering of Biological Networks and Gene Expression Data


Daniel Hanisch, Alexander Zien, Ralf Zimmer, Thomas Lengauer


Motivation: Large scale gene expression data are often analyzed by clustering
algorithms based on gene expression data alone, though a priori knowledge in
the form of biological networks is available. The use of this additional
information promises to improve exploratory analysis considerably.

Results: We propose to construct a distance function which combines information
from expression data and biological networks. Based on this function, we compute
a joint clustering of genes and vertices of the network. This general approach
is elaborated for metabolic networks. We define a graph distance function on
such networks and combine it with a correlation-based distance function for
gene expression measurements. A hierarchical clustering and an associated statistical
measure is computed to arrive at a reasonable number of clusters. Our method is
validated using expression data of the yeast diauxic shift. The resulting
clusters are easily interpretable in terms of the biochemical network and the
gene expression data and suggest that our method is able to automatically
identify processes that are relevant under the measured conditions.

Contact: Daniel.Hanisch@scai.fhg.de

Paper19, Presentation23


Statistical process control for large scale microarray experiments


Fabian Model, Thomas Koenig, Christian Piepenbrock, Peter Adorjan


Motivation: Maintaining and controlling data quality is a key problem in large
scale microarray studies. Especially systematic changes in experimental
conditions across multiple chips can seriously affect quality and even lead to
false biological conclusions. Traditionally the influence of these effects can
only be minimized by expensive repeated measurements, because a detailed
understanding of all process relevant parameters seems impossible.

Results: We introduce a novel method for microarray process control that
estimates quality solely based on the distribution of the actual measurements
without requiring repeated experiments. A robust version of principal component
analysis detects single outlier microarrays and thereby enables the use of
techniques from multivariate statistical process control. In particular, the
T^2 control chart reliably tracks undesired changes in process-relevant
parameters. This can be used to improve the microarray process itself, limit
necessary repetitions to affected samples and therefore maintain quality in a
cost-effective way. We prove the power of the approach on 3 large sets of DNA
methylation microarray data.
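A minimal sketch of the T^2 control-chart idea, assuming each chip is summarized by a small quality-feature vector scored against an in-control mean and covariance; the data below are synthetic, not methylation arrays, and the feature choice is an assumption.

```python
import numpy as np

# Sketch of a Hotelling T^2 score as used in multivariate statistical process
# control: each chip's quality-feature vector gets a Mahalanobis-type distance
# from the in-control distribution. Synthetic data for illustration only.

def t_squared(samples, mean, cov):
    """T^2 score for each row of `samples`."""
    dev = np.atleast_2d(samples) - mean
    return np.einsum('ij,jk,ik->i', dev, np.linalg.inv(cov), dev)

rng = np.random.default_rng(1)
in_control = rng.normal(size=(100, 3))                  # historical chips
mean, cov = in_control.mean(axis=0), np.cov(in_control, rowvar=False)
scores = t_squared(in_control, mean, cov)               # stays low
outlier = t_squared(np.array([[8.0, 8.0, 8.0]]), mean, cov)  # flags the shift
```

Charting these scores against a fixed control limit is what lets a drifting process step be caught without repeating measurements.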

Contact: Fabian.Model@epigenomics.com

Paper20, Presentation24


Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene
Expression Arrays


John Tobler, Mike Molla, Jude Shavlik


Motivation: Microarrays are a fast and cost-effective method of performing
thousands of DNA hybridization experiments simultaneously. DNA probes are
typically used to measure the expression level of specific genes. Because
probes greatly vary in the quality of their hybridizations, choosing good probes
is a difficult task. If one could accurately choose probes that are likely to
hybridize well, then fewer probes would be needed to represent each gene in a
gene-expression microarray, and, hence, more genes could be placed on an array
of a given physical size. Our goal is to empirically evaluate how successfully
three standard machine-learning algorithms (naïve Bayes, decision trees, and
artificial neural networks) can be applied to the task of predicting good
probes. Fortunately it is relatively easy to get training examples for such a
learning task: place various probes on a gene chip, add a sample where the
corresponding genes are highly expressed, and then record how well each probe
measures the presence of its corresponding gene. With such training examples,
it is possible that an accurate predictor of probe quality
can be learned.

Results: Two of the learning algorithms we investigate, naïve Bayes and neural
networks, learn to predict probe quality surprisingly well. For example, in
the top ten predicted probes for a given gene not used for training, on average
about five rank in the top 2.5% of that gene's hundreds of possible probes.
Decision-tree induction and the simple approach of using predicted melting
temperature to rank probes perform significantly worse than these two algorithms.
The features we use to represent probes are very easily computed and the time
taken to score each candidate probe after training is minor. Training the naïve
Bayes algorithm takes very little time, and while it takes over 10 times as long
to train a neural network, that time
is still not very substantial (on the order
of a few hours on a desktop workstation). We also report the information
contained in the features we use to describe the probes. We find the fraction
of cytosine in the probe to be the most informative feature.

We also find, not surprisingly, that the nucleotides in the middle of the
probe's sequence are more informative than those at the ends of the sequence.
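The kind of cheaply computed probe features described above can be sketched as follows; the feature names and set are illustrative assumptions, not the authors' exact representation.

```python
# Sketch of easily computed probe features of the kind the abstract describes.
# The exact feature set is not given there, so these choices are illustrative.

def probe_features(probe):
    n = len(probe)
    return {
        'frac_C': probe.count('C') / n,   # cytosine fraction: reported most informative
        'frac_G': probe.count('G') / n,
        'middle_base': probe[n // 2],     # middle positions reported more informative
    }

feats = probe_features('ACGTACGTCCGG')
```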

Contact: molla@cs.wisc.edu

Paper21, Presentation26


The Degenerate Primer Design Problem


Chaim Linhart, Ron Shamir


A PCR primer sequence is called degenerate if some of its positions have several
possible bases. The degeneracy of the primer is the number of unique sequence
combinations it contains. We study the problem of designing a pair of primers
with prescribed degeneracy that match a maximum number of given input sequences.
Such problems occur when studying a family of genes that is known only in part,
or is known in a related species. We prove that various simplified v
ersions of
the problem are hard, show the polynomiality of some restricted cases, and
develop approximation algorithms for one variant. Based on these algorithms, we
implemented a program called HYDEN for designing highly
degenerate primers for a
set of g
enomic sequences. We report on the success of the program in an
experimental scheme for identifying all human olfactory receptor (OR) genes. In
that project, HYDEN was used to design primers with degeneracies up to 10^10
that amplified with high specific
ity many novel genes of that family, tripling
the number of OR genes known at the time.
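The degeneracy definition above is directly computable: it is the product, over primer positions, of the number of bases each IUPAC ambiguity symbol allows. A small sketch (the `matches` and `coverage` helpers are illustrative names, not part of HYDEN):

```python
# IUPAC nucleotide ambiguity codes mapped to the bases they allow.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}

def degeneracy(primer):
    """Number of distinct plain sequences the degenerate primer represents."""
    prod = 1
    for sym in primer.upper():
        prod *= len(IUPAC[sym])
    return prod

def matches(primer, site):
    """True if every site base is allowed at the corresponding position."""
    return (len(primer) == len(site) and
            all(b in IUPAC[p] for p, b in zip(primer.upper(), site.upper())))

def coverage(primer, sites):
    """How many input sites the primer matches (the quantity to maximize)."""
    return sum(matches(primer, s) for s in sites)
```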

Availability: Available on request from the authors

Contact: {chaiml,rshamir}@tau.ac.il

Paper22, Presentation27


Splicing Graphs and EST Assembly Problem


Steffen Heber, Max Alekseyev, Sing-Hoi Sze, Haixu Tang, Pavel Pevzner


Motivation: The traditional approach to annotating alternative splicing is to
investigate every splicing variant of the gene in a case-by-case fashion. This
approach, while useful, has some serious shortcomings. Recent studies indicate
that alternative splicing is more frequent than previously thought, and some
genes may produce tens of thousands of different transcripts. A list of
alternatively spliced variants for such genes would be difficult to build and
hard to analyze. Moreover, such a list shows neither the relationships between
different transcripts nor the overall structure of all transcripts. A better
approach is to represent all splicing variants for a given gene in a way that
captures the relationships between different splicing variants.

Results: We introduce the notion of the splicing graph, a natural and
convenient representation of all splicing variants. The key difference from
existing approaches is that we abandon the linear (sequence) representation of
each transcript and substitute a graph representation in which each transcript
corresponds to a path. We further design an algorithm that assembles EST reads
into the splicing graph rather than into each splicing variant in a
case-by-case fashion.
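A minimal sketch of the representation described above: vertices are exon (or block) identifiers, and every transcript contributes one path, so shared exons are stored once and alternative variants branch. This illustrates the data structure only, not the paper's EST-assembly algorithm:

```python
from collections import defaultdict

def splicing_graph(transcripts):
    """Build a splicing-graph adjacency map from transcripts given as
    ordered tuples of exon identifiers; each transcript becomes a path."""
    adj = defaultdict(set)
    vertices = set()
    for t in transcripts:
        vertices.update(t)
        for a, b in zip(t, t[1:]):   # consecutive exons become edges
            adj[a].add(b)
    return vertices, {v: sorted(adj[v]) for v in adj}
```

For two variants sharing flanking exons, `splicing_graph([("e1","e2","e4"), ("e1","e3","e4")])` yields four vertices with a branch at `e1`.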

Availability: http://www

Contact: sheber@ucsd.edu

Paper23, Presentation28


Exact Genetic Linkage
Computations for General Pedigrees


Ma'ayan Fishelson, Dan Geiger


Motivation: Genetic linkage analysis is a useful statistical tool for mapping
disease genes and for associating functionality of genes with their location
on the chromosome. There is a need for a program that computes multipoint
likelihood on general pedigrees with many markers and that also deals with two
disease models.

Results: In this paper we present algorithms for performing exact multipoint
likelihood calculations on general pedigrees with a large number of highly
polymorphic markers, taking into account a variety of disease models. We have
implemented these algorithms in a new computer program called SUPERLINK, which
outperforms leading linkage software with regard to functionality, speed,
memory requirements and extensibility.

Availability: SUPERLINK is available at

Contact: {fmaayan, dang}@cs.technion.ac.il

Paper24, Presentation29


Assigning Probes into a Small Number of Pools Separable by Electrophoresis


Teemu Kivioja, Mikko Arvas, Kari Kataja, Merja Penttilä, Hans Söderlund, Esko


Motivation: Measuring transcriptional expression levels (transcriptional
profiling) has become one of the most important methods in functional genomics.
Still, new measuring methods are needed to obtain more reliable, quantitative
data about transcription on a genomic scale. In this paper we concentrate on
certain computational optimization problems arising in the design of one such
novel method. From a computational point of view, the key feature of the new
method is that the hybridized probes are distinguished from each other by their
different sizes. Therefore the probes have to be assigned into pools such that
the probes in the same pool have unique sizes, different enough from each
other. Identification of expressed RNA is given by probe pool and probe size,
while quantification is given by the label of the probe, e.g. fluorescence.

s: We show how to computationally find the probes and assign them into
pools for a whole genome such that i) each gene has a specific probe suitable
for amplification and hybridization, and ii) the expression level measurement
can be done in a minimal numb
er of pools separable by electrophoresis in order
to minimize the total experiment cost of the measurement. Our main result is a
time approximation algorithm for assigning the probes into pools. We
demonstrate the feasibility of the procedure by

selecting probes for the yeast
genome and assigning them into less than 100 pools. The probe sequences and
their assignment into pools are available for academic research on request from
the authors.
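The pool constraint above (probes in one pool must have sizes distinguishable by electrophoresis) can be illustrated with a greedy first-fit heuristic. This is a simplified stand-in for the paper's approximation algorithm, with `min_gap` an invented resolution parameter:

```python
def assign_pools(sizes, min_gap):
    """Greedy first-fit: process probe sizes in increasing order and place
    each probe in the first pool whose largest size is at least min_gap
    smaller; open a new pool if none qualifies. A heuristic sketch only."""
    pools = []  # each pool is a list of sizes, kept in increasing order
    for s in sorted(sizes):
        for pool in pools:
            if s - pool[-1] >= min_gap:
                pool.append(s)
                break
        else:
            pools.append([s])
    return pools
```

With probe sizes `[100, 103, 110, 120, 101]` and a 5-bp resolution, the heuristic needs three pools, since 100, 101 and 103 are mutually too close.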

Contact: Teemu.Kivioja@cs.Helsinki.FI

Paper25, Present


Representing Genetic Sequence Data for Pharmacogenomics: An Evolutionary
Approach Using Ontological and Relational Models


Daniel Rubin, Farhad Shafa, Diane Oliver, Michael Hewett, Teri Klein, Russ Altman


Motivation: The information model chosen to store biological data affects the
types of queries possible, database performance, and the difficulty of
updating that information model. Genetic sequence data for pharmacogenetics
studies can be complex, and the best information model to use may change over
time. As experimental and analytical methods change, and as biological
knowledge advances, the data storage requirements and types of queries needed
may also change.

Results: We developed a model for genetic sequence and polymorphism data, and
used XML schema to specify the elements and attributes required for this
model. We implemented this model as an ontology in a frame-based
representation and as a relational model in a database system. We collected
genetic data from two pharmacogenetics resequencing studies, and formulated
queries useful for analyzing these data. We compared the ontology and
relational models in terms of query complexity, performance, and difficulty in
changing the information model. Our results demonstrate the benefits of
evolving the schema for storing pharmacogenetics data: ontologies perform well
in early design stages, when the information model changes rapidly, and
simplify query formulation, while relational models offer improved query speed
once the information model and types of queries needed stabilize.

Availability: Our ontology and relational models are available at

Contact: rubin@smi.stanford.edu, russ.altman@stanford.edu, help@pharmgkb.org

Paper26, Presen


Evaluating Functional Network Inference Using Simulations of Complex Biological Systems


V. Anne Smith, Erich D. Jarvis, Alexander J. Hartemink


Motivation: Although many network inference algorithms have been presented in
the bioinformatics literature, no suitable approach has been formulated for
evaluating their effectiveness at recovering models of complex biological
systems from limited data. To overcome this limitation, we propose an approach
that evaluates network inference algorithms according to their ability to
recover a complex functional network from biologically reasonable simulated
data.

Results: We designed a simulator to generate data from a complex biological
system at multiple levels of organization: behavior, neural anatomy, brain
electrophysiology, and gene expression of songbirds. About 90% of the
simulated variables are unregulated by other variables in the system and are
included simply as distracters. We sampled the simulated data at intervals, as
one would sample from a biological system in practice, and then used the
sampled data to evaluate the effectiveness of an algorithm we developed for
functional network inference. We found that our algorithm is highly effective
at recovering the functional network structure of the simulated system,
including the irrelevance of unregulated variables, from sampled data alone.
To assess the reproducibility of these results, we tested our inference
algorithm on 50 separately simulated sets of data, and it consistently
recovered almost perfectly the complex functional network structure underlying
the simulated data. To our knowledge, this is the first approach for
evaluating the effectiveness of functional network inference algorithms at
recovering models from limited data. Our simulation approach also enables
researchers to design, a priori, experiments and data-collection protocols
that are amenable to functional network inference.

Availability: Source code and simulated data are available upon request.

Contact: amink@cs.duke.edu, asmith@neuro.duke.edu, jarvis@neuro.duke.edu

Paper27, Presentation33


The Pathway Tools Software


Peter Karp, Suzanne Paley, Pedro Romero


Motivation: Bioinformatics requires reusable software tools for creating model
organism databases (MODs).

Results: The Pathway Tools is a reusable, production-quality software
environment for creating a type of MOD called a Pathway/Genome Database
(PGDB). A PGDB such as EcoCyc (see ecocyc.org) integrates our evolving
understanding of the genes, proteins, metabolic network, and genetic network
of an organism. This paper provides an overview of the four main components of
the Pathway Tools. The PathoLogic component supports creation of new PGDBs
from the annotated genome of an organism. The Pathway/Genome Navigator
provides query, visualization, and Web-publishing services for PGDBs. The
Pathway/Genome Editors support interactive updating of PGDBs. The Pathway
Tools ontology defines the schema of PGDBs. The Pathway Tools makes use of the
Ocelot object database system for data management services for PGDBs. The
Pathway Tools has been used to build PGDBs for 13 organisms, within SRI and by
external users.

Availability: The software is freely available to academics and is available
for a fee to commercial institutions. Contact ptools-support@ai.sri.com for
information on obtaining the software.

Contact: pkarp@ai.sri.com

Paper28, Presentation34


Discovering regulatory and signaling circuits in molecular interaction networks


Trey Ideker, Owen Ozier, Benno Schwikowski, Andrew Siegel


Motivation: In model organisms such as yeast, large databases of
protein-protein and protein-DNA interactions have become an extremely
important resource for the study of protein function, evolution, and gene
regulatory dynamics. In this paper we demonstrate that by integrating these
interactions with widely available mRNA expression data, it is possible to
generate concrete hypotheses for the underlying mechanisms governing the
observed changes in gene expression. To perform this integration
systematically and at large scale, we introduce an approach for screening a
molecular interaction network to identify active subnetworks, i.e., connected
regions of the network that show significant changes in expression over
particular subsets of conditions. The method we present here combines a
rigorous statistical measure for scoring subnetworks with a search algorithm
for identifying subnetworks with high score.

Results: We evaluated our procedure on a small interaction network of 332
genes and a large network of 4160 genes containing all 7462 protein-protein
and protein-DNA interactions in the yeast public databases. In the case of the
small network, we identified five significant subnetworks that covered 41 out
of 77 (53%) of all significant changes in expression. Both network analyses
returned several top-scoring subnetworks with good correspondence to known
regulatory mechanisms in the literature. These results demonstrate how
large-scale genomic approaches may be used to uncover signaling and regulatory
pathways in a systematic, integrative fashion.
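The "rigorous statistical measure" mentioned above can be sketched as an aggregate z-score: convert each member gene's expression p-value to a z-score and normalize by the square root of the subnetwork size so scores are comparable across sizes. This follows the general recipe the abstract describes but omits the search algorithm, and the normalization shown is an assumption of this sketch:

```python
import math
from statistics import NormalDist

def subnet_score(pvalues):
    """Aggregate z-score for a candidate subnetwork: per-gene p-values are
    mapped to z-scores via the inverse normal CDF, summed, and divided by
    sqrt(k) so that subnetworks of different sizes are comparable."""
    z = [NormalDist().inv_cdf(1 - p) for p in pvalues]
    return sum(z) / math.sqrt(len(z))
```

A subnetwork of uniformly significant genes scores high; genes at p = 0.5 contribute nothing, so distracter-heavy regions score near zero.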

Availability: The methods presented in this paper are implemented in the
Cytoscape software package, which is available to the academic community at
Contact: trey@wi.mit.edu

Paper29, Presentation35


Modelling Pathways in E.coli from Time Series Expression Profiles


Irene Ong, David Page, Jeremy Glasner


Motivation: Cells continuously reprogram their gene expression network as they
move through the cell cycle or sense changes in their environment. In order to
understand the regulation of cells, time series expression profiles provide a
more complete picture than single time point expression profiles. Few analysis
techniques, however, are well suited to modeling such time series data.

Results: We describe an approach that naturally handles time series data, with
the capability of modeling causality, feedback loops, and environmental or
hidden variables, using a Dynamic Bayesian network. We also present a novel
way of combining prior biological knowledge and current observations to
improve the quality of analysis and to model interactions between sets of
genes rather than individual genes. Our approach is evaluated on time series
expression data measured in response to physiological changes that affect
tryptophan metabolism in E. coli. Results indicate that this approach is
capable of finding correlations between sets of related genes.

Contact: o

Paper30, Presentation36


Of Truth and Pathways: Chasing Bits of Information through Myriads of Articles


Michael Krauthammer, Carol Friedman, George Hripcsak, Ivan Iossifov, Andrey Rzhetsky


Knowledge of interactions between molecules in living cells is indispensable
for theoretical analysis and practical applications in modern genomics and
molecular biology. Building such networks relies on the assumption that the
correct molecular interactions are known or can be identified by reading a few
research articles. However, this assumption does not necessarily hold, as
truth is rather an emerging property based on many potentially conflicting
facts. This paper explores the processes of knowledge generation and
publishing in the molecular biology literature using modeling and analysis of
real molecular interaction data. The data analyzed in this article were
automatically extracted from 50,000 research articles in molecular biology
using a computer system called GeneWays, which contains a natural language
processing module. The paper indicates that the truthfulness of statements is
associated in the minds of scientists with the relative importance
(connectedness) of the substances under study, revealing a potential selection
bias in the reporting of research results. Aiming at understanding the
statistical properties of the life cycle of biological facts reported in
research articles, we formulate a stochastic model describing the generation
and propagation of knowledge about molecular interactions through scientific
publications. We hope that in the future such a model can be useful for
automatically producing consensus views of molecular interaction data.

Contact: ar345@columbia.edu

Paper31, Presentation37


A Fast and Robust Method to Infer and Characterize an Active Regulator Set for
Molecular Pathways


Dana Pe'er, Aviv Regev, Amos Tanay


Regulatory relations between genes are an important component of molecular
pathways. Here, we devise a novel global method that uses a set of gene
expression profiles to find a small set of relevant active regulators,
identifies the genes that they regulate, and automatically annotates them. We
show that our algorithm is capable of handling a large number of genes in a
short time and is robust to a wide range of parameters. We apply our method to
a combined dataset of S. cerevisiae expression profiles, and validate the
resulting model of regulation by cross-validation and extensive biological
analysis of the selected regulators and their derived annotations.

Paper32, Presentation39


Marginalized Kernels for Biological Sequences


Koji Tsuda, Taishin Kin, Kiyoshi Asai


Motivation: Kernel methods such as support vector machines require a kernel
between objects to be defined a priori. Several approaches have been proposed
for deriving kernels from probability distributions, e.g. the Fisher kernel.
However, a general methodology for designing a kernel is not fully developed.

Results: We propose a reasonable way of designing a kernel when objects are
generated from latent variable models (e.g. HMMs). First, a joint kernel is
designed for complete data, which include both visible and hidden variables.
Then a marginalized kernel for visible data is obtained by taking the
expectation with respect to the hidden variables. We show that the Fisher
kernel is a special case of marginalized kernels, which gives another
viewpoint on Fisher kernel theory. Although our approach can be applied to any
object, we particularly derive several marginalized kernels useful for
biological sequences (e.g. DNA and proteins). The effectiveness of
marginalized kernels is illustrated in the task of classifying bacterial
gyrase subunit B (gyrB) amino acid sequences.
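The construction described above can be sketched for a count kernel: compute the expected counts of (symbol, hidden state) pairs under a posterior over hidden states, then take the inner product of these expectations. The `toy_posterior` below is a fabricated stand-in for the forward-backward posteriors an HMM would supply:

```python
from collections import defaultdict

def expected_counts(seq, posterior):
    """E[count of each (symbol, hidden state) pair], where posterior(seq, i)
    returns p(h | seq, position i) as a dict {state: probability}."""
    counts = defaultdict(float)
    for i, sym in enumerate(seq):
        for state, p in posterior(seq, i).items():
            counts[(sym, state)] += p
    return counts

def marginalized_count_kernel(x, y, posterior):
    """Inner product of expected (symbol, state) counts of two sequences."""
    cx, cy = expected_counts(x, posterior), expected_counts(y, posterior)
    return sum(cx[k] * cy[k] for k in cx if k in cy)

# Toy posterior: hidden state 0 in the first half of a sequence, 1 after;
# a real application would plug in HMM posteriors here.
def toy_posterior(seq, i):
    return {0: 1.0} if i < len(seq) // 2 else {1: 1.0}
```

Under this toy posterior, "AB" and "BA" share no (symbol, state) pair, so their kernel value is zero even though they share symbols, which is exactly the extra discrimination hidden context provides.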

Contact: koji.

Paper33, Presentation40


A tree kernel to analyze phylogenetic profiles


Jean-Philippe Vert


Motivation: The phylogenetic profile of a protein is a string that encodes the
presence or absence of the protein in every fully sequenced genome. Because
proteins that participate in a common structural complex or metabolic pathway
are likely to evolve in a correlated fashion, the phylogenetic profiles of
such proteins are often "similar", or at least "related", to each other. The
question we address in this paper is the following: how can we measure the
"similarity" between two profiles, in an evolutionarily relevant way, in order
to develop efficient function prediction methods?

Results: We show how the profiles can be mapped to a high-dimensional vector
space which incorporates evolutionarily relevant information, and we provide
an algorithm to compute efficiently the inner product in that space, which we
call the tree kernel. The tree kernel can be used by any kernel-based analysis
method for classification or data mining of phylogenetic profiles. As an
application, a Support Vector Machine (SVM) trained to predict the functional
class of a gene from its phylogenetic profile is shown to perform better with
the tree kernel than with a naive kernel that does not include any information
about the phylogenetic relationships among species. Moreover, a kernel
principal component analysis (KPCA) of the phylogenetic profiles illustrates
the sensitivity of the tree kernel to evolutionarily relevant variations.
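For contrast with the tree kernel, the naive kernel mentioned above is just the inner product of binary profiles. A tiny sketch (encoding profiles as strings of 0/1 is an assumption of this illustration):

```python
import math

def naive_kernel(p, q):
    """Inner product of two phylogenetic profiles given as 0/1 strings:
    counts genomes in which both proteins are present. The paper's tree
    kernel instead computes an inner product in a feature space that encodes
    the phylogeny; this naive version ignores species relationships."""
    return sum(int(a) * int(b) for a, b in zip(p, q))

def cosine(p, q):
    """Normalized similarity derived from the naive kernel."""
    return naive_kernel(p, q) / math.sqrt(naive_kernel(p, p) * naive_kernel(q, q))
```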

Availability: All data and software used are freely and publicly available upon request.

Contact: Jean

Paper34, Presentation41


Statistically Based Postprocessing of Phylogenetic Analysis by Clustering


Cara Stockham, Li-San Wang, Tandy Warnow


Motivation: Phylogenetic analyses often produce thousands of candidate trees.
Biologists resolve the conflict by computing the consensus of these trees.
Single-tree consensus methods used as postprocessing can be unsatisfactory due
to their inherent limitations.

Results: In this paper we present an alternative approach: using clustering
algorithms on the set of candidate trees. We propose bicriterion problems, in
particular using the concept of information loss, and new consensus trees
called characteristic trees that minimize the information loss. Our empirical
study using four biological datasets shows that our approach provides a
significant improvement in information content, while adding only a small
amount of complexity. Furthermore, the consensus trees we obtain for each of
our large clusters are more resolved than the single-tree consensus trees. We
also provide some initial progress on theoretical questions that arise in this
work.
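The Robinson-Foulds distances that drive the clustering can be sketched, for rooted trees, as the symmetric difference of clade sets; the unrooted bipartition version computed by PAUP is analogous. The nested-tuple tree encoding here is an assumption of this sketch:

```python
def clades(tree):
    """Return (leaf set, set of clades) of a rooted tree given as nested
    tuples with string leaves, e.g. (("a", "b"), ("c", "d"))."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, inner = frozenset(), set()
    for child in tree:
        child_leaves, child_inner = clades(child)
        leaves |= child_leaves
        inner |= child_inner
    inner.add(leaves)  # this internal node's clade
    return leaves, inner

def rf_distance(t1, t2):
    """Symmetric-difference (Robinson-Foulds-style) distance between the
    clade sets of two rooted trees on the same leaves."""
    _, c1 = clades(t1)
    _, c2 = clades(t2)
    return len(c1 ^ c2)
```

A matrix of such pairwise distances over the candidate trees is what an agglomerative clustering procedure would consume.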

Availability: Software available upon request from the authors. The
agglomerative clustering is implemented using Matlab (MathWorks, 2000) with
the Statistics Toolbox. The Robinson-Foulds distance matrices and the strict
consensus trees are computed using PAUP (Swofford, 2001) and Daniel Huson's
tree library on Intel Pentium workstations running Debian Linux.

Contact: lisan@cs.utexas.edu, telephone: +1 (512) 232-7455, fax: +1 (512) 471

Supplementary Information: http://www.cs.utexas.edu/u

Paper35, Presentation42


Efficiently Detecting Polymorphisms During The Fragment Assembly Process


Daniel Fasulo, Aaron Halpern, Ian Dew, Clark Mobarry


Motivation: Current genomic sequence assemblers assume that the input data are
derived from a single, homogeneous source. However, recent whole-genome
shotgun sequencing projects have violated this assumption, resulting in input
fragments covering the same region of the genome whose sequences differ due to
polymorphic variation in the population. While single-nucleotide polymorphisms
(SNPs) do not pose a significant problem to state-of-the-art assembly methods,
these methods do not handle insertion/deletion (indel) polymorphisms of more
than a few bases.

Results: This paper describes an efficient method for detecting sequence
discrepancies due to polymorphism that avoids resorting to global use of more
costly, less stringent affine sequence alignments. Instead, the algorithm
determines the small set of fragments involved in each polymorphism and
performs more sophisticated alignments only among fragments in that set.
Results from the incorporation of this method into the Celera Assembler are
reported for the D. melanogaster, H. sapiens, and M. musculus genomes.

Availability: The method described herein does not constitute a stand-alone
software application, but is laid out in sufficient detail to be implemented
as a component of any genomic sequence assembler.

Contact: daniel.fasulo@ce

Paper36, Presentation43


Multiple Genome Rearrangement: A General Approach via the Evolutionary Genome Graph


Dmitry Korkin, Lev Goldfarb


Motivation: In spite of the well-known fact that genome rearrangements are to
be viewed in the light of the evolutionary relationships within and between
the species involved, no formal underlying framework based on evolutionary
considerations has been proposed for treating the questions arising in the
area. If such an underlying framework is provided, all the basic questions in
the area can be posed in a biologically more appropriate and useful form:
e.g., the similarity between two genomes can then be computed via the nearest
ancestor, rather than "directly", ignoring the evolutionary connections.

Results: We outline an evolution-based general framework for answering
questions related to multiple genome rearrangement. In the proposed model, the
evolutionary genome graph (EG-graph) encapsulates an evolutionary history of a
genome family. For the set of all EG-graphs, we introduce a family of
similarity measures, each defined via a set of genome transformations
associated with a particular EG-graph. Given a set of genomes, and restricting
ourselves to transpositions, we present an algorithm for constructing an
EG-graph. We also present experimental results in the form of an EG-graph for
a set of concrete genomes (for several species). This EG-graph turns out to be
very close to the corresponding known phylogenetic tree.

Contact: dkorkin@unb.ca

Paper37, Presentation44


Efficient multiple genome alignment


Michael Hoehl, Stefan Kurtz, Enno Ohlebusch


Motivation: To allow a direct comparison of the genomic DNA sequences of
sufficiently similar organisms, there is an urgent need for software tools
that can align more than two genomic sequences.

Results: We developed new algorithms and a software tool "Multiple Genome
Aligner" (MGA for short) that efficiently computes multiple genome alignments
of large, closely related DNA sequences. For example, it can align 85% of the
complete genomes of six human adenoviruses (average length 35,305 bp) in 159
seconds. An alignment of 74% of the complete genomes of three strains of E.
coli (lengths: 5,528,445; 5,498,450; 4,639,221 bp) is produced in 30 minutes.

Availability: The software MGA is available free of charge for non-commercial
research institutions. For details see http://bibiserv.techfak.uni

Contact: {kurtz, enno}@techfa

Paper38, Presentation46


PSEUDOVIEWER: Automatic Visualization of RNA Pseudoknots


Kyungsook Han, Yujin Lee, Wootaek Kim


Motivation: Several algorithms have been developed for drawing RNA secondary
structures; however, none of these can be used to draw RNA pseudoknot
structures. In the sense of graph theory, a drawing of RNA secondary
structures is a tree, whereas a drawing of RNA pseudoknots is a graph with
inner cycles within a pseudoknot as well as possible outer cycles formed
between a pseudoknot and other structural elements. Thus, RNA pseudoknots are
more difficult to visualize than RNA secondary structures. Since no automatic
method for drawing RNA pseudoknots exists, visualizing RNA pseudoknots relies
on a significant amount of manual work and does not yield satisfactory
results. The task of visualizing RNA pseudoknots by hand becomes more
challenging as the size and complexity of the RNA pseudoknots increase.

Results: We have developed a new representation and an algorithm for drawing
H-type pseudoknots with RNA secondary structures. Compared to existing
representations of H-type pseudoknots, the new representation ensures uniform
and clear drawings with no edge crossing for all H-type pseudoknots. To the
best of our knowledge, this is the first algorithm for automatically drawing
RNA pseudoknots with RNA secondary structures. The algorithm has been
implemented in a Java program, which can be executed on any computing system.
Experimental results demonstrate that the algorithm generates aesthetically
pleasing drawings of all H-type pseudoknots. The results have also shown that
the drawings have high readability, enabling the user to quickly and easily
recognize the whole RNA structure as well as the pseudoknots themselves.

Availability: Available on request from the corresponding author.

Contact: khan@inha.ac.kr

Paper39, Presentation47


A powerful non-homology method for the prediction of operons in prokaryotes


Gabriel Moreno-Hagelsieb, Julio Collado-Vides


Motivation: The prediction of the transcription unit organization of genomes
is an important clue in the inference of functional relationships of genes,
the interpretation and evaluation of transcriptome experiments, and the
overall inference of the regulatory networks governing the expression of genes
in response to the environment. Though several methods have been devised to
predict operons, most require the genome analyzed to be extensively
characterized. Log-likelihoods derived from intergenic distance distributions
work surprisingly well to predict operons in Escherichia coli and are
available for any genome as soon as the gene sets are predicted.

Results: Here we provide evidence that the very same method is applicable to
any prokaryotic genome. First, the method has the same efficiency when
evaluated using a collection of experimentally known operons of Bacillus
subtilis. Second, operons among most if not all prokaryotes seem to have the
same tendency to keep short distances between their genes, the most frequent
distances being overlaps of four and one base pairs. The universality of this
structural feature allows us to predict the organization of transcription
units in all prokaryotes. Third, predicted operons contain a higher proportion
of genes with related phylogenetic profiles and conservation of adjacency than
predicted borders of transcription units.
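The log-likelihood idea above can be sketched directly: bin the intergenic distance of an adjacent gene pair and compare its frequency among known same-operon pairs versus transcription-unit boundary pairs. The histograms below are toy numbers, not the paper's E. coli estimates; note that negative distances encode the 1- and 4-bp overlaps mentioned above:

```python
import math

def log_likelihood(distance, operon_hist, boundary_hist, bin_size=10):
    """Log-likelihood ratio that an adjacent gene pair at the given
    intergenic distance (bp, negative for overlaps) lies in the same operon,
    from binned frequency histograms {bin_index: probability}. Unseen bins
    fall back to a small pseudo-probability."""
    b = distance // bin_size  # floor division also bins negative distances
    p_op = operon_hist.get(b, 1e-6)
    p_bd = boundary_hist.get(b, 1e-6)
    return math.log(p_op / p_bd)

def same_operon(distance, operon_hist, boundary_hist):
    return log_likelihood(distance, operon_hist, boundary_hist) > 0

# Toy distributions: operon pairs cluster at overlaps/short gaps,
# boundary pairs at longer gaps.
operon_hist = {-1: 0.5, 0: 0.3, 1: 0.1, 2: 0.1}
boundary_hist = {0: 0.1, 5: 0.2, 10: 0.3, 20: 0.4}
```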

Contact: moreno@cifn.unam.mx

Supplementary information: Additional materials and graphs are available at:

Paper40, Presentation48


Identifying Operons and Untranslated Regions of Transcripts Using Escherichia
coli RNA Expression Analysis


Brian Tjaden, David Haynor, Sergey Stolyar, Carsten Rosenow, Eugene



Microarrays traditionally have been used to assay the transcript expression of
coding regions of genes. Here, we use Escherichia coli oligonucleotide
microarrays to assay transcript expression of both open reading frames (ORFs)
and intergenic regions. We then use hidden Markov models to analyze this
expression data and estimate transcription boundaries of genes. This approach
allows us to identify 5′ untranslated regions (UTRs) of transcripts as well as
genes which are likely to be operon members. The operon elements we identify
correspond to documented operons with 99% specificity and 63% sensitivity.
Similarly, we find that our 5′ UTR results accurately coincide with
experimentally verified promoter regions for most genes.
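The HMM analysis described above can be sketched with a two-state (transcribed/untranscribed) model decoded by Viterbi over discretized probe intensities; transitions out of the transcribed state then mark estimated transcription boundaries. All parameters below are illustrative, not the paper's fitted values:

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Standard Viterbi decoding in log space over discrete observations."""
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            row[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy model: 'T' = transcribed, 'U' = untranscribed; observations are
# discretized probe intensities ('hi'/'lo'). Sticky transitions favor
# contiguous transcribed segments.
states = ("U", "T")
start = {"U": 0.9, "T": 0.1}
trans = {"U": {"U": 0.9, "T": 0.1}, "T": {"U": 0.1, "T": 0.9}}
emit = {"U": {"hi": 0.1, "lo": 0.9}, "T": {"hi": 0.9, "lo": 0.1}}
```

Decoding a run of high-intensity probes flanked by low ones recovers one contiguous transcribed segment, whose edges stand in for the transcription boundaries.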

Contact: tjaden

Paper41, Presentation49


Detecting Recombination with MCMC


Dirk Husmeier, Grainne McGuire


Motivation: We present a statistical method for detecting recombination, whose
objective is to accurately locate the recombinant breakpoints in DNA sequence
alignments of small numbers of taxa (4 or 5). Our approach explicitly models
the sequence of phylogenetic tree topologies along a multiple sequence
alignment. Inference under this model is done in a Bayesian way, using Markov
chain Monte Carlo (MCMC). The algorithm returns the site-dependent posterior
probability of each tree topology, which is used for detecting recombinant
regions and locating their breakpoints.

Results: The method was tested on a synthetic and three real DNA sequence
alignments, where it was found to outperform established detection methods.

Availability: The algorithm has been implemented in the C++ program package
BARCE, which is freely available from http://www.bi

Contact: dirk@bioss.ac.uk

Paper42, Presentation50


Finding Composite Regulatory Patterns in DNA Sequences


Eleazar Eskin, Pavel Pevzner


Motivation: Pattern discovery in unaligned DNA sequences is a fundamental
problem in computational biology, with important applications in finding
regulatory signals. Current approaches to pattern discovery focus on monad
patterns that correspond to relatively short contiguous strings. However, many
of the actual regulatory signals are composite patterns: groups of monad
patterns that occur near each other. A difficulty in discovering composite
patterns is that one or both of the component monad patterns in the group may
be "too weak". Since traditional monad-based motif finding algorithms usually
output one (or a few) high-scoring patterns, they often fail to find composite
regulatory signals consisting of weak monad parts. In this paper, we present
MITRA (MIsmatch TRee Algorithm), an approach for discovering composite
signals. We demonstrate that MITRA performs well for both monad and composite
patterns by presenting experiments on biological and synthetic data.
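MITRA's mismatch tree is not reproduced here; as a brute-force illustration of what a composite (dyad) signal is, the sketch below counts exact k-mer pairs co-occurring at a bounded gap across sequences (all names and parameters are invented, and real signals would additionally allow mismatches):

```python
from collections import Counter

def composite_counts(seqs, k=3, min_gap=2, max_gap=5):
    """Count composite (dyad) patterns: ordered pairs of exact k-mers in the
    same sequence separated by a gap in [min_gap, max_gap]. Each dyad is
    counted at most once per sequence, so counts measure sequence support."""
    counts = Counter()
    for s in seqs:
        seen = set()
        for i in range(len(s) - k + 1):
            left = s[i:i + k]
            for gap in range(min_gap, max_gap + 1):
                j = i + k + gap
                if j + k <= len(s):
                    seen.add((left, gap, s[j:j + k]))
        counts.update(seen)
    return counts
```

On toy sequences sharing an AAA...GGG dyad at a 2-bp gap, that dyad is the unique pattern supported by both sequences, even though each monad alone also occurs elsewhere.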

Availability: MITRA is available at http://www.cs.columbia.edu/compbio/mitra/