TITLES, AUTHORS, ABSTRACTS FOR ISMB02


Paper01, Presentation02


Title:

Mining viral protease data to extract cleavage knowledge


Authors:

Ajit Narayanan, Xikun Wu and Z. Rong Yang


Abstract:

Motivation: The motivation is to identify, through machine learning techniques, specific patterns in HIV and HCV viral polyprotein amino acid residues where viral protease cleaves the polyprotein as it leaves the ribosome. An understanding of viral protease specificity may help the development of future anti-viral drugs involving protease inhibitors by identifying specific features of protease activity for further experimental investigation. While viral sequence information is growing at a fast rate, there is still comparatively little understanding of how viral polyproteins are cut into their functional unit lengths. The aim of the work reported here is to investigate whether it is possible to generalise from known cleavage sites to unknown cleavage sites for two specific viruses - HIV and HCV. An understanding of proteolytic activity for specific viruses will contribute to our understanding of viral protease function in general, thereby leading to a greater understanding of protease families and their substrate characteristics.



Results: Our results show that artificial neural networks and symbolic learning techniques (See5) capture some fundamental and new substrate attributes, but neural networks outperform their symbolic counterpart.
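
No encoding or window size is given in the abstract; purely as a hedged illustration of the kind of input representation often used when learning protease cleavage specificity (the eight-residue window, one-hot encoding, toy octamers and classifier choice below are assumptions, not the authors' setup), here is a minimal sketch:

```python
# Hedged sketch: one-hot encode candidate cleavage windows and fit a small
# classifier. Window length, toy data and model choice are assumptions, not
# the representation used in the paper.
import numpy as np
from sklearn.neural_network import MLPClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(window):
    """Encode an 8-residue window as a flat 8x20 binary vector."""
    vec = np.zeros((len(window), len(AMINO_ACIDS)))
    for pos, aa in enumerate(window):
        vec[pos, AA_INDEX[aa]] = 1.0
    return vec.ravel()

# Toy octamers labelled 1 if the protease is assumed to cleave in the middle.
windows = ["SQNYPIVQ", "ARVLAEAM", "TLNFPISP", "GAAGAAGA"]
labels = [1, 1, 1, 0]

X = np.array([one_hot(w) for w in windows])
clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
clf.fit(X, np.array(labels))
print(clf.predict(X))
```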


Availability: Publicly available software was used (Stuttgart Neural Network Simulator - http://www-ra.informatik.uni-tuebingen.de/SNNS/, and See5 - http://www.rulequest.com). The datasets used (HIV, HCV) for See5 are available at: http://www.dcs.ex.ac.uk/~anarayan/bioinf/ismbdatasets/




Contact: a.narayanan@ex.ac.uk, z.r.yang@ex.ac.uk


Paper02, Presentation03


Title:

The metric space of proteins - comparative study of clustering algorithms


Authors:

Ori Sasson, Nati Linial, Michal Linial


Abstract:

Motivation: A large fraction of biological research concentrates on individual proteins and on small families of proteins. One of the current major challenges in bioinformatics is to extend our knowledge also to very large sets of proteins. Several major projects have tackled this problem. Such undertakings usually start with a process that clusters all known proteins or large subsets of this space. Some work in this area is carried out automatically, while other attempts incorporate expert advice and annotation.

Results: We propose a novel technique that automatically clusters protein sequences. We consider all proteins in SWISSPROT, and carry out an all-against-all BLAST similarity test among them. With this similarity measure in hand we proceed to perform a continuous bottom-up clustering process by applying alternative rules for merging clusters. The outcome of this clustering process is a classification of the input proteins into a hierarchy of clusters of varying degrees of granularity. Here we compare the clusters that result from alternative merging rules, and validate the results against InterPro.

Our preliminary results show that clusters that are consistent with several rather than a single merging rule tend to comply with InterPro annotation. This is an affirmation of the view that the protein space consists of families that differ markedly in their evolutionary conservation.
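
As a rough sketch of the bottom-up clustering step with alternative merging rules (the toy similarity matrix and the similarity-to-distance conversion are assumptions; the paper works with all-against-all BLAST scores over SWISSPROT):

```python
# Hedged sketch: hierarchical (bottom-up) clustering of proteins from a
# pairwise similarity matrix, comparing alternative merging (linkage) rules.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy symmetric similarity matrix for 5 "proteins" (higher = more similar).
S = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.3, 0.2],
    [0.1, 0.2, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.2, 0.8, 1.0],
])
D = 1.0 - S                              # turn similarities into distances
condensed = squareform(D, checks=False)

for rule in ("single", "average", "complete"):   # alternative merging rules
    Z = linkage(condensed, method=rule)
    clusters = fcluster(Z, t=2, criterion="maxclust")
    print(rule, clusters)
```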

Availability: The outcome of these investigations can be viewed in an
interactive Web site at http://www.protonet.cs.huji.ac.il.

Supplementary information: Biological examples for comparing the performance of
the different algorithms used for classification are presented in
http://www.protonet.cs.huji.ac.il/examples.html.

Contact: ori@cs.huji.ac.il


Paper03, Presentation04


Title:

DNA sequence and structure: direct and indirect recognition in protein-DNA binding

Authors:

Nicholas Steffen, Scott Murphy, Lorenzo Tolleri, Wesley Hatfield, Richard Lathrop


Abstract:

Motivation: Direct recognition, or direct readout, of DNA bases by a DNA-binding protein involves amino acids that interact directly with features specific to each base. Experimental evidence also shows that in many cases the protein achieves partial sequence specificity by indirect recognition, i.e., by recognizing structural properties of the DNA. (1) Could threading a DNA sequence onto a crystal structure of bound DNA help explain the indirect recognition component of sequence specificity? (2) Might the resulting pure-structure computational motif manifest itself in familiar sequence-based computational motifs?

Results: The starting structure motif was a crystal structure of DNA bound to the integration host factor protein (IHF) of E. coli. IHF is known to exhibit both direct and indirect recognition of its binding sites. (1) Threading DNA sequences onto the crystal structure showed statistically significant partial separation of 60 IHF binding sites from random and intragenic sequences and was positively correlated with binding affinity. (2) The crystal structure was shown to be equivalent to a linear Markov network, and so, to a joint probability distribution over sequences, computable in linear time. It was transformed algorithmically into several common pure-sequence representations, including (a) small sets of short exact strings, (b) weight matrices, (c) consensus regular patterns, (d) multiple sequence alignments, and (e) phylogenetic trees. In all cases the pure-sequence motifs retained statistically significant partial separation of the IHF binding sites from random and intragenic sequences. Most exhibited positive correlation with binding affinity. The multiple alignment showed some conserved columns, and the phylogenetic tree partially mixed low-energy sequences with IHF binding sites but separated high-energy sequences. The conclusion is that deformation energy explains part of indirect recognition, which explains part of IHF sequence-specific binding.
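
The statement that the structure is equivalent to a linear Markov network, and hence to a joint distribution over sequences computable in linear time, can be illustrated with a generic first-order chain (the uniform probabilities below are placeholders, not the structure-derived model):

```python
# Hedged sketch: a first-order (linear) Markov model over DNA defines a joint
# probability over sequences that can be evaluated in time linear in length.
import math

BASES = "ACGT"
initial = {b: 0.25 for b in BASES}
# Uniform transitions as placeholders; a real model would be derived from the
# structure-based deformation energies.
transition = {a: {b: 0.25 for b in BASES} for a in BASES}

def log_prob(seq):
    """Log joint probability of seq under the chain, computed in O(len(seq))."""
    lp = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(transition[prev][cur])
    return lp

print(log_prob("AATTGCGCAATT"))
```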

Availability: Code and data on request.

Contact: Nick Steffen for code and Lorenzo Tolleri for data. nsteffen@uci.edu ,
Tolleri@chiron.it


Paper04, Presentation05


Title:

Beyond tandem repeats: complex pattern structures and distant regions of
similarity


Authors:

Amy M. Hauth, Deborah A. Joseph


Abstract:

Motivation: Tandem repeats (TRs) are associated with human disease, play a role in evolution and are important in regulatory processes. Despite their importance, locating and characterizing these patterns within anonymous DNA sequences remains a challenge. In part, the difficulty is due to imperfect conservation of patterns and complex pattern structures. We study recognition algorithms for two complex pattern structures: variable length tandem repeats (VLTRs) and multi-period tandem repeats (MPTRs).

Results: We extend previous algorithmic research to a class of regular tandem repeats (RegTRs). We formally define RegTRs, as well as two important subclasses: VLTRs and MPTRs. We present algorithms for identification of TRs in these classes. Furthermore, our algorithms identify degenerate VLTRs and MPTRs: repeats containing substitutions, insertions and deletions. To illustrate our work, we present results of our analysis for two difficult regions in cattle and human which reflect practical occurrences of these subclasses in GenBank sequence data. In addition, we show the applicability of our algorithmic techniques for identifying Alu sequences, gene clusters and other distant regions of similarity. We illustrate this with an example from yeast chromosome I.

Availability: Algorithms can be accessed at http://www.cs.wisc.edu/areas/theory.

Contact: Amy M. Hauth (kryder@cs.wisc.edu, 608-831-2164) or Deborah A. Joseph (joseph@cs.wisc.edu, 608-262-8022), FAX: 608-262-9777.


Paper05, Presentation06


Title:

The POPPs: Clustering and Searching Using Peptide Probability Profiles


Author:

Michael J. Wise


Abstract:

The POPPs is a suite of inter-related software tools which allow the user to discover what is statistically "unusual" in the composition of an unknown protein, or to automatically cluster proteins into families based on peptide composition. Finally, the user can search for related proteins based on peptide composition. Statistically based peptide composition provides a view of proteins that is, to some extent, orthogonal to that provided by sequence. In a test study, the POPP suite is able to regroup into their families sets of approximately 100 randomised Pfam protein domains. The POPPs suite is used to explore the diverse set of late embryogenesis abundant (LEA) proteins.
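
As a hedged illustration of what a peptide-composition view of a protein can look like (the peptide length k and the toy sequence are assumptions; the POPP statistics and clustering are not reproduced here):

```python
# Hedged sketch: counting short-peptide composition for a protein sequence.
from collections import Counter

def peptide_composition(seq, k=2):
    """Return counts of overlapping k-mers (short peptides) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # toy protein sequence
comp = peptide_composition(seq, k=2)
print(comp.most_common(5))
```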

Availability: Contact the author.

Contact: mw263@cam.ac.uk


Paper06, Presentation08


Title:

A sequence profile based HMM for predicting and discriminating beta barrel
membrane proteins


Authors:

Pier Luigi Martelli, Piero Fariselli, Anders Krogh, Rita Casadio


Abstract:


Motivation: Membrane proteins are an abundant and functionally relevant subset of proteins that putatively include from about 15 up to 30% of the proteome of organisms fully sequenced. These estimates are mainly computed on the basis of sequence comparison and membrane protein prediction. It is therefore urgent to develop methods capable of selecting membrane proteins, especially in the case of outer membrane proteins, barely taken into consideration when proteome wide analysis is performed. This will also help protein annotation when no homologous sequence is found in the database. Outer membrane proteins solved so far at atomic resolution interact with the external membrane of bacteria with a characteristic β barrel structure comprising different even numbers of β strands (β barrel membrane proteins). In this they differ from the membrane proteins of the cytoplasmic membrane endowed with alpha helix bundles (all alpha membrane proteins) and need specialised predictors.

Results: We develop an HMM model, which can predict the topology of β barrel membrane proteins, using as input evolutionary information. The model is cyclic with 6 types of states: two for the β strand transmembrane core, one for the β strand cap on either side of the membrane, one for the inner loop, one for the outer loop and one for the globular domain state in the middle of each loop. The development of a specific input for HMM based on multiple sequence alignment is novel. The accuracy per residue of the model is 82% when a jack-knife procedure is adopted. With a model optimisation method using a dynamic programming algorithm, seven topological models out of the twelve proteins included in the testing set are also correctly predicted. When used as a discriminator, the model is rather selective. At a fixed probability value, it retains 84% of a non-redundant set comprising 145 sequences of well-annotated outer membrane proteins. Concomitantly, it correctly rejects 90% of a set of globular proteins including about 1200 chains with low sequence identity (< 30%) and 90% of a set of all alpha membrane proteins, including 188 chains.

Availability: The program will be available on request from the authors.

Contact: gigi@lipid.biocomp.unibo.it, www.biocomp.unibo.it


Paper07, Presentation09


Title:

Fully automated ab initio protein structure prediction using I-SITES, HMMSTR and ROSETTA


Authors:

Christopher Bystroff, Yu Shao


Abstract:

Motivation: The Monte Carlo fragment insertion method for protein tertiary structure prediction (ROSETTA) of Baker and others has been merged with the I-SITES library of sequence structure motifs and the HMMSTR model for local structure in proteins, to form a new public server for the ab initio prediction of protein structure. The server performs several tasks in addition to tertiary structure prediction, including a database search, amino acid profile generation, fragment structure prediction, and backbone angle and secondary structure prediction. Meeting reasonable service goals required improvements in efficiency, in particular for the ROSETTA algorithm.

Results: The new server was used for blind predictions of 40 protein sequences as part of the CASP4 blind structure prediction experiment. The results for 31 of those predictions are presented here. 61% of the residues overall were found in topologically correct predictions, which are defined as fragments of 30 residues or more with a root-mean-square deviation in superimposed alpha carbons of less than 6Å. HMMSTR 3-state secondary structure predictions were 73% correct overall. Tertiary structure predictions did not improve the accuracy of secondary structure prediction.

Availability: The server is accessible through the web at http://isites.bio.rpi.edu/hmmstr/index.html. Programs are available upon request for academics. Licensing agreements are available for commercial interests.

Supplementary information: http://isites.bio.rpi.edu,
http://predictioncenter.llnl.gov/casp4/

Contacts: bystrc@rpi.edu, shaoy@rpi.edu


Paper08, Presentation10


Title:

Prediction of Contact Maps by GIOHMMs and Recurrent Neural Networks Using
Lateral Propagation From All Four Cardinal Corners


Authors:

Gianluca Pollastri, Pierre Baldi


Abstract:

Motivation: Accurate prediction of protein contact maps is an important step in computational structural proteomics. Because contact maps provide a translation and rotation invariant topological representation of a protein, they can be used as a fundamental intermediary step in protein structure prediction.

Results: We develop a new set of flexible machine learning architectures for the prediction of contact maps, as well as other information processing and pattern recognition tasks. The architectures can be viewed as recurrent neural network parameterizations of a class of Bayesian networks we call generalized input-output HMMs. For the specific case of contact maps, contextual information is propagated laterally through four hidden planes, one for each cardinal corner. We show that these architectures can be trained from examples and yield contact map predictors that outperform previously reported methods. While several extensions and improvements are in progress, the current version can accurately predict 60.5% of contacts at a distance cutoff of 8Å and 45% of distant contacts at 10Å, for proteins of length up to 300.
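
For readers unfamiliar with the target representation, a contact map at the 8Å cutoff mentioned above can be computed directly from alpha-carbon coordinates; the random coordinates below stand in for a real structure, and the predictor itself is of course not shown:

```python
# Hedged sketch: a boolean contact map at an 8 Angstrom cutoff from CA coords.
import numpy as np

rng = np.random.default_rng(0)
ca_coords = rng.uniform(0, 50, size=(100, 3))   # fake CA coordinates, 100 residues

# Pairwise distances and the resulting contact map.
diff = ca_coords[:, None, :] - ca_coords[None, :, :]
dist = np.sqrt((diff ** 2).sum(-1))
contact_map = dist < 8.0                        # 8 Angstrom cutoff, as in the abstract

print(contact_map.shape, int(contact_map.sum()))
```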

Availability and Contact: The contact map predictor will be made available through http://promoter.ics.uci.edu/BRNN-PRED/ as part of an existing suite of proteomics predictors.

Email: {gpollast,pfbaldi}@ics.uci.edu


Paper09, Presentation11


Title:

Rate4Site: An Algorithmic Tool for the Identification of Functional Regions on
Proteins by Surface Mapping of Evolutionary Determinants within Their Homologues


Authors:

Tal Pupko, Rachel Bell, Itay Mayrose, Fabian Glaser, Nir Ben-Tal


Abstract:

Motivation: A number of proteins of known three-dimensional (3D) structure exist, with yet unknown function. In light of the recent progress in structure determination methodology, this number is likely to increase rapidly. A novel method is presented here: "Rate4Site", which maps the rate of evolution among homologous proteins onto the molecular surface of one of the homologues whose 3D-structure is known. Functionally important regions correspond to surface patches of slowly evolving residues.

Results: Rate4Site estimates the rate of evolution of amino acid sites using the maximum likelihood (ML) principle. The ML estimate of the rates considers the topology and branch lengths of the phylogenetic tree, as well as the underlying stochastic process. To demonstrate its potency, we study the Src SH2 domain. Like previously established methods, Rate4Site detected the SH2 peptide-binding groove. Interestingly, it also detected inter-domain interactions between the SH2 domain and the rest of the Src protein that other methods failed to detect.

Availability: Rate4Site can be downloaded at: http://ashtoret.tau.ac.il/

Contact: tal@ism.ac.jp; rebell@ashtoret.tau.ac.il; fabian@ashtoret.tau.ac.il; bental@ashtoret.tau.ac.il

Supplementary Information: multiple sequence alignment of homologous domains
from the SH2 protein family, the corresponding phylogenetic tree and additional
examples are available at http://ashtoret.tau.ac.il/~rebell


Paper10, Presentation12


Title:

Inferring sub-cellular localization through automated lexical analysis


Authors:

Rajesh Nair, Burkhard Rost


Abstract:

Motivation: The SWISS-PROT sequence database contains keywords of functional annotations for many proteins. In contrast, information about the sub-cellular localization is only available for few proteins. Experts can often infer localization from keywords describing protein function. We developed LOCkey, a fully automated method for lexical analysis of SWISS-PROT keywords that assigns sub-cellular localization. With the rapid growth in sequence data, the biochemical characterisation of sequences has been falling behind. Our method may be a useful tool for supplementing functional information already automatically available.

Results: The method reached a level of more than 82% accuracy in a full cross-validation test. Due to a lack of functional annotations, we could infer localization for less than half of all proteins in SWISS-PROT. We applied LOCkey to annotate five entirely sequenced proteomes, namely Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fly), Arabidopsis thaliana (plant) and a subset of all human proteins. LOCkey found about 8000 new annotations of sub-cellular localization for these eukaryotes.

Availability: Annotations of localization for eukaryotes at:
http://cubic.bioc.columbia.edu/services/LOCkey.

Contact: rost@columbia.edu



Paper11, Presentation14


Title:

Support vector regression applied to the determination of the developmental age of a Drosophila embryo from its segmentation gene expression patterns


Authors:

Ekaterina Myasnikova, Anastassia Samsonova, John Reinitz, Maria Samsonova


Abstract:

Motivation: In this paper we address the problem of
the determination of
developmental age of an embryo from its segmentation gene expression patterns in
Drosophila.

Results: By applying support vector regression we have developed a fast method for automated staging of an embryo on the basis of its gene expression pattern. Support vector regression is a statistical method for creating regression functions of arbitrary type from a set of training data. The training set is composed of embryos for which the precise developmental age was determined by measuring the degree of membrane invagination. Testing the quality of regression on the training set showed good prediction accuracy. The optimal regression function was then used for the prediction of the gene expression based age of embryos in which the precise age has not been measured by membrane morphology. Moreover, we show that the same accuracy of prediction can be achieved when the dimensionality of the feature vector was reduced by applying factor analysis. The data reduction allowed us to avoid over-fitting and to increase the efficiency of the algorithm.
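
A generic support vector regression fit of the kind described, on synthetic data (the features, kernel and parameters below are assumptions, not the paper's pipeline), looks like this:

```python
# Hedged sketch: support vector regression on synthetic expression features.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))                            # 60 embryos, 8 features
age = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)    # synthetic "age" signal

model = SVR(kernel="rbf", C=10.0, epsilon=0.05)
model.fit(X, age)
print(model.predict(X[:3]), age[:3])
```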

Availability: This software may be obtained from the authors.

Contact: samson@fn.csa.ru


Paper12, Presentation15


Title:

Variance stabilization applied to microarray data calibration and to the quantification of differential expression


Authors:

Wolfgang Huber, Anja von Heydebreck, Holger Sueltmann, Annemarie Poustka, Martin
Vingron


Abstract:

We introduce a statistical model for microarray gene expression data that comprises data calibration, the quantification of differential expression, and the quantification of measurement error. In particular, we derive a transformation h for intensity measurements, and a difference statistic Δh whose variance is approximately constant along the whole intensity range. This forms a basis for statistical inference from microarray data, and provides a rational data pre-processing strategy for multivariate analyses. For the transformation h, the parametric form h(x) = arsinh(a+bx) is derived from a model of the variance-versus-mean dependence for microarray intensity data, using the method of variance stabilizing transformations. For large intensities, h coincides with the logarithmic transformation, and Δh with the log-ratio. The parameters of h together with those of the calibration between experiments are estimated with a robust variant of maximum-likelihood estimation. We demonstrate our approach on data sets from different experimental platforms, including two-color cDNA arrays and a series of Affymetrix oligonucleotide arrays.
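
A small numerical sketch of the stated transformation h(x) = arsinh(a+bx) and of its asymptotic agreement with the logarithm (the parameter values are illustrative; in the method they are estimated from the data):

```python
# Hedged sketch: the arsinh-based variance-stabilizing transformation and its
# log-like behaviour at high intensities. Parameters a, b are illustrative.
import numpy as np

a, b = 1.0, 0.02

def h(x):
    return np.arcsinh(a + b * x)

x = np.array([10.0, 100.0, 1_000.0, 100_000.0])
print(h(x))
# For large x, arsinh(a + b*x) ~ log(x) + log(2*b), i.e. log plus a constant:
print(h(x) - np.log(x))   # differences approach a constant as x grows
```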

Availability: Software is freely available for academic use as an R package at
http://www.dkfz.de/abt0840/whuber

Contact: w.huber@dkfz.de


Paper13, Presentation16


Title:

Binary Tree-Structured Vector Quantization Approach to Clustering and Visualizing Microarray Data


Authors:

M. Sultan, D. Wigle, CA Cumbaa, M. Maziarz, Janice Glasgow, M.-S. Tsao and Igor Jurisica


Abstract:

Motivation: With the increasing number of gene expression databases, the need for more powerful analysis and visualization tools is growing. Many techniques have successfully been applied to unravel latent similarities among genes and/or experiments. Most of the current systems for microarray data analysis use statistical methods, hierarchical clustering, self-organizing maps, support vector machines, or k-means clustering to organize genes or experiments into "meaningful" groups. Without prior explicit bias almost all of these clustering methods applied to gene expression data not only produce different results, but may also produce clusters with little or no biological relevance. Of these methods, agglomerative hierarchical clustering has been the most widely applied, although many limitations have been identified.

Results: Starting with a systematic comparison of the underlying theories behind clustering approaches, we have devised a technique that combines tree-structured vector quantization and partitive k-means clustering (BTSVQ). This hybrid technique has revealed clinically relevant clusters in three large publicly available data sets. In contrast to existing systems, our approach is less sensitive to data preprocessing and data normalization. In addition, the clustering results produced by the technique have strong similarities to those of self-organizing maps (SOMs). We discuss the advantages and the mathematical reasoning behind our approach.

Availability: The BTSVQ system is implemented in Matlab R12 using the SOM toolbox for the visualization and preprocessing of the data (http://www.cis.hut.fi/projects/somtoolbox/). BTSVQ is available for non-commercial use (http://www.uhnres.utoronto.ca/ta3/BTSVQ).

Contact: ij@uhnres.utoronto.ca


Paper14, Presentation17


Title:

A Variance-Stabilizing Transformation for Gene-Expression Microarray Data


Authors:

Blythe Durbin, Johanna Hardin, Douglas Hawkins, David Rocke


Abstract:

Motivation: Standard statistical techniques often assume that data are normally distributed, with constant variance not depending on the mean of the data. Data that violate these assumptions can often be brought in line with the assumptions by application of a transformation. Gene-expression microarray data have a complicated error structure, with a variance that changes with the mean in a non-linear fashion. Log transformations, which are often applied to microarray data, can inflate the variance of observations near background.

Results: We introduce a transformation that stabilizes the variance of microarray data across the full range of expression. Simulation studies also suggest that this transformation approximately symmetrizes microarray data.

Contact: bpdurbin@wald.ucdavis.edu


Paper15, Presentation18


Title:

Linking Gene Expression Data with Patient Survival Times Using Partial Least
Squares


Authors:

Peter Park, Lu Tian, Isaac Kohane


Abstract:

There is an increasing need to link the large amount of genotypic data, gathered using microarrays for example, with various phenotypic data from patients. The classification problem in which gene expression data serve as predictors and a class label phenotype as the binary outcome variable has been examined extensively, but there has been less emphasis in dealing with other types of phenotypic data. In particular, patient survival times with censoring are often not used directly as a response variable due to the complications that arise from censoring.

We show that the issues involving censored data can be circumvented by reformulating the problem as a standard Poisson regression problem. The procedure for solving the transformed problem is a combination of two approaches: partial least squares, a regression technique that is especially effective when there is severe collinearity due to a large number of predictors, and generalized linear regression, which extends standard linear regression to deal with various types of response variables. The linear combinations of the original variables identified by the method are highly correlated with the patient survival times and at the same time account for the variability in the covariates. The algorithm is fast, as it does not involve any matrix decompositions in the iterations. We apply our method to data sets from lung carcinoma and diffuse large B-cell lymphoma studies to verify its effectiveness.
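
As a hedged sketch of the two named ingredients only, not the authors' exact algorithm: partial least squares components from the expression matrix fed into a Poisson generalized linear model, with the event indicator as response and log follow-up time as offset (the data and dimensions below are synthetic assumptions):

```python
# Hedged sketch: PLS dimension reduction followed by Poisson regression with
# an offset, on synthetic data. Not the paper's exact procedure.
import numpy as np
import statsmodels.api as sm
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n_patients, n_genes = 80, 500
X = rng.normal(size=(n_patients, n_genes))              # expression matrix
time = rng.exponential(scale=np.exp(-0.5 * X[:, 0]))    # follow-up times
event = rng.integers(0, 2, size=n_patients)             # 1 = death observed, 0 = censored

# Reduce the many collinear predictors to a few PLS components.
pls = PLSRegression(n_components=3).fit(X, event)
components = pls.transform(X)

# Poisson regression of the event indicator with log(time) as offset
# approximates an exponential survival model on the reduced covariates.
design = sm.add_constant(components)
fit = sm.GLM(event, design, family=sm.families.Poisson(),
             offset=np.log(time)).fit()
print(fit.params)
```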

Contact: peter_park@harvard.edu


Paper16, Presentation20


Title:

Microarray Synthesis through Multiple-Use PCR Primer Design


Authors:

Rohan Fernandes, Steven Skiena


Abstract:

A substantial percentage of the expense in constructing full
-
genome spotted
microarrays comes from the cost of synthesizing the PCR

primers to amplify the
desired DNA. We propose a computationally
-
based method to substantially reduce
this cost. Historically, PCR primers are designed so that each primer occurs
uniquely in the genome. This condition is unnecessarily strong for select
ive
amplification, since only the primer pair associated with each amplification
need be unique. We demonstrate that careful design in a genome
-
level
amplification project permits us to save the cost of several thousand primers
over conventional approache
s.

Contact: {skiena, rohan}@cs.sunysb.edu


Paper17, Presentation21


Title:

Discovering Statistically Significant Biclusters in Gene Expression Data


Authors:

Amos Tanay, Roded Sharan, Ron Shamir


Abstract:

In gene expression data, a bicluster is a subset of the genes exhibiting consistent patterns over a subset of the conditions. We propose a new method to detect significant biclusters in large expression datasets. Our approach is graph theoretic coupled with statistical modeling of the data. Under plausible assumptions, our algorithm is polynomial and is guaranteed to find the most significant biclusters. We tested our method on a collection of yeast expression profiles and on a human cancer dataset. Cross validation results show high specificity in assigning function to genes based on their biclusters, and we are able to annotate in this way 196 uncharacterized yeast genes. We also demonstrate how the biclusters lead to detecting new concrete biological associations. In cancer data we are able to detect and relate finer tissue types than was previously possible. We also show that the method outperforms the biclustering algorithm of Cheng and Church (2000).

Contact: {amos,roded,rshamir}@tau.ac.il.

Availability: www.cs.tau.ac.il/~rshamir/biclust.html.


Paper18, Presentation22


Title:

Co-Clustering of Biological Networks and Gene Expression Data


Authors:

Daniel Hanisch, Alexander Zien, Ralf Zimmer, Thomas Lengauer


Abstract:

Motivation: Large scale gene expression data are often analyzed by clustering genes based on gene expression data alone, though a-priori knowledge in the form of biological networks is available. The use of this additional information promises to improve exploratory analysis considerably.

Results: We propose to construct a distance function which combines information from expression data and biological networks. Based on this function, we compute a joint clustering of genes and vertices of the network. This general approach is elaborated for metabolic networks. We define a graph distance function on such networks and combine it with a correlation-based distance function for gene expression measurements. A hierarchical clustering and an associated statistical measure is computed to arrive at a reasonable number of clusters. Our method is validated using expression data of the yeast diauxic shift. The resulting clusters are easily interpretable in terms of the biochemical network and the gene expression data and suggest that our method is able to automatically identify processes that are relevant under the measured conditions.

Contact: Daniel.Hanisch@scai.fhg.de


Paper19, Presentation23


Title:

Statistical process control for large scale microarray experiments


Authors:

Fabian Model, Thomas Koenig, Christian Piepenbrock, Peter Adorjan


Abstract:

Motivation: Maintaining and controlling data quality is a key problem in large scale microarray studies. Especially systematic changes in experimental conditions across multiple chips can seriously affect quality and even lead to false biological conclusions. Traditionally the influence of these effects can only be minimized by expensive repeated measurements, because a detailed understanding of all process relevant parameters seems impossible.

Results: We introduce a novel method for microarray process control that estimates quality solely based on the distribution of the actual measurements without requiring repeated experiments. A robust version of principal component analysis detects single outlier microarrays and only thereby enables the use of techniques from multivariate statistical process control. In particular, the T^2 control chart reliably tracks undesired changes in process relevant parameters. This can be used to improve the microarray process itself, limits necessary repetitions only to affected samples and therefore maintains quality in a cost effective way. We prove the power of the approach on 3 large sets of DNA methylation microarray data.
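
A minimal sketch of the T^2 (Hotelling) statistic underlying such a control chart, on synthetic chip-level quality features (the plain covariance estimate here is a simplification of the robust PCA step described above):

```python
# Hedged sketch: Hotelling T^2 per chip as a multivariate process-control score.
import numpy as np

rng = np.random.default_rng(0)
chips = rng.normal(size=(50, 5))          # 50 chips, 5 quality features each
chips[45:] += 2.0                         # simulate a drift in the last chips

mean = chips.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(chips, rowvar=False))

# T^2 for each chip: squared Mahalanobis distance from the overall mean.
centered = chips - mean
t2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
print(np.round(t2[:5], 2), np.round(t2[-5:], 2))
```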

Contact: Fabian.Model@epigenomics.com


Paper20, Presentation24


Title:

Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays


Authors:

John Tobler, Mike Molla, Jude Shavlik


Abstract:

Motivation: Microarrays are a fast and cost-effective method of performing thousands of DNA hybridization experiments simultaneously. DNA probes are typically used to measure the expression level of specific genes. Because probes greatly vary in the quality of their hybridizations, choosing good probes is a difficult task. If one could accurately choose probes that are likely to hybridize well, then fewer probes would be needed to represent each gene in a gene-expression microarray, and, hence, more genes could be placed on an array of a given physical size. Our goal is to empirically evaluate how successfully three standard machine-learning algorithms - naïve Bayes, decision trees, and artificial neural networks - can be applied to the task of predicting good probes. Fortunately it is relatively easy to get training examples for such a learning task: place various probes on a gene chip, add a sample where the corresponding genes are highly expressed, and then record how well each probe measures the presence of its corresponding gene. With such training examples, it is possible that an accurate predictor of probe quality can be learned.

Results: Two of the learning algorithms we investigate - naïve Bayes and neural networks - learn to predict probe quality surprisingly well. For example, in the top ten predicted probes for a given gene not used for training, on average about five rank in the top 2.5% of that gene's hundreds of possible probes. Decision-tree induction and the simple approach of using predicted melting temperature to rank probes perform significantly worse than these two algorithms. The features we use to represent probes are very easily computed and the time taken to score each candidate probe after training is minor. Training the naïve Bayes algorithm takes very little time, and while it takes over 10 times as long to train a neural network, that time is still not very substantial (on the order of a few hours on a desktop workstation). We also report the information contained in the features we use to describe the probes. We find the fraction of cytosine in the probe to be the most informative feature. We also find, not surprisingly, that the nucleotides in the middle of the probe's sequence are more informative than those at the ends of the sequence.
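
As an illustration of how cheaply such probe features can be computed, including the cytosine fraction reported above as most informative (the toy probes, labels and the Gaussian naive Bayes variant are assumptions, not the paper's data or exact model):

```python
# Hedged sketch: simple probe sequence features and a naive Bayes classifier.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def features(probe):
    n = len(probe)
    return [probe.count("C") / n,                         # cytosine fraction
            (probe.count("G") + probe.count("C")) / n,    # GC fraction
            probe.count("A") / n]

probes = ["ACGTACGTACGTACGTACGTACGT", "CCCCGGGGCCCCGGGGCCCCGGGG",
          "ATATATATATATATATATATATAT", "ACCGTTGCACCGTTGCACCGTTGC"]
quality = [1, 1, 0, 1]                    # toy labels: 1 = hybridizes well

X = np.array([features(p) for p in probes])
clf = GaussianNB().fit(X, quality)
print(clf.predict(X))
```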

Contact: molla@cs.wisc.edu



Paper21, Presentation26


Title:

The Degenerate Primer Design Problem


Authors:

Chaim Linhart, Ron Shamir


Abstract:

A PCR primer sequence is called degenerate if some of its positions have several possible bases. The degeneracy of the primer is the number of unique sequence combinations it contains. We study the problem of designing a pair of primers with prescribed degeneracy that match a maximum number of given input sequences. Such problems occur when studying a family of genes that is known only in part, or is known in a related species. We prove that various simplified versions of the problem are hard, show the polynomiality of some restricted cases, and develop approximation algorithms for one variant. Based on these algorithms, we implemented a program called HYDEN for designing highly-degenerate primers for a set of genomic sequences. We report on the success of the program in an experimental scheme for identifying all human olfactory receptor (OR) genes. In that project, HYDEN was used to design primers with degeneracies up to 10^10 that amplified with high specificity many novel genes of that family, tripling the number of OR genes known at the time.
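
The degeneracy definition above is easy to compute for a primer written in IUPAC codes: it is the product of the number of alternatives at each position (the example primer below is illustrative, not from the paper):

```python
# Hedged sketch: degeneracy of an IUPAC-coded primer = product of the number
# of plain bases each symbol can stand for.
from math import prod

IUPAC = {"A": 1, "C": 1, "G": 1, "T": 1, "R": 2, "Y": 2, "S": 2, "W": 2,
         "K": 2, "M": 2, "B": 3, "D": 3, "H": 3, "V": 3, "N": 4}

def degeneracy(primer):
    return prod(IUPAC[base] for base in primer)

print(degeneracy("ACGRYNNTGC"))   # 2 * 2 * 4 * 4 = 64
```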

Availability: Available on request from the authors

Contact: {chaiml,rshamir}@tau.ac.il


Paper22, Presentation27


Title:

Splicing Graphs and EST Assembly Problem


Authors:

Steffen Heber, Max Alekseyev, Sing-Hoi Sze, Haixu Tang, Pavel Pevzner


Abstract:

Motivation: The traditional approach to annotate alternative splicing is to investigate every splicing variant of the gene in a case-by-case fashion. This approach, while useful, has some serious shortcomings. Recent studies indicate that alternative splicing is more frequent than previously thought and some genes may produce tens of thousands of different transcripts. A list of alternatively spliced variants for such genes would be difficult to build and hard to analyze. Moreover, such a list does not show the relationships between different transcripts and does not show the overall structure of all transcripts. A better approach would be to represent all splicing variants for a given gene in a way that captures the relationships between different splicing variants.

Results: We introduce the notion of the splicing graph that is a natural and convenient representation of all splicing variants. The key difference with the existing approaches is that we abandon the linear (sequence) representation of each transcript and substitute it with a graph representation where each transcript corresponds to a path in the graph. We further design an algorithm to assemble EST reads into the splicing graph rather than assembling them into each splicing variant in a case-by-case fashion.
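
A toy sketch of the splicing-graph idea, with each transcript contributing a path through shared exon nodes rather than being kept as a separate linear sequence (the exon labels and variants below are invented for illustration):

```python
# Hedged sketch: building a splicing graph where each transcript is a path.
from collections import defaultdict

transcripts = [
    ["e1", "e2", "e4"],        # variant skipping exon 3
    ["e1", "e2", "e3", "e4"],  # full-length variant
    ["e1", "e3", "e4"],        # variant skipping exon 2
]

edges = defaultdict(set)
for path in transcripts:
    for a, b in zip(path, path[1:]):
        edges[a].add(b)        # each transcript corresponds to a path

for node in sorted(edges):
    print(node, "->", sorted(edges[node]))
```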

Availability: http://www-cse.ucsd.edu/groups/bioinformatics/software.html

Contact: sheber@ucsd.edu



Paper23, Presentation28


Title:

Exact Genetic Linkage
Computations for General Pedigrees


Authors:

Ma'ayan Fishelson, Dan Geiger


Abstract:

Motivation: Genetic linkage analysis is a useful statistical tool for mapping disease genes and for associating functionality of genes to their location on the chromosome. There is a need for a program that computes multipoint likelihood on general pedigrees with many markers that also deals with two-locus disease models.

Results: In this paper we present algorithms for performing exact multipoint likelihood calculations on general pedigrees with a large number of highly polymorphic markers, taking into account a variety of disease models. We have implemented these algorithms in a new computer program called SUPERLINK which outperforms leading linkage software with regards to functionality, speed, memory requirements and extensibility.

Availability: SUPERLINK is available at
http://bioinfo.cs.technion.ac.il/superlink

Contact: {fmaayan, dang}@cs.technion.ac.il


Paper24, Presentation29


Title:

Assigning Probes into A Small Number of Pools Separable by Electrophoresis


Authors:

Teemu Kivioja, Mikko Arvas, Kari Kataja, Merja Penttilä, Hans Söderlund, Esko Ukkonen


Abstract:

Motivation: Measuring transcriptional expression levels (transcriptional profiling) has become one of the most important methods in functional genomics. Still, new measuring methods are needed to obtain more reliable, quantitative data about transcription on a genomic scale. In this paper we concentrate on certain computational optimization problems arising in the design of one such novel method. From a computational point of view the key feature of the new method is that the hybridized probes are distinguished from each other based on their different size. Therefore the probes have to be assigned into pools such that the probes in the same pool have unique sizes different enough from each other. Identification of expressed RNA is given by probe pool and probe size while quantification is given by the label of the probe, e.g. fluorescence intensity.

Results: We show how to computationally find the probes and assign them into pools for a whole genome such that i) each gene has a specific probe suitable for amplification and hybridization, and ii) the expression level measurement can be done in a minimal number of pools separable by electrophoresis in order to minimize the total experiment cost of the measurement. Our main result is a polynomial-time approximation algorithm for assigning the probes into pools. We demonstrate the feasibility of the procedure by selecting probes for the yeast genome and assigning them into less than 100 pools. The probe sequences and their assignment into pools are available for academic research on request from the authors.

Contact: Teemu.Kivioja@cs.Helsinki.FI


Paper25, Presentation30


Title:

Representing Genetic Sequence Data for Pharmacogenomics: An Evolutionary Approach Using Ontological and Relational Models


Authors:

Daniel Rubin, Farhad Shafa, Diane Oliver, Micheal Hewett, Teri Klein, Russ
Altman


Abstract:

Motivation: The information model chosen to store biological data affects the types of queries possible, database performance, and difficulty in updating that information model. Genetic sequence data for pharmacogenetics studies can be complex, and the best information model to use may change over time. As experimental and analytical methods change, and as biological knowledge advances, the data storage requirements and types of queries needed may also change.

Results: We developed a model for genetic sequence and polymorphism data, and used XML schema to specify the elements and attributes required for this model. We implemented this model as an ontology in a frame-based representation and as a relational model in a database system. We collected genetic data from two pharmacogenetics resequencing studies, and formulated queries useful for analyzing these data. We compared the ontology and relational models in terms of query complexity, performance, and difficulty in changing the information model. Our results demonstrate benefits of evolving the schema for storing pharmacogenetics data: ontologies perform well in early design stages as the information model changes rapidly and simplify query formulation, while relational models offer improved query speed once the information model and types of queries needed stabilize.

Availability: Our ontology and relational models are available at
http://www.pharmgkb.org/data/schemas/.

Contact: rubin@smi.stanford.edu, russ.altman@stanford.edu, help@pharmgkb.org



Paper26, Presentation32


Title:

Evaluating Functional Network Inference Using Simulations of Complex Biological
Systems


Authors:

V. Anne Smith, Erich D. Jarvis, Alexander J. Hartemink


Abstract:

Motivation: Although many network inference algorithms have been presented in the bioinformatics literature, no suitable approach has been formulated for evaluating their effectiveness at recovering models of complex biological systems from limited data. To overcome this limitation, we propose an approach to evaluate network inference algorithms according to their ability to recover a complex functional network from biologically reasonable simulated data.

Results: We designed a simulator to generate data from a complex biological system at multiple levels of organization: behavior, neural anatomy, brain electrophysiology, and gene expression of songbirds. About 90% of the simulated variables are unregulated by other variables in the system and are included simply as distracters. We sampled the simulated data at intervals as one would sample from a biological system in practice, and then used the sampled data to evaluate the effectiveness of an algorithm we developed for functional network inference. We found that our algorithm is highly effective at recovering the functional network structure of the simulated system, including the irrelevance of unregulated variables, from sampled data alone. To assess the reproducibility of these results, we tested our inference algorithm on 50 separately simulated sets of data and it consistently recovered almost perfectly the complex functional network structure underlying the simulated data. To our knowledge, this is the first approach for evaluating the effectiveness of functional network inference algorithms at recovering models from limited data. Our simulation approach also enables researchers a priori to design experiments and data-collection protocols that are amenable to functional network inference.

Availability: Source code and simulated data are available upon request.

Contact: amink@cs.duke.edu, asmith@neuro.duke.edu, jarvis@neuro.duke.edu



Paper27, Presentation33


Title:

The Pathway Tools Software


Authors:

Peter Karp, Suzanne Paley, Pedro Romero


Abstract:

Motivation: Bioinformatics requires reusable software tools for creating model-organism databases (MODs).

Results: The Pathway Tools is a reusable, production-quality software environment for creating a type of MOD called a Pathway/Genome Database (PGDB). A PGDB such as EcoCyc (see ecocyc.org) integrates our evolving understanding of the genes, proteins, metabolic network, and genetic network of an organism. This paper provides an overview of the four main components of the Pathway Tools: The PathoLogic component supports creation of new PGDBs from the annotated genome of an organism. The Pathway/Genome Navigator provides query, visualization, and Web-publishing services for PGDBs. The Pathway/Genome Editors support interactive updating of PGDBs. The Pathway Tools ontology defines the schema of PGDBs. The Pathway Tools makes use of the Ocelot object database system for data management services for PGDBs. The Pathway Tools has been used to build PGDBs for 13 organisms within SRI and by external users.

Availability: The software is freely available to academics and is available for a fee to commercial institutions. Contact ptools-support@ai.sri.com for information on obtaining the software.

Contact: pkarp@ai.sri.com


Paper28, Presentation34


Title:

Discovering regulatory and signaling circuits in molecular interaction networks


Authors:

Trey Ideker, Owen Ozier, Benno Schwikowski, Andrew Siegel


Abstract:

Motivation: In model organisms such as yeast, large databases of protein
-
protein
and protein
-
DNA interactions have become an extremely important resource for the
study of pro
tein function, evolution, and gene regulatory dynamics. In this
paper we demonstrate that by integrating these interactions with widely
-
available mRNA expression data, it is possible to generate concrete hypotheses
for the underlying mechanisms governing t
he observed changes in gene expression.
To perform this integration systematically and at large scale, we introduce an
approach for screening a molecular interaction network to identify active
subnetworks, i.e., connected regions of the network that show s
ignificant
changes in expression over particular subsets of conditions. The method we
present here combines a rigorous statistical measure for scoring subnetworks
with a search algorithm for identifying subnetworks with high score.

Results: We evaluated our procedure on a small interaction network of 332 genes and a large network of 4160 genes containing all 7462 protein-protein and protein-DNA interactions in the yeast public databases. In the case of the small network, we identified five significant subnetworks that covered 41 out of 77 (53%) of all significant changes in expression. Both network analyses returned several top-scoring subnetworks with good correspondence to known regulatory mechanisms in the literature. These results demonstrate how large-scale genomic approaches may be used to uncover signaling and regulatory pathways in a systematic, integrative fashion.

Availability: The methods presented in this paper are implemented in the Cytoscape software package which is available to the academic community at www.cytoscape.org.

Contact: trey@wi.mit.edu



Paper29, Presentation35


Title:

Modelling Pathways in E.coli from Time Series Expression Profiles


Authors:

Irene Ong, David Page, Jeremy Glasner


Abstract:

Motivation: Cells continuously reprogram their gene expression network as they move through the cell cycle or sense changes in their environment. In order to understand the regulation of cells, time series expression profiles provide a more complete picture than single time point expression profiles. Few analysis techniques, however, are well suited to modeling such time series data.

Results: We describe an approach that naturally handles time series data with the capabilities of modeling causality, feedback loops, and environmental or hidden variables using a Dynamic Bayesian network. We also present a novel way of combining prior biological knowledge and current observations to improve the quality of analysis and to model interactions between sets of genes rather than individual genes. Our approach is evaluated on time series expression data measured in response to physiological changes that affect tryptophan metabolism in E. coli. Results indicate that this approach is capable of finding correlations between sets of related genes.

Contact: ong@cs.wisc.edu


Paper30, Presentation36


Title:

Of Truth and Pathways: Chasing Bits of Information through Myriads of Articles


Authors:

Michael Krauthammer, Carol Friedman, George Hripcsak, Ivan Iossifov, Andrey
Rzhetsky


Abstract:

Knowledge on interactions between molecules in living cells is indispensable for theoretical analysis and practical applications in modern genomics and molecular biology. Building such networks relies on the assumption that the correct molecular interactions are known or can be identified by reading a few research articles. However, this assumption does not necessarily hold, as truth is rather an emerging property based on many potentially conflicting facts. This paper explores the processes of knowledge generation and publishing in the molecular biology literature using modeling and analysis of real molecular interaction data. The data analyzed in this article was automatically extracted from 50,000 research articles in molecular biology using a computer system called GeneWays containing a natural language processing module. The paper indicates that truthfulness of statements is associated in the minds of scientists with the relative importance (connectedness) of substances under study, revealing a potential selection bias in the reporting of research results. Aiming at understanding the statistical properties of the life cycle of biological facts reported in research articles, we formulate a stochastic model describing generation and propagation of knowledge about molecular interactions through scientific publications. We hope that in the future such a model can be useful for automatically producing consensus views of molecular interaction data.

Contact: ar345@columbia.edu



Paper31, Presentation37


Title:

A Fast and Robust Method to Infer and Characterize an Active Regulator Set for Molecular Pathways


Authors:

Dana Pe'er, Aviv Regev, Amos Tanay


Abstract:

Regulatory relations between genes are an important component of molecular pathways. Here, we devise a novel global method that uses a set of gene expression profiles to find a small set of relevant active regulators, identifies the genes that they regulate, and automatically annotates them. We show that our algorithm is capable of handling a large number of genes in a short time and is robust to a wide range of parameters. We apply our method to a combined dataset of S. cerevisiae expression profiles, and validate the resulting model of regulation by cross-validation and extensive biological analysis of the selected regulators and their derived annotations.


Paper32, Presentation39


Title:

Marginalized Kernels for Biological Sequences


Authors:

Koji Tsuda, Taishin Kin, Kiyoshi Asai


Abstract:

Motivation: Kernel methods such as support vector machines require a kernel function between objects to be defined a priori. Several works have been done to derive kernels from probability distributions, e.g. the Fisher kernel. However, a general methodology to design a kernel is not fully developed.

Results: We propose a reasonable way of designing a kernel when objects are generated from latent variable models (e.g. HMM). First of all, a joint kernel is designed for complete data which include both visible and hidden variables. Then a marginalized kernel for visible data is obtained by taking the expectation with respect to hidden variables. We will show that the Fisher kernel is a special case of marginalized kernels, which gives another viewpoint to the Fisher kernel theory. Although our approach can be applied to any object, we particularly derive several marginalized kernels useful for biological sequences (e.g. DNA and proteins). The effectiveness of marginalized kernels is illustrated in the task of classifying bacterial gyrase subunit B (gyrB) amino acid sequences.
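
In symbols, the construction described above can be written as follows (a hedged restatement with my own notation, not necessarily the paper's exact formulation): the marginalized kernel is the expectation of the joint kernel over the hidden variables,

```latex
% x, x': visible sequences; h, h': hidden variables (e.g. HMM state paths);
% K_z: joint kernel on complete data; p(h|x): posterior over hidden variables.
K(x, x') = \sum_{h} \sum_{h'} p(h \mid x)\, p(h' \mid x')\, K_z\bigl((x, h), (x', h')\bigr)
```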

Contact: koji.tsuda@aist.go.jp


Paper33, Presentation40


Title:

A tree kernel to analyze phylogenetic profiles


Authors:

Jean-Philippe Vert


Abstract:

Motivation: The phylogenetic profile of a protein is a string that encodes the presence or absence of the protein in every fully sequenced genome. Because proteins that participate in a common structural complex or metabolic pathway are likely to evolve in a correlated fashion, the phylogenetic profiles of such proteins are often "similar" or at least "related" to each other. The question we address in this paper is the following: how to measure the "similarity" between two profiles, in an evolutionarily relevant way, in order to develop efficient function prediction methods?

Results: We show how the profiles can be mapped to a high-dimensional vector space which incorporates evolutionarily relevant information, and we provide an algorithm to compute efficiently the inner product in that space, which we call the tree kernel. The tree kernel can be used by any kernel-based analysis method for classification or data mining of phylogenetic profiles. As an application a Support Vector Machine (SVM) trained to predict the functional class of a gene from its phylogenetic profile is shown to perform better with the tree kernel than with a naive kernel that does not include any information about the phylogenetic relationships among species. Moreover a kernel principal component analysis (KPCA) of the phylogenetic profiles illustrates the sensitivity of the tree kernel to evolutionarily relevant variations.

Availability: All data and software used are freely and publicly available upon
request.

Contact: Jean-Philippe.Vert@mines.org


Paper34, Presentation41


Title:

Statistically Based Postprocessing of Phylogenetic Analysis by Clustering


Authors:

Cara Stockham, Li-San Wang, Tandy Warnow


Abstract:

Motivation: Phylogenetic analyses often produce thousands of candidate trees. Biologists resolve the conflict by computing the consensus of these trees. Single-tree consensus as postprocessing methods can be unsatisfactory due to their inherent limitations.

Results: In this paper we present an alternative approach by using clustering algorithms on the set of candidate trees. We propose bicriterion problems, in particular using the concept of information loss, and new consensus trees called characteristic trees that minimize the information loss. Our empirical study using four biological datasets shows that our approach provides a significant improvement in the information content, while adding only a small amount of complexity. Furthermore, the consensus trees we obtain for each of our large clusters are more resolved than the single-tree consensus trees. We also provide some initial progress on theoretical questions that arise in this context.

Availability: Software available upon request from the authors. The agglomerative clustering is implemented using Matlab (MathWorks, 2000) with the Statistics Toolbox. The Robinson-Foulds distance matrices and the strict consensus trees are computed using PAUP (Swofford, 2001) and Daniel Huson's tree library on Intel Pentium workstations running Debian Linux.

Contact: lisan@cs.utexas.edu, telephone: +1 (512) 232-7455, fax: +1 (512) 471-8885.

Supplementary Information: http://www.cs.utexas.edu/users/lisan/ismb02/


Paper35, Presentation42


Title:

Efficiently Detecting Polymorphisms During The Fragment Assembly Process


Authors:

Daniel Fasulo, Aaron Halpern, Ian Dew, Clark Mobarry


Abstract:

Motivation: Current genomic sequence assemblers assume that the input data is derived from a single, homogeneous source. However, recent whole-genome shotgun sequencing projects have violated this assumption, resulting in input fragments covering the same region of the genome whose sequences differ due to polymorphic variation in the population. While single-nucleotide polymorphisms (SNPs) do not pose a significant problem to state-of-the-art assembly methods, these methods do not handle insertion/deletion (indel) polymorphisms of more than a few bases.

Results: This paper describes an efficient method for detecting sequence discrepancies due to polymorphism that avoids resorting to global use of more costly, less stringent affine sequence alignments. Instead, the algorithm uses graph-based methods to determine the small set of fragments involved in each polymorphism and performs more sophisticated alignments only among fragments in that set. Results from the incorporation of this method into the Celera Assembler are reported for the D. melanogaster, H. sapiens, and M. musculus genomes.

Availability: The method described herein does not constitute a stand-alone software application, but is laid out in sufficient detail to be implemented as a component of any genomic sequence assembler.

Contact: daniel.fasulo@celera.com


Paper36, Presentation43


Title:

Multiple Genome Rearrangement: A General Approach via the Evolutionary Genome
Graph


Authors:

Dmitry Korkin, Lev Goldfarb


Abstract:

Motivation: In spite of a well-known fact that genome rearrangements are supposed to be viewed in the light of the evolutionary relationships within and between the species involved, no formal underlying framework based on the evolutionary considerations for treating the questions arising in the area has been proposed. If such an underlying framework is provided, all the basic questions in the area can be posed in a biologically more appropriate and useful form: e.g., the similarity between two genomes can then be computed via the nearest ancestor, rather than "directly", ignoring the evolutionary connections.

Results: We outline an evolution-based general framework for answering
questions related to multiple genome rearrangement. In the proposed model,
the evolutionary genome graph (EG-graph) encapsulates the evolutionary
history of a genome family. For the set of all EG-graphs, we introduce a
family of similarity measures, each defined via a set of genome
transformations associated with a particular EG-graph. Given a set of
genomes, and restricting ourselves to transpositions, an algorithm for
constructing an EG-graph is presented. We also present experimental results
in the form of an EG-graph for a set of concrete genomes (for several
species). This EG-graph turns out to be very close to the corresponding known
phylogenetic tree.
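Purely as an illustration of the "similarity via the nearest ancestor" idea
(the graph, node names, and edge-count distance below are invented, not the
paper's formal measures), two genomes can be compared through their nearest
common ancestor in an evolutionary graph:

    # Illustrative only: distance between two genomes measured through their
    # nearest ancestor in a hypothetical evolutionary graph (child -> parent).
    parent = {"gA": "anc1", "gB": "anc1", "anc1": "root", "gC": "root"}

    def ancestors(node):
        chain = [node]
        while node in parent:
            node = parent[node]
            chain.append(node)
        return chain

    def distance_via_nearest_ancestor(x, y):
        chain_x, chain_y = ancestors(x), ancestors(y)
        for steps_x, node in enumerate(chain_x):
            if node in chain_y:
                return steps_x + chain_y.index(node)   # edges from x plus edges from y
        return None

    print(distance_via_nearest_ancestor("gA", "gB"))   # 2 (meet at anc1)
    print(distance_via_nearest_ancestor("gA", "gC"))   # 3 (meet at root)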

Contact: dkorkin@unb.ca



Paper37, Presentation44


Title:

Efficient multiple genome alignment


Authors:

Michael Hoehl, Stefan Kurtz, Enno Ohlebusch


Abstract:

Motivation: To allow a direct comparison of the genomic DNA sequences of
sufficiently similar
organisms, there is an urgent need for software tools that
can align more than two genomic sequences.

Results: We developed new algorithms and a software tool, "Multiple Genome
Aligner" (MGA for short), that efficiently computes multiple genome
alignments of large, closely related DNA sequences. For example, it can align
85% of the complete genomes of six human adenoviruses (average length
35,305 bp) in 159 seconds. An alignment of 74% of the complete genomes of
three strains of E. coli (lengths: 5,528,445; 5,498,450; 4,639,221 bp) is
produced in 30 minutes.

Availability: The software MGA is available free of charge for non-commercial
research institutions. For details see
http://bibiserv.techfak.uni-bielefeld.de/mga/

Contact: {kurtz, enno}@techfak.uni-bielefeld.de


Paper38, Presentation46


Title:

PSEUDOVIEWER: Automatic Visualization of RNA Pseudoknots


Authors:

Kyungsook Han, Yujin Lee, Wootaek Kim


Abstract:

Motivation: Several algorithms have been developed for drawing RNA secondary
structures; however, none of these can be used to draw RNA pseudoknot
structures. In the sense of graph theory, a drawing of RNA secondary
structures is a tree, whereas a drawing of RNA pseudoknots is a graph with
inner cycles within a pseudoknot as well as possible outer cycles formed
between a pseudoknot and other structural elements. Thus, RNA pseudoknots are
more difficult to visualize than RNA secondary structures. Since no automatic
method for drawing RNA pseudoknots exists, visualizing RNA pseudoknots relies
on a significant amount of manual work and does not yield satisfactory
results. The task of visualizing RNA pseudoknots by hand becomes more
challenging as the size and complexity of the RNA pseudoknots increase.
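To make the graph-theoretic distinction concrete, the small check below
(illustrative only, not part of PSEUDOVIEWER) tests for the crossing base
pairs that turn a secondary-structure drawing from a tree into a graph with
cycles:

    # Illustrative only: base pairs (i, j) and (k, l) cross (form a pseudoknot)
    # when i < k < j < l; purely nested or disjoint pairs stay tree-like.
    def has_pseudoknot(pairs):
        for i, j in pairs:
            for k, l in pairs:
                if i < k < j < l:
                    return True
        return False

    nested = [(1, 20), (5, 15)]       # a hairpin nested in a hairpin: tree-like
    crossing = [(1, 10), (5, 15)]     # crossing pairs, as in an H-type pseudoknot
    print(has_pseudoknot(nested))     # False
    print(has_pseudoknot(crossing))   # True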

Results: We have developed a new representation and an algorithm for drawing
H-type pseudoknots with RNA secondary structures. Compared to existing
representations of H-type pseudoknots, the new representation ensures uniform
and clear drawings with no edge crossing for all H-type pseudoknots. To the
best of our knowledge, this is the first algorithm for automatically drawing
RNA pseudoknots with RNA secondary structures. The algorithm has been
implemented in a Java program, which can be executed on any computing system.
Experimental results demonstrate that the algorithm generates an
aesthetically pleasing drawing of all H-type pseudoknots. The results have
also shown that the drawing has high readability, enabling the user to
quickly and easily recognize the whole RNA structure as well as the
pseudoknots themselves.

Availability: Available on request from the corresponding author.

Contact: khan@inha.ac.kr



Paper39, Presentation47


Title:

A powerful non
-
homology method for the prediction of operons in prokaryotes


Authors:

Gabriel Moreno-Hagelsieb, Julio Collado-Vides


Abstract:

Motivation: The prediction of the transcription unit organization of genomes
is an important clue in the inference of functional relationships of genes,
the interpretation and evaluation of transcriptome experiments, and the
overall inference of the regulatory networks governing the expression of
genes in response to the environment. Though several methods have been
devised to predict operons, most require extensive characterization of the
genome analyzed. Log-likelihoods derived from inter-genic distance
distributions work surprisingly well to predict operons in Escherichia coli
and are available for any genome as soon as the gene sets are predicted.
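A minimal sketch of how such a distance-based log-likelihood could be scored
for an adjacent gene pair (the counts below are fabricated for illustration,
not the authors' distributions):

    # Illustrative only: log-likelihood that an adjacent, same-strand gene pair
    # belongs to the same operon, from fabricated intergenic-distance counts
    # (negative distances are overlaps, e.g. -4 and -1 bp).
    import math

    within_operon = {-4: 30, -1: 25, 5: 15, 20: 10, 80: 5}    # hypothetical counts
    at_boundary   = {-4: 2,  -1: 3,  5: 5,  20: 15, 80: 40}   # hypothetical counts

    def log_likelihood(distance):
        p_within = within_operon.get(distance, 1) / sum(within_operon.values())
        p_border = at_boundary.get(distance, 1) / sum(at_boundary.values())
        return math.log(p_within / p_border)

    print(round(log_likelihood(-4), 2))   # positive: distance typical of operon pairs
    print(round(log_likelihood(80), 2))   # negative: distance typical of unit borders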

Results: Here we provide evidence that the very same method is applicable to
any prokaryotic genome. First, the method has the same efficiency when
evaluated using a collection of experimentally known operons of Bacillus
subtilis. Second, operons among most if not all prokaryotes seem to have the
same tendency to keep short distances between their genes, the most frequent
distances being overlaps of four and one base pairs. The universality of this
structural feature allows us to predict the organization of transcription
units in all prokaryotes. Third, predicted operons contain a higher
proportion of genes with related phylogenetic profiles and conservation of
adjacency than predicted borders of transcription units.

Contact: moreno@cifn.unam.mx

Supplementary information: Additional materials and graphs are available at:
http://www.cifn.unam.mx/moreno/pub/TUpredictions/.



Paper40, Presentation48


Title:

Identifying Operons and Untranslated Regions of Transcripts Using Escherichia
coli RNA Expression Analysis


Authors:

Brian Tjaden, David Haynor, Sergey Stolyar, Carsten Rosenow, Eugene Kolker


Abstract:

Microarrays traditionally have been used to assay the transcript expression
of coding regions of genes. Here, we use Escherichia coli oligonucleotide
microarrays to assay transcript expression of both open reading frames (ORFs)
and intergenic regions. We then use hidden Markov models to analyze this
expression data and estimate transcription boundaries of genes. This approach
allows us to identify 5' untranslated regions (UTRs) of transcripts as well
as genes which are likely to be operon members. The operon elements we
identify correspond to documented operons with 99% specificity and 63%
sensitivity. Similarly, we find that our 5' UTR results accurately coincide
with experimentally verified promoter regions for most genes.
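As a toy illustration of the general approach (a two-state model with
fabricated parameters, not the paper's HMM), Viterbi decoding over binarized
probe signals marks the extent of a transcribed region spanning ORF and
intergenic probes:

    # Toy two-state HMM ("off" / "on" = transcribed), decoded with Viterbi over
    # binarized probe calls; parameters are fabricated for illustration.
    import math

    states = ["off", "on"]
    start = {"off": 0.5, "on": 0.5}
    trans = {"off": {"off": 0.9, "on": 0.1}, "on": {"off": 0.1, "on": 0.9}}
    emit = {"off": {0: 0.8, 1: 0.2}, "on": {0: 0.2, 1: 0.8}}

    def viterbi(obs):
        V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
        back = []
        for o in obs[1:]:
            col, ptr = {}, {}
            for s in states:
                prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
                col[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
                ptr[s] = prev
            V.append(col)
            back.append(ptr)
        path = [max(states, key=lambda s: V[-1][s])]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return list(reversed(path))

    # 1 = probe signal above background, 0 = below; the "on" run is the transcript.
    print(viterbi([0, 0, 1, 1, 1, 1, 0, 0]))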

Contact: tjaden@cs.washington.edu



Paper41, Presentation49


Title:

Detecting Recombination with MCMC


Authors:

Dirk Husmeier, Grainne McGuire


Abstract:

Motivation: We present a statistical method for detecting recombination,
whose objective is to accurately locate the recombinant breakpoints in DNA
sequence alignments of small numbers of taxa (4 or 5). Our approach
explicitly models the sequence of phylogenetic tree topologies along a
multiple sequence alignment. Inference under this model is done in a Bayesian
way, using Markov chain Monte Carlo (MCMC). The algorithm returns the
site-dependent posterior probability of each tree topology, which is used for
detecting recombinant regions and locating their breakpoints.
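Purely to illustrate the kind of output described here (not the BARCE sampler
itself; the samples below are invented), site-dependent posterior
probabilities can be summarized from MCMC samples of per-site topology
assignments:

    # Illustrative only: summarize (hypothetical) MCMC samples of per-site tree
    # topologies into site-dependent posterior probabilities.
    from collections import Counter

    mcmc_samples = [            # each sample assigns a topology to every column
        ["T1", "T1", "T1", "T2", "T2"],
        ["T1", "T1", "T2", "T2", "T2"],
        ["T1", "T1", "T1", "T2", "T2"],
        ["T1", "T3", "T1", "T2", "T2"],
    ]

    n_sites = len(mcmc_samples[0])
    for site in range(n_sites):
        counts = Counter(sample[site] for sample in mcmc_samples)
        posterior = {t: c / len(mcmc_samples) for t, c in counts.items()}
        print(site, posterior)  # a shift in the dominant topology flags a breakpoint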

Results: The method was tested on one synthetic and three real DNA sequence
alignments, where it was found to outperform the established detection
methods PLATO, RECPARS, and TOPAL.

Availability: The algorithm has been implemented in the C++ program package
BARCE, which is freely available from
http://www.bioss.sari.ac.uk/~dirk/software

Contact: dirk@bioss.ac.uk


Paper42, Presentation50


Title:

Finding Composite Regulatory Patterns in DNA Sequences


Authors:

Eleazar Eskin, Pavel Pevzner


Abstract:

Motivation: Pattern discovery in unaligned DNA sequences is a fundamental
problem in computational biology with important applications in finding
regulatory signals. Current approaches to pattern discovery focus on monad
patterns that correspond to relatively short contiguous strings. However,
many of the actual regulatory signals are composite patterns that are groups
of monad patterns occurring near each other. A difficulty in discovering
composite patterns is that one or both of the component monad patterns in the
group may be "too weak". Since traditional monad-based motif finding
algorithms usually output one (or a few) high-scoring patterns, they often
fail to find composite regulatory signals consisting of weak monad parts. In
this paper, we present MITRA (MIsmatch TRee Algorithm), an approach for
discovering composite signals. We demonstrate that MITRA performs well for
both monad and composite patterns by presenting experiments over biological
and synthetic data.
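For intuition only (a naive scan, not the mismatch-tree search that MITRA
actually performs; the sequences and monads below are made up), a composite
pattern can be treated as two monads separated by a bounded gap, each allowed
a few mismatches:

    # Illustrative brute-force scan: does a sequence contain monad1 and monad2,
    # each within d mismatches, separated by gap_min..gap_max bases?
    def mismatches(a, b):
        return sum(x != y for x, y in zip(a, b))

    def has_composite(seq, monad1, monad2, d=1, gap_min=2, gap_max=6):
        k1, k2 = len(monad1), len(monad2)
        for i in range(len(seq) - k1 + 1):
            if mismatches(seq[i:i + k1], monad1) > d:
                continue
            for gap in range(gap_min, gap_max + 1):
                j = i + k1 + gap
                if j + k2 <= len(seq) and mismatches(seq[j:j + k2], monad2) <= d:
                    return True
        return False

    # Hypothetical sequences with a weak "TTGACA ... TATAAT" composite signal.
    seqs = ["CCTTGACAGGCTATAAAATCC", "GGTTCACATACGTATAATAA", "ACGTACGTACGTACGTACGT"]
    print([has_composite(s, "TTGACA", "TATAAT") for s in seqs])  # [True, True, False]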

Availability: MITRA is available at http://www.cs.columbia.edu/compbio/mitra/

Contact: eeskin@cs.columbia.edu