Bioinformatics Biocomputing - ERCIM

European Research Consortium for Informatics and Mathematics
Number 43 October 2000
www.ercim.org
Special Theme: Bioinformatics, Biocomputing
ERCIM offers postgraduate Fellowships: page 4
CONTENTS
KEYNOTE
3 by Michael Ashburner

JOINT ERCIM ACTIONS
4 ERCIM has launched the 2001/2002 Fellowship Programme

EUROPEAN SCENE
5 Irish Government invests over €635 Million in Basic Research

SPECIAL THEME: BIOINFORMATICS
6 Bioinformatics: From the Pre-genomic to the Post-genomic Era
by Thomas Lengauer
8 Computational Genomics: Making Sense of Complete Genomes
by Anton Enright, Sophia Tsoka and Christos Ouzounis
10 Searching for New Drugs in Virtual Molecule Databases
by Matthias Rarey and Thomas Lengauer
12 A High Performance Computing Network for Protein Conformation Simulations
by Marco Pellegrini
13 Ab Initio Methods for Protein Structure Prediction: A New Technique based on Ramachandran Plots
by Anna Bernasconi
15 Phylogenetic Tree Reconciliation: the Species/Gene Tree Problem
by Jean-François Dufayard, Laurent Duret and François Rechenmann
16 Identification of Drug Target Proteins
by Alexander Zien, Robert Küffner, Theo Mevissen, Ralf Zimmer and Thomas Lengauer
18 Modeling and Simulation of Genetic Regulatory Networks
by Hidde de Jong, Michel Page, Céline Hernandez, Hans Geiselmann and Sébastien Maza
19 Bioinformatics for Genome Analysis in Farm Animals
by Andy S. Law and Alan L. Archibald
21 Modelling Metabolism Knowledge using Objects and Associations
by Hélène Rivière-Rolland, Loïc Taloc, Danielle Ziébelin, François Rechenmann and Alain Viari
22 Co-operative Environments for Genomes Annotation: from Imagene to Geno-Annot
by Claudine Médigue, Yves Vandenbrouke, François Rechenmann and Alain Viari
23 Arevir: Analysis of HIV Resistance Mutations
by Niko Beerenwinkel, Joachim Selbig, Rolf Kaiser and Daniel Hoffmann
24 Human Brain Informatics – Understanding Causes of Mental Illness
by Stefan Arnborg, Ingrid Agartz, Mikael Nordström, Håkan Hall and Göran Sedvall
26 Intelligent Post-Genomics
by Francisco Azuaje
27 Combinatorial Algorithms in Computational Biology
by Marie-France Sagot
28 Crossroads of Mathematics, Informatics and Life Sciences
by Jan Verwer and Annette Kik

SPECIAL THEME: BIOCOMPUTING
30 Biomolecular Computing
by John McCaskill
32 The European Molecular Computing Consortium
by Grzegorz Rozenberg
33 Configurable DNA Computing
by John McCaskill
35 Molecular Computing Research at Leiden Center for Natural Computing
by Grzegorz Rozenberg
36 Cellular Computing
by Martyn Amos and Gerald G. Owenson
37 Research in Theoretical Foundations of DNA Computing
by Erzsébet Csuhaj-Varjú and György Vaszil
38 Representing Structured Symbolic Data with Self-organizing Maps
by Igor Farkas
39 Neurobiology keeps Inspiring New Neural Network Models
by Lubica Benuskova

RESEARCH AND DEVELOPMENT
41 Contribution to Quantitative Evaluation of Lymphoscintigraphy of Upper Limbs
by Petr Gebousky, Miroslav Kárny and Hana Křížová
42 Approximate Similarity Search
by Giuseppe Amato
43 Education of ‘Information Technology Non-professionals’ for the Development and Use
of Information Systems
by Peter Mihók, Vladimír Penjak and Jozef Bucko
45 Searching Documentary Films On-line: the ECHO Project
by Pasquale Savino
46 IS4ALL: A New Working Group promoting Universal Design in Information Society Technologies
by Constantine Stephanidis

TECHNOLOGY TRANSFER
47 Identifying Vehicles on the Move
by Beate Koch
48 Virtual Planetarium at Exhibition ‘ZeitReise’
by Igor Nikitin and Stanislav Klimenko

EVENTS
50 6th Eurographics Workshop on Virtual Environments
by Robert van Liere
50 Trinity College Dublin hosted the Sixth European Conference on Computer Vision – ECCV’2000
by David Vernon
51 Fifth Workshop of the ERCIM Working Group on Constraints
by Eric Monfroy
52 Announcements
55 IN BRIEF

Cover image: Dopamine-D1 receptors in the human brain, by Håkan Hall, Karolinska Institutet,
see article ‘Human Brain Informatics – Understanding Causes of Mental Illness’ on page 24.
ERCIM News No. 43, October 2000

KEYNOTE
Some revolutions in science come when you least expect them. Others are forced upon us.
Bioinformatics is a revolution forced by the extraordinary advances in DNA sequencing
technologies, in our understanding of protein structures and by the necessary growth of biological
databases. Twenty years ago pioneers such as Doug Brutlag in Stanford and Roger Staden in Cambridge
began to use computational methods to analyse the very small DNA sequences then determined. Pioneer
efforts were made in 1974 by Bart Barrell and Brian Clarke to catalogue the first few nucleic acid
sequences that had been determined. A few years later, in the early 1980s, first the European
Molecular Biology Laboratory (EMBL) and then the US National Institutes of Health (NIH) established
computerised data libraries for nucleic acid sequences. The first release of the EMBL data library
was 585,433 bases; it is 9,678,428,579 on the day I write this, and doubling every 10 months or so.
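
As a rough illustration of the growth figures quoted above, the following Python sketch works out how many doublings separate the first EMBL release from the size quoted in the text, and projects the size forward. It assumes pure exponential growth with a constant 10-month doubling time, which is an idealisation; the variable and function names are hypothetical and only the three numbers come from the text.

```python
import math

# Figures quoted in the keynote.
first_release_bases = 585_433        # first release of the EMBL data library
current_bases = 9_678_428_579        # size "on the day I write this"
doubling_months = 10                 # "doubling every 10 months or so"

# How many times has the library doubled since its first release?
growth_factor = current_bases / first_release_bases
doublings = math.log2(growth_factor)


def projected_size(months_ahead: float) -> float:
    """Projected size in bases, ASSUMING the doubling time stays constant."""
    return current_bases * 2 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    print(f"growth factor: {growth_factor:,.0f}x, i.e. {doublings:.1f} doublings")
    print(f"projected size in 30 months: {projected_size(30):,.0f} bases")
```

The library has grown by roughly 14 doublings since its first release. Under the stated assumption it would be eight times its present size after a further 30 months.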

Bioinformatics is a peculiar trade since, until very recently, most in the field were trained in
other fields – computer science, physics, linguistics, genetics, etc. The term will include database
curators and algorithmists, software engineers and molecular evolutionists, graph theorists and
geneticists. By and large their common characteristic is a desire to understand biology through the
organisation and analysis of molecular data, especially those concerned with macromolecular sequence
and structure. They rely absolutely on a common infrastructure of public databases and shared
software. It has proven, in the USA, Japan and Europe, to be most effective to provide this
infrastructure by a mix of major public domain institutions, academic centres of excellence and
industrial research. Indeed, such are the economies of scale for both data providers and data users
that it has proved to be effective to collect the major data classes, nucleic acid and protein
sequence, protein structure co-ordinates, by truly global collaborative efforts.

Michael Ashburner, Joint-Head of the European Bioinformatics Institute.

In Europe the major public domain institute devoted to bioinformatics is the European
Bioinformatics Institute, an Outstation of the EMBL. Located adjacent to the Sanger Centre just
outside Cambridge, this is the European home of the major international nucleic acid sequence and
protein structure databases, as well as the world’s premier protein sequence database. Despite the
welcome growth of national centres of excellence
in bioinformatics in Europe these major infrastructural projects must be supported centrally. The EBI
is a major database innovator, eg its proposed ArrayExpress database for microarray data, and software
innovator, eg its SRS system. Jointly with the Sanger Centre the EBI produces the highest quality
automatic annotation of the emerging human genome sequence (Ensembl).
To the surprise of the EBI, and many others, attempts to fund these activities at any serious level
through the programmes of the European Commission were rebuffed in 1999. Under Framework IV
the European Commission had funded databases at the EBI; despite an increased funding to the area
of ‘infrastructure’ generally the EBI was judged ineligible for funding under Framework Programme
V. Projects internationally regarded as excellent, such as the ArrayExpress database, simply lack
funding. The failure of the EC to fund the EBI in 1999 led to a major funding crisis which remains to
be resolved for the long term, although the Member States of EMBL have stepped in with emergency
funds, and are considering a substantial increase in funding for the longer term.
It is no coincidence that the number of ‘start-up’ companies in the fields of bioinformatics and genomics
in the USA is many times that in Europe. There, there is a commitment to funding both national
institutions (the budget of the US National Center for Biotechnology Information is three times that
of the EBI) and academic groups. What we so desperately need, if we are going to have any chance
of competing with our American cousins over the long term in bioinformatics, genomics and science
in general, is a European Science Council with a consistent and science-led policy with freedom from
political and nationalistic interference. The funding of science through the present mechanisms in place
in Brussels is failing both the community and the Community.
Michael Ashburner

JOINT ERCIM ACTIONS
ERCIM has launched the 2001/2002 Fellowship Programme

ERCIM offers postdoctoral fellowships in leading European information technology research centres.
The Fellowships are of 18 months duration, to be spent in two research centres. Next deadline for
applications: 31 October 2000.

The ERCIM Fellowship Programme was established in 1990 to enable young scientists from around the
world to perform research at ERCIM institutes. For the 2001/2002 Programme, applications are
solicited twice, with deadlines of 31 October 2000 and 30 April 2001.

Topics
This year, the ERCIM Fellowship programme focuses on the following topics:
• Multimedia Systems
• Database Research
• Programming Language Technologies
• Constraints Technology and Application
• Control and Systems Theory
• Formal Methods
• Electronic Commerce
• User Interfaces for All
• Environmental Modelling
• Health and Information Technology
• Networking Technologies
• E-Learning
• Web Technology, Research and Application
• Software Systems Validation
• Computer Graphics
• Mathematics in Computer Science
• Robotics
• others.

Objectives
The objective of the Programme is to enable bright young scientists to work collectively on a
challenging problem as fellows of an ERCIM institute. In addition, an ERCIM fellowship helps widen
and intensify the network of personal relations and understanding among scientists. The Programme
offers the opportunity:
• to improve the knowledge about European research structures and networks
• to become familiar with working conditions in leading European research centres
• to promote co-operation between research groups working in similar areas in different
laboratories, through the fellowships.

Selection Procedure
Each application is reviewed by one or more senior scientists in each ERCIM institute. ERCIM
representatives will select the candidates taking into account the quality of the applicant, the
overlap of interest between applicant and the hosting institution and the available funding.

Conditions
Candidates must:
• have a PhD degree (or equivalent), or be in the last year of the thesis work with an outstanding
academic record
• be fluent in English
• be discharged or get deferment from military service
• start the grant before October 2001.

Fellowships are of 18 months duration, spent in two of the ERCIM institutes. ERCIM offers a
competitive salary which may vary depending on the country. Costs for travelling to and from the
institutes will be paid. To encourage mobility, a member institution will not be eligible to host a
candidate of the same nationality.

Poster for the 2001/2002 ERCIM Fellowship Programme.

Links:
Detailed description and online application form: http://www.ercim.org/activity/fellows/

Please contact:
Aurélie Richard – ERCIM Office
Tel: +33 4 92 38 50 10
E-mail: aurelie.richard@ercim.org

EUROPEAN SCENE
Irish Government invests over €635 Million in Basic Research

Science Foundation Ireland, the National Foundation for Excellence in Scientific Research, was
launched by the Irish Government to establish Ireland as a centre of research excellence in
strategic areas relevant to economic development, particularly Biotechnology and Information and
Communications Technologies (ICT). The foundation has over €635 million at its disposal.

The Technology Foresight Reports published in 1999 had recommended that the Government establish a
major fund to develop Ireland as a centre for world class research excellence in strategic niches of
Biotechnology and ICT. As part of its response, the Government approved a Technology Foresight Fund
of over €635 million for investment in research in the years 2000-2006. Of this fund, €63 million
has been allocated to set up a new research and development institute located in Dublin in
partnership with Massachusetts Institute of Technology. The new institute will be known as Media
Lab Europe and will specialise in multimedia, digital content and internet technologies.

This Fund is part of a €2.5 billion initiative on R&D that the Irish Government has earmarked for
Research, Technology and Innovation (RTI) activities in the National Development Plan 2000-2006.
Science Foundation Ireland is responsible for the management, allocation, disbursement and
evaluation of expenditure of the Technology Foresight Fund. The Foundation will be set up initially
as a sub-Board of Forfás, the National Policy and Advisory Board for Enterprise, Trade, Science,
Technology and Innovation.

Speaking at the launch of the first call for Proposals to the Foundation on the 27th of July, 2000,
Ireland’s Deputy Prime Minister, Mary Harney, said that the Irish Government was keen to establish
Ireland as a centre of research excellence in ICT and Biotechnology. “We wish to attract the best
scientific brains available in the international research community, particularly in the areas of
Biotechnology and ICT, to develop their research in Ireland. The large amount of funding being made
available demonstrates the Irish government’s commitment to this vitally important project.”

Advisory Panels with international experts in Biotechnology and Information and Communications
Technologies (ICT) have been set up to advise on the overall strategy of Science Foundation Ireland,
including Dr. Gerard van Oortmerssen, chairman of ERCIM, who comments: “The decision by the Irish
Government to make a major strategic investment in fundamental research in ICT shows vision and
courage and is an example for other European countries. It is fortunate that Ireland recently
joined ERCIM. We are looking forward to co-operating with a strong research community in Ireland.”

The aim of the first call for proposals is to identify and fund, at a level of up to €1.3 million
per year, a small number of outstanding researchers and their teams who will carry out their work
in public research organisations in Ireland. Applications are invited not only from Irish
scientists at home and abroad but also from the global research community. Selection will be by an
international peer review system. The funding awards will cover the cost of research teams,
possibly up to 12 people, over a three to five year period. The SFI Principal Investigator and
his/her team will function within a research body in Ireland, either in an Irish University,
Institute of Technology or public research organisation. International co-operation will be
encouraged.

This initiative will be observed with interest by other European countries, and investment
decisions made now will have far-reaching effects for future research in Ireland.

At the launch of the First Call for Proposals on 27th July were (left to right): Mr. Paul Haran,
Secretary General, Dept. of Enterprise, Trade and Employment; Ms. Mary Harney, T.D., Deputy Prime
Minister and Minister for Enterprise, Trade and Employment; Mr. Noel Treacy, T.D., Minister for
Science, Technology and Commerce; and Mr. John Travers, Chief Executive Officer, Forfás.

Links:
Science Foundation Ireland: http://www.sfi.ie
Media Lab Europe: http://www.mle.ie

Please contact:
Josephine Lynch – Science Foundation Ireland
Tel: +353 1 6073 200
E-mail: info@sfi.ie, Josephine.Lynch@forfas.ie

SPECIAL THEME: BIOINFORMATICS
Bioinformatics: From the Pre-genomic to the Post-genomic Era
by Thomas Lengauer

Computational Biology and Bioinformatics are terms for an interdisciplinary field joining
information technology and biology that has skyrocketed in recent years. The field is located at
the interface between the two scientific and technological disciplines that can be argued to drive
a significant if not the dominating part of contemporary innovation. In the English language,
Computational Biology refers mostly to the scientific part of the field, whereas Bioinformatics
addresses more the infrastructure part. In other languages (eg German) Bioinformatics covers both
aspects of the field.

The goal of this field is to provide computer-based methods for coping with and interpreting the
genomic data that are being uncovered in large volumes within the diverse genome sequencing
projects and other new experimental technology in molecular biology. The field presents one of the
grand challenges of our times. It has a large basic research aspect, since we cannot claim to be
close to understanding biological systems on an organism or even cellular level. At the same time,
the field is faced with a strong demand for immediate solutions, because the genomic data that are
being uncovered encode many biological insights whose deciphering can be the basis for dramatic
scientific and economic success. With the pre-genomic era that was characterized by the effort to
sequence the human genome just being completed, we are entering the post-genomic era that
concentrates on harvesting the fruits hidden in the genomic text. In contrast to the pre-genomic
era which, from the announcement of the quest to sequence the human genome to its completion, has
lasted less than 15 years, the post-genomic era can be expected to last much longer, probably
extending over several generations.

At the basis of the scientific grand challenge in computational biology are problems such as
identifying genes in DNA sequences and determining the three-dimensional structure of proteins
given the protein sequence (the famed protein folding problem). Other unsolved mysteries include
the computational estimation of free energies of biomolecules and molecular complexes in aqueous
solution as well as the modeling and simulation of molecular interaction networks inside the cell
and between cells. Solving these problems is essential for an accurate and effective analysis of
disease processes by computer.

Besides these more ‘timeless’ scientific problems, there is a significant part of computational
biology that is driven by new experimental data provided through the dramatic progress in molecular
biology techniques. Starting with genomic sequences, the past few years have provided gene
expression data on the basis of ESTs (expressed sequence tags) and DNA microarrays (DNA chips).
These data have given rise to a very active new subfield of computational biology called expression
data analysis. These data go beyond a generic view on the genome and are able to distinguish
between gene populations in different tissues of the same organism and in different states of cells
belonging to the same tissue. For the first time, this affords a cell-wide view of the metabolic
and regulatory processes under different conditions. Therefore these data are believed to be an
effective basis for new diagnoses and therapies of diseases.

Eventually genes are transformed into proteins inside the cell, and it is mostly the proteins that
govern cellular processes. Often proteins are modified after their synthesis. Therefore, a
cell-wide analysis of the population of mature proteins is expected to correlate much more closely
with cellular processes than the expressed
genes that are measured today. The emerging field of proteomics addresses the analysis of the
protein population inside the cell. Technologies such as 2D gels and mass spectrometry offer
glimpses into the world of mature proteins and their molecular interactions.

Finally, we are stepping beyond analyzing generic genomes and are asking what genetic differences
between individuals of a species are the key for predisposition to certain diseases and
effectiveness of special drugs. These questions join the fields of molecular biology, genetics, and
pharmacy in what is commonly named pharmacogenomics.

The pharmaceutical industry was the first branch of the economy to strongly engage in the new
technology combining high-throughput experimentation with bioinformatics analysis. Medicine is
following closely. Medical applications step beyond trying to find new drugs on the basis of
genomic data. The aim here is to develop more effective diagnostic techniques and to optimize
therapies. The first steps to engage computational biology in this quest have already been taken.

While driven by the biological and medical demand, computational biology will also exert a strong
impact on information technology. Since, due to their complexity, we are not able to simulate
biological processes on the basis of first principles, we resort to statistical learning and data
mining techniques, methods that are at the heart of modern information technology. The mysterious
encoding that Nature has afforded for biological signals as well as the enormous data volume
present large challenges and are continuing to have large impact on the processes of information
technology themselves.

In this theme section, we present 15 scientific progress reports on various aspects of
computational biology. We begin with two review papers, one from the biological and one from the
pharmaceutical perspective. In three further articles we present progress on solving classical
grand challenge problems in computational biology. A section of five papers deals with projects
addressing computational biology problems pertaining to current problems in the field. In a section
with three papers we discuss medical applications. The last two papers concentrate on the role of
information technology contributions, specifically, algorithms and visualization.

This theme section witnesses the activity and dynamics that the field of computational biology and
bioinformatics enjoys not only among biologists but also among computer scientists. It is the
intensive interdisciplinary cooperation between these two scientific communities that is the motor
of progress in this key technology for the 21st century.

Please contact:
Thomas Lengauer – GMD
Tel: +49 2241 14 2777
E-mail: Thomas.Lengauer@gmd.de

Articles in this section:

Introduction
6 Bioinformatics: From the Pre-genomic to the Post-genomic Era
by Thomas Lengauer

Review Papers
8 Computational Genomics: Making Sense of Complete Genomes
by Anton Enright, Sophia Tsoka and Christos Ouzounis
10 Searching for New Drugs in Virtual Molecule Databases
by Matthias Rarey and Thomas Lengauer

Classical Bioinformatics Problems
12 A High Performance Computing Network for Protein Conformation Simulations
by Marco Pellegrini
13 Ab Initio Methods for Protein Structure Prediction: A New Technique based on Ramachandran Plots
by Anna Bernasconi
15 Phylogenetic Tree Reconciliation: the Species/Gene Tree Problem
by Jean-François Dufayard, Laurent Duret and François Rechenmann

New Developments
16 Identification of Drug Target Proteins
by Alexander Zien, Robert Küffner, Theo Mevissen, Ralf Zimmer and Thomas Lengauer
18 Modeling and Simulation of Genetic Regulatory Networks
by Hidde de Jong, Michel Page, Céline Hernandez, Hans Geiselmann and Sébastien Maza
19 Bioinformatics for Genome Analysis in Farm Animals
by Andy S. Law and Alan L. Archibald
21 Modelling Metabolism Knowledge using Objects and Associations
by Hélène Rivière-Rolland, Loïc Taloc, Danielle Ziébelin, François Rechenmann and Alain Viari
22 Co-operative Environments for Genomes Annotation: from Imagene to Geno-Annot
by Claudine Médigue, Yves Vandenbrouke, François Rechenmann and Alain Viari

Medical applications
23 Arevir: Analysis of HIV Resistance Mutations
by Niko Beerenwinkel, Joachim Selbig, Rolf Kaiser and Daniel Hoffmann
24 Human Brain Informatics – Understanding Causes of Mental Illness
by Stefan Arnborg, Ingrid Agartz, Mikael Nordström, Håkan Hall and Göran Sedvall
26 Intelligent Post-Genomics
by Francisco Azuaje

General
27 Combinatorial Algorithms in Computational Biology
by Marie-France Sagot
28 Crossroads of Mathematics, Informatics and Life Sciences
by Jan Verwer and Annette Kik

Computational Genomics: Making Sense of Complete Genomes
by Anton Enright, Sophia Tsoka and Christos Ouzounis

The current goal of bioinformatics is to take the raw genetic information produced by sequencing
projects and make sense of it. The entire genome sequence should reflect the inheritable properties
of a given species. At the Computational Genomics Group of the European Bioinformatics Institute
(an EMBL outstation) in Cambridge, work is underway to tackle this vast flood of data using both
existing and novel technologies for biological discovery.

The recent sequencing of the complete genomes of many species (including a ‘draft’ human genome)
has emphasised the importance of bioinformatics research. Once the DNA sequence of an organism is
known, proteins encoded by this sequence are predicted. While some of these proteins are highly
similar to well-studied proteins whose functions are known, many will only have similarity to
another poorly annotated protein from another genome or, worse still, no similarity at all. A major
goal of computational genomics is to accurately predict the function of all proteins encoded by a
genome, and if possible determine how each of these proteins interacts with other proteins in that
organism. Using a combination of sequence analysis, novel algorithm development and data-mining
techniques, the Computational Genomics Group (CGG) is targeting research on the following fields.

Automatic Genome Annotation
Accurately annotating the proteins encoded by complete genomes in a comprehensive and reproducible
manner is important. Large scale sequence analysis necessitates the use of rapid computational
methods for functional characterisation of molecular components. GeneQuiz is an integrated system
for the automated analysis of complete genomes that is used to derive protein function for each
gene from raw sequence information in a manner comparable to a human expert. It employs a variety
of similarity search and analysis methods that entail the use of up-to-date protein and DNA
databases and creates a compact summary of findings that can be accessed through a Web-based
browser. The system applies an ‘expert system’ module to assess the quality of the results and
assign function to each gene.

Figure 1: The GeneQuiz entry page for the S. cerevisiae genome.

Assigning Proteins into Families
Clustering protein sequences by similarity into families is another important aspect of
bioinformatics research. Many available clustering techniques fail to accurately cluster proteins
with multiple domains into families. Multi-domain proteins generally perform at least two functions
that are not necessarily related, and so ideally should belong in multiple families. To this end we
have developed a novel algorithm called GeneRAGE. The GeneRAGE algorithm employs a fast sequence
similarity search algorithm such as BLAST and represents similarity information between proteins as
a binary matrix. This matrix is then processed and passed through successive rounds of the
Smith-Waterman dynamic programming algorithm, to detect inconsistencies which represent
false-positive or false-negative similarity assignments. The resulting clusters represent protein
families accurately and also contain information regarding the domain structure of multi-domain
proteins. A visualization program called xlayout based on the Fruchterman-Reingold graph-layout
optimisation algorithm has also been developed for displaying these complex similarity
relationships.

Figure 2: Protein families in the Methanococcus jannaschii genome displayed using the X-layout
algorithm.
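
The matrix-based family assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not the published GeneRAGE implementation: the function names are hypothetical, the similarity hits are invented, and the Smith-Waterman re-check of inconsistent entries is replaced by a simpler rule that keeps a pair only when the similarity was reported in both directions. Families are then read off as connected components, which also leaves out the multi-domain handling.

```python
from collections import deque


def binary_matrix(proteins, hits):
    """Build a 0/1 similarity matrix from (query, subject) search hits."""
    index = {p: i for i, p in enumerate(proteins)}
    n = len(proteins)
    matrix = [[0] * n for _ in range(n)]
    for query, subject in hits:
        matrix[index[query]][index[subject]] = 1
    return matrix


def symmetrify(matrix):
    """Keep a pair only if it was hit in both directions.

    ASSUMPTION: a crude stand-in for re-aligning inconsistent pairs
    with Smith-Waterman, as the article describes.
    """
    n = len(matrix)
    return [[1 if matrix[i][j] and matrix[j][i] else 0 for j in range(n)]
            for i in range(n)]


def families(proteins, matrix):
    """Return connected components of the similarity graph as families."""
    n = len(proteins)
    seen = [False] * n
    result = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        component = []
        while queue:
            i = queue.popleft()
            component.append(proteins[i])
            for j in range(n):
                if i != j and matrix[i][j] and not seen[j]:
                    seen[j] = True
                    queue.append(j)
        result.append(sorted(component))
    return result


# Invented example: A-B and B-C are reciprocal hits, A->D is one-directional.
proteins = ["A", "B", "C", "D"]
hits = {("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("A", "D")}
example_families = families(proteins, symmetrify(binary_matrix(proteins, hits)))
```

In this toy input the one-directional hit from A to D is discarded as inconsistent, so D ends up in a family of its own.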

Prediction of Protein Interaction
Another novel algorithm developed in the CGG group is the DifFuse algorithm. This algorithm is
based on the hypothesis that there is a selective advantage for proteins performing related
functions to fuse together during the course of evolution (eg different steps in the same metabolic
pathway). The DifFuse algorithm can detect a fused protein in one genome based on its similarity to
a complementary pair of unfused proteins in another genome. The detection of these fused proteins
allows one to predict either functional association or direct physical interaction of the un-fused
proteins. This algorithm is related to GeneRAGE in the sense that the fusion detection process is
similar to the multi-domain detection step described above. This algorithm can be applied to many
genomes for large-scale detection of protein interactions.

Figure 3: Gene Fusion – The TRP5 tryptophan synthase protein in S. cerevisiae is a fusion of two
single domains, such as TrpA and TrpB in E. coli.

Knowledge-Base Development
Databases in molecular biology and bioinformatics are generally poorly structured, many existing as
flat text files. In order to get the most out of complex biological databases these data need to be
represented in a format that is suitable for complex information extraction through a simple
querying system and that also ensures data integrity. An ontology is an exact specification of a
data model that can be used to generate such a ‘knowledge’ base. We have developed an ontology for
representation of genomic data which is used to build a database called GenePOOL incorporating
these concepts. This system stores computationally-derived information such as functional
classifications, protein families and reaction information. Database analysis is performed through
flexible and complex queries using LISP that are simply not possible through any other public
molecular biology database. Similarly, we have also developed a standard for genome annotation
called GATOS (Genome AnnoTatiOn System) which is used as a data exchange format. Work is under
development to incorporate an XML-based standard called XOL (XML Ontology Language).

Text-Analysis and Data-Mining
There is already a vast amount of data available in the abstracts of published biological text. The
MEDLINE database contains abstracts for over 9 million biological papers published worldwide since
1966. However, these data are not represented in a format suitable for large-scale information
extraction. We have developed an algorithm called TextQuest which can perform document clustering
of MEDLINE abstracts. TextQuest uses an approach that restructures these biological abstracts and
obtains the optimal number of terms that can associate large numbers of abstracts into meaningful
groups. Using a term-weighting system based on the TF.IDF family of metrics and term frequency data
from the British National Corpus, we select words that are biologically significant from abstracts
and add them to a so-called go-list. Abstracts are clustered using an unsupervised machine learning
approach, according to their sharing of words contained in the go-list. The xlayout algorithm (see
above) is then used to display the clustering results. The resulting document clusters accurately
represent sets of abstracts referring to the same biological process or pathway. TextQuest has been
applied to the development of the dorsal-ventral axis in the fruit-fly Drosophila melanogaster and
has produced meaningful clusters relating to different aspects of this developmental process.

Links:
http://www.ebi.ac.uk/research/cgg/

Please Contact:
Christos A. Ouzounis – European Molecular Biology Laboratory, European Bioinformatics Institute
Tel: +44 1223 49 46 53
E-mail: ouzounis@ebi.ac.uk
ERCIM News No. 43, October 2000 SPECIAL THEME: BIOINFORMATICS
Searching for New Drugs in Virtual Molecule Databases
by Matthias Rarey and Thomas Lengauer
The rapid progress in sequencing the human genome opens up the possibility of understanding many diseases better at the molecular level in the near future and of obtaining so-called target proteins for pharmaceutical research. Once such a target protein is identified, the search begins for molecules that specifically influence the protein's activity and are therefore considered potential drugs against the disease. At GMD, approaches to the computer-based search for new drugs (virtual screening) are being developed, some of which are already in use in industry.
Searching for New Lead Structures
The development process of a new medicine can be divided into three phases. In the first phase, the search for target proteins, the disease must be understood at the molecular-biological level well enough to know the individual proteins involved and their importance to the symptoms. Proteins are the essential functional units in our organism and perform various tasks, ranging from the chemical transformation of materials to the transportation of information. Their function is always linked with the specific binding of other molecules. As early as 100 years ago, Emil Fischer recognised the lock-and-key principle: molecules that bind to each other are complementary to each other both spatially and chemically, just as only a specific key fits a given lock (see Figure 1). If a relationship between the suppression (or reinforcement) of a protein function and the symptoms is recognised, the protein is declared to be a target protein.

In the second phase, the actual drug is developed. The aim is to find a molecule that, on the one hand, binds to the target protein, thus hindering its function, and, on the other, has the further properties demanded of drugs, for example that it is well tolerated and accumulates in high concentration at the place of action. The first step is the search for a lead structure - a molecule that binds well to the target protein and serves as a first proposal for the drug. Ideally, the lead structure binds very well to the target protein and can be modified such that the resulting molecule is suitable as a drug.

In the third phase, the drug is transformed into a medicine and is tested in several steps to see if it is well tolerated and effective. This article discusses the first step of the second phase, ie the computer-based methods of searching for new lead structures.

New Approaches to Screening Molecule Databases
The methods of searching for drug molecules can be classified according to two criteria: the existence of a three-dimensional structural model of the target protein and the size of the data set to be searched. If a structural model of the protein is available, it can be used directly to search for suitable drugs (structure-based virtual screening); ie we search for a key fitting a given lock. If a structural model is missing, the similarity to molecules that bind to the target protein is used as a measure of suitability as a drug (similarity-based virtual screening). Here we use a given key to search for fitting keys without knowing the lock. In the end, the size of the data set to be searched decides how much time can be spent on the analysis of an individual molecule. The size ranges from a few hundred preselected molecules, via large databases of several million molecules, to virtual combinatorial molecule libraries that theoretically allow the synthesis of up to billions of molecules from a few hundred molecule building blocks.

The key problem in structure-based virtual screening is the prediction of the relative orientation of the target protein and a potential drug molecule, the so-called docking problem. To solve this problem we have developed the software tool FlexX [1] in co-operation with Merck KGaA, Darmstadt, and BASF AG, Ludwigshafen. The difficulty of the docking problem arises, on the one hand, from the estimation of the free energy of a molecular complex in aqueous solution and, on the other, from the flexibility of the molecules involved. While a sufficient description of the flexibility of the protein will presumably not be possible even in the near future, the more important flexibility of the ligand is considered during a FlexX prediction. In a set of benchmark tests, FlexX is able to predict about 70 percent of the protein-ligand complexes with sufficient similarity to the experimental structure. With about 90 seconds computing time per prediction, the software is among the fastest docking tools currently available. FlexX has been marketed since 1998 and is currently being used by about 100 pharmaceutical companies, universities and research institutes.

If the three-dimensional structure of the target protein is not available, similarity-based virtual screening methods are applied, starting from a molecule with known binding properties, called the reference molecule. The main problem here is the structural alignment problem, which is closely related to the docking problem described above. Here, we have to superimpose a potential drug molecule on the reference molecule so that a maximum number of functional groups are oriented such that they can form the same interactions with the protein. Along the lines of FlexX, we have developed the software tool FlexS [2,3] for the prediction of structural alignments, with approximately the same performance with respect to computing time and prediction quality.

If very large data sets are to be searched for similar molecules, the speed of alignment-based screening does not yet suffice. The aim is to have comparison operations whose computation takes far less than one second. Today linear descriptors (bit strings or integer vectors) are usually applied to solve this problem. They store the occurrence or absence of characteristic properties of the molecules, such as specific chemical fragments or short paths in the molecule. Once such a descriptor has been determined, the linear
structure enables a very fast comparison. A considerable disadvantage, however, is that the overall structure of the molecule is represented only inadequately, and an exact match between fragments is frequently required for similarities to be recognised.

As an alternative, we have developed a new descriptor, the feature tree [4], in co-operation with SmithKline Beecham Pharmaceuticals, King of Prussia (USA). Unlike the representation using linear descriptors, this approach represents a molecule by a tree structure representing the major building blocks of the molecule. To compare two molecules, the first task is to find an assignment of those building blocks of the two molecules which might be able to occupy the same regions of the active site upon binding. With a time-efficient implementation, average comparison times of less than a tenth of a second can be achieved. This allows 400,000 molecule comparisons to be carried out overnight in a single computation on a single processor.
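For contrast with the feature tree, the linear bit-string descriptors mentioned earlier can be sketched in a few lines. This is only an illustration under stated assumptions: the fragment vocabulary is invented, and the Tanimoto coefficient is used as one common way to compare bit strings; the article does not say which measure "the standard method" uses.

```python
def fingerprint(fragments, vocabulary):
    """Encode presence or absence of each vocabulary fragment as one bit of an int."""
    bits = 0
    for position, fragment in enumerate(vocabulary):
        if fragment in fragments:
            bits |= 1 << position
    return bits


def tanimoto(a, b):
    """Bits set in both descriptors divided by bits set in either (assumed measure)."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0


# Hypothetical fragment vocabulary and molecules, for illustration only.
VOCABULARY = ["hydroxyl", "carboxyl", "amine", "phenyl", "amide", "sulfonyl"]
reference = fingerprint({"hydroxyl", "phenyl", "amide"}, VOCABULARY)
candidate = fingerprint({"hydroxyl", "phenyl", "amine"}, VOCABULARY)
similarity = tanimoto(reference, candidate)  # 2 shared fragments out of 4 present
```

The sketch also shows the weakness named in the text: the two molecules only count as similar where fragments match exactly, and nothing records how the fragments are arranged. As a check on the timing quoted for feature trees, 400,000 comparisons at a tenth of a second each take about 40,000 seconds, roughly 11 hours.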
Figure: Complex between the protein HIV-protease (shown as blue ribbon) and a known inhibitor (shown in red). HIV-protease plays a major role in the reproduction cycle of the HIV virus. Inhibitors like the one shown here are used in the treatment of AIDS.

Applying the new descriptor to benchmark data sets, we could show that in many cases the rate of active molecules in a selected subset of the data set increases compared with the standard method.

Designing New Medicines with the Computer?
Unlike many design problems from the field of engineering, for example the design of complex plants, machines or microchips, the underlying models for problems in the biochemical environment are still very inaccurate. Important quantities such as the binding energy of protein-ligand complexes can be predicted only with high error rates. In addition, the interaction of the drug with the target protein is not the only thing of importance to the development of medicines; the influence of the drug on the whole organism has to be examined as well. Even in the near future, it will not be possible to answer many of the questions arising in this context accurately by means of the computer, due to their complexity: Is the drug absorbed in the organism? Does it produce the desired effect? Which side effects are experienced, or is it possibly even toxic? Therefore a medicine originating directly from the computer will be available neither in the near nor in the distant future.

Nevertheless, the importance of the computer in drug research is increasing. The reason is the very large number of potential molecules which come into consideration as a drug for a specific application. The computer allows a reasonable pre-selection, and molecule libraries can be optimised for specific characteristics before synthesis. On the basis of experiments, the computer is able to generate hypotheses which in turn enable better planning of further experiments. In these domains, the computer has already proven to be a tool without which pharmaceutical research can no longer be imagined.

Links:
http://www.gmd.de/SCAI/

Please contact:
Matthias Rarey – GMD
Tel: +49 2241 14 2476
E-mail: matthias.rarey@gmd.de
A High Performance Computing Network
for Protein Conformation Simulations
by Marco Pellegrini
The Institute for Computational Mathematics (IMC-CNR), Pisa, with the collaboration of Prof. Alberto Segre of the University of Iowa, is now combining several core areas of expertise (sequential, distributed and parallel algorithms for numerical linear algebra and continuous optimization, methods for efficient computation of electrostatic fields) in a project on Protein Conformation Simulation.
The need for better methodologies to simulate biological molecular systems has not been satisfied by the increased computing power of the workstations available nowadays. Instead, this increased speed has pushed research into new areas and increasingly complex simulations. A clear example is the problem of protein folding. This problem is at the core of the technology of rational drug design, and its impact on future society should not be underestimated. In a nutshell, the problem is that of determining the 3-dimensional structure of a protein (and especially its biologically active sites) starting from its known DNA encoding. There are favourable opportunities for approaches employing a spectrum of competences, from strictly biological and biochemical to mathematical modeling and computer science, with special openings for the design of efficient algorithms.

Protein folding requires searching for an optimal configuration (one minimizing the energy) in exponentially large spaces of possible configurations, with a number of degrees of freedom ranging from hundreds to a few thousand. To cope with this challenge, a double action is required. On the one hand, the use of High Performance Computing Networks permits the exploitation of the intrinsic parallelism which is present in many aspects of the problem. On the other hand, sophisticated and efficient algorithms are needed to exploit fully the deep mathematical properties of the physics involved. Brute force methods are at a loss here, and the way is open for effective sampling methodologies, algorithms for combinatorial and continuous optimization, and strategies for searching over (hyper)surfaces with multiple local minima.

Innovative techniques will be used in order to push forward the state of the art: a new distributed paradigm (Nagging and Partitioning), techniques from Computational Geometry (hierarchical representations) and innovative algorithms for long range interactions.

Nagging and Partitioning
As observed above, the protein folding problem is reduced to a search problem in a vast space of conformations. A classical paradigm for finding the optimum over a complex search space is that of 'partitioning'. A master process splits the search space among several slave processes and selects the best one
from the solutions returned by the slaves.
This approach is valid when it is relatively
easy to split the workload evenly among
the slave processes and the searching
strategy of each slave is already
optimized. The total execution time is
determined by the slowest of the slaves
and, when any slave is faulty, the
computation is either blocked or returns
a sub-optimal solution. A different and
complementary approach is that of
‘nagging’. Here slaves operate on the
same search space, however each slave
uses a different search strategy. The total
time is determined by the most efficient
slave and the presence of a faulty slave
does not block the computation.
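The two paradigms can be contrasted in a toy sketch. It rests on assumptions not taken from the article: the energy function and the one-dimensional search space are invented, threads stand in for networked processes, and each "search strategy" is simply a different scan order. The sharing of partial solutions between workers is left out.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def energy(x):
    """Toy energy landscape with many local minima; the global minimum is at x = 37."""
    d = abs(x - 37)
    return (d * d) % 101 + d / 10


def partitioning(space, n_workers):
    """The master splits the space; each worker searches only its own share."""
    shares = [space[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(n_workers) as pool:
        local_bests = list(pool.map(lambda share: min(share, key=energy), shares))
    return min(local_bests, key=energy)  # master keeps the best returned solution


def nagging(space, strategies):
    """Every worker searches the whole space, each in its own order."""
    with ThreadPoolExecutor(len(strategies)) as pool:
        futures = [pool.submit(min, order(space), key=energy) for order in strategies]
        finished, _ = wait(futures, return_when=FIRST_COMPLETED)
        # The first worker to finish decides the answer. This sketch still waits
        # for the others on exit; a real system would cancel them.
        return next(iter(finished)).result()


space = list(range(100))
best_by_partitioning = partitioning(space, n_workers=4)
best_by_nagging = nagging(space, strategies=[list, reversed])
```

In the partitioning variant the master needs every share back before it can answer, which is why the slowest worker sets the total time. In the nagging variant any single finished worker suffices.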
Moreover, in a branch and bound overall
philosophy, it has been shown that the
partial solutions of any processor help to
speed up the search of others. Searching for the optimal energy conformation is a complex task where parallelism is present at many different levels, so that neither pure nagging nor pure partitioning is the best choice for all of them. A mixed strategy that uses both is more adaptive and yields better chances of success.

Figure: An image obtained with RASMOL 2.6 of Bovine HYDROLASE (SERINE PROTEINASE).

Hierarchical Representations
Techniques for representing and manipulating hierarchies of 3-dimensional objects have been developed in computational geometry for the purpose of speeding up visibility computations and collision detection, and could be adapted to the folding problem. One of the main sub-problems in finding admissible configurations is to determine the existence or absence of steric clashes among single atoms in the model. This problem is similar to that of collision detection for complex 3D models of macroscopic objects in robotics and geometric modeling. A popular technique is that of inclusion hierarchies. The hierarchy is organized as a tree, and the root is associated with a bounding volume (eg an axis-parallel box) enclosing the molecule. Recursively, we split the molecule into two groups of atoms and build the corresponding bounding boxes. The process stops at the leaves of the tree, each of which corresponds to a single atom. Such a representation speeds up steric testing since it quickly rules out many pairs of atoms that are distant in the tree hierarchy. In general many such trees may be built from the same input, and the aim is to obtain good properties such as minimizing the total surface or volume of the bounding boxes. It is interesting to note that such trees are more flexible and use less storage than uniform grid decompositions and even oct-tree data structures.

Electrostatic Long Range Interactions
A line of research at IMC has found interesting connections between electrostatic fields and integral geometric formulae, leading to new efficient and robust algorithms for computing electrostatic forces. In particular, for 3-dimensional continuous distributions of charge, there is a representation without analytic singularities for which an adaptive Gaussian quadrature algorithm converges with exponential speed. Such recent techniques benefit the calculation of molecular energy, since they obtain a good approximation without considering all pairs of atoms, thus avoiding quadratic complexity growth.

Please contact:
Marco Pellegrini – IMC-CNR
Tel: +39 050 315 2410
E-mail: pellegrini@imc.pi.cnr.it
http://www.imc.pi.cnr.it/~pellegrini
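The inclusion hierarchy described under Hierarchical Representations in the article above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: all atoms share one hypothetical radius, the split rule (median along the widest box side) is one arbitrary choice among the many possible trees, and chemically bonded neighbours are not treated specially.

```python
from dataclasses import dataclass
from math import dist
from typing import Optional, Tuple


@dataclass
class Box:
    lo: Tuple[float, ...]
    hi: Tuple[float, ...]
    atom: Optional[int] = None       # set only on leaves (index of a single atom)
    left: Optional["Box"] = None
    right: Optional["Box"] = None


def build(atoms, indices, radius):
    """Recursively split the atoms into two groups and box each group."""
    points = [atoms[i] for i in indices]
    lo = tuple(min(p[k] for p in points) - radius for k in range(3))
    hi = tuple(max(p[k] for p in points) + radius for k in range(3))
    if len(indices) == 1:
        return Box(lo, hi, atom=indices[0])
    axis = max(range(3), key=lambda k: hi[k] - lo[k])   # widest side of the box
    ordered = sorted(indices, key=lambda i: atoms[i][axis])
    half = len(ordered) // 2
    return Box(lo, hi,
               left=build(atoms, ordered[:half], radius),
               right=build(atoms, ordered[half:], radius))


def overlap(a, b):
    return all(a.lo[k] <= b.hi[k] and b.lo[k] <= a.hi[k] for k in range(3))


def collect(a, b, atoms, radius, found):
    if not overlap(a, b):
        return                       # two whole groups of atoms ruled out at once
    if a.atom is None:
        collect(a.left, b, atoms, radius, found)
        collect(a.right, b, atoms, radius, found)
    elif b.atom is None:
        collect(a, b.left, atoms, radius, found)
        collect(a, b.right, atoms, radius, found)
    elif a.atom < b.atom and dist(atoms[a.atom], atoms[b.atom]) < 2 * radius:
        found.append((a.atom, b.atom))


def find_clashes(atoms, radius):
    """Return index pairs of atoms closer than two radii (non-empty input assumed)."""
    root = build(atoms, list(range(len(atoms))), radius)
    found = []
    collect(root, root, atoms, radius, found)
    return sorted(found)


# Hypothetical coordinates: two close pairs that lie far from each other.
example_atoms = [(0, 0, 0), (1, 0, 0), (5, 0, 0), (5, 0.5, 0)]
example_clashes = find_clashes(example_atoms, radius=0.6)
```

The saving comes from the early return in `collect`: once two boxes do not overlap, none of the atom pairs beneath them is examined.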
Ab Initio Methods for Protein Structure Prediction:
A New Technique based on Ramachandran Plots
by Anna Bernasconi and Alberto M. Segre
A new technique for ab initio protein structure prediction, based on Ramachandran plots, is currently being studied at the University of Iowa, under the guidance of Prof. Alberto M. Segre, in collaboration with the Institute for Computational Mathematics, IMC-CNR, Pisa.
One of the most important open problems in molecular biology is the prediction of the spatial conformation of a protein from its primary structure, ie from its sequence of amino acids. The classical methods for structure analysis of proteins are X-ray crystallography and nuclear magnetic resonance (NMR). Unfortunately, these techniques are expensive and can take a long time (sometimes more than a year). On the other hand, the sequencing of proteins is relatively fast, simple, and inexpensive. As a result, there is a large gap between the number of known protein sequences and the number of known three-dimensional protein structures. This gap has grown over the past decade (and is expected to keep growing) as a result of the various genome projects worldwide. Thus, computational methods which may give some indication of structure and/or function of proteins are becoming increasingly important. Unfortunately, since it was discovered that proteins are capable of folding into their unique native state without any additional genetic mechanisms, over 25 years of effort has been expended on the determination of the three-dimensional structure from the sequence alone, without further experimental data. Despite the amount of effort, the protein folding problem remains largely unsolved and is therefore one of the most fundamental unsolved problems in computational molecular biology today.

How can the native state of a protein be predicted (either the exact or the approximate overall fold)? There are three major approaches to this problem: 'comparative modelling', 'threading', and 'ab initio prediction'. Comparative modelling exploits the fact that evolutionarily related proteins with similar sequences, as measured by the percentage of identical residues at each position based on an optimal structural superposition, often have similar structures. For example, two sequences that have just 25% sequence identity usually have the same overall fold. Threading methods compare a target sequence against a library of structural templates, producing a list of scores. The scores are then ranked and the fold with the best score is assumed to be the one adopted by the sequence. Finally, the ab initio prediction methods consist in modelling all the energetics involved in the process of folding, and then in finding the structure with lowest free energy. This approach is based on the 'thermodynamic hypothesis', which states that the native structure of a protein is the one for which the free energy achieves the global
minimum. While ab initio prediction is clearly the most difficult, it is arguably the most useful approach.

Figure 1: Backbone torsion angles of a protein. (Diagram labels: residue, Cα, Cβ, N, C′, O; torsion angles φ, ψ, ω, χ.)

Figure 2: The Ramachandran Plot.

There are two components to ab initio prediction: devising a scoring (ie, energy) function that can distinguish correct (native or native-like) structures from incorrect ones, and a search method to explore the conformational space. In many methods, the two components are coupled together such that a search function drives, and is driven by, the scoring function to find native-like structures. Unfortunately, this direct approach is not really useful in practice, both due to the difficulty of formulating an adequate scoring function and to the formidable computational effort required to solve it. To see why this is so, note that any fully-descriptive energy function must consider interactions between all pairs of atoms in the polypeptide chain, and the number of such pairs grows quadratically with the number of atoms, while the number of conformations to be scored grows exponentially with the number of amino acids in the protein. To make matters worse, a full model would also have to contend with vitally important interactions between the protein's atoms and the environment, the so-called 'hydrophobic effect'. Thus, in order to make the computation practical, simplifying
assumptions must necessarily be made.

Different computational approaches to the problem differ as to which assumptions are made. A possible approach, based on the discretization of the conformational space, is that of deriving a protein-centric lattice, by allowing the backbone torsion angles, phi, psi, and omega, to take only a discrete set of values for each different residue type. Under biological conditions, the bond lengths and bond angles are fairly rigid. Therefore, the internal torsion angles along the protein backbone determine the main features of the final geometric shape of the folded protein. Furthermore, one can assume that each of the torsion angles is restricted to a small, finite set of values for each different residue type. As a matter of fact, not all torsion angles are created equal. While they may feasibly take any value from -180 to 180 degrees, in nature these values do not all occur with uniform probability. This is due to the geometric constraints from neighboring atoms, which dramatically restrict the commonly occurring legal values for the torsion angles. In particular, the peptide bond is rigid (and shorter than expected) because of the partial double-bond character of the CO-NH bond. Hence the torsion angle omega around this bond generally occurs in only two conformations: 'cis' (omega about 0 degrees) and 'trans' (omega about 180 degrees), with the trans conformation being by far the more common. Moreover the other two torsion angles, phi and psi, are highly constrained, as noted by G. N. Ramachandran (1968). The (paired) values allowed for them can be selected using clustering algorithms operating in a Ramachandran plot space constructed from the protein database (http://www.rcsb.org/pdb), while discrete values for omega can be set to {0,180}. This defines a space of (2k)^n possible conformations for a protein with n amino acids (assuming each phi and psi pair is allowed to assume k distinct values). In this way, the conformational space of proteins is discretized in a protein-centric fashion.

An approximate folding is then found by searching this reduced, discrete space, guided by some scoring function. Of course, this discrete search process is an exponential one, meaning that in its most naive form it is impractical for all but the smallest proteins. Thus, the search should be made more tractable by incorporating a number of search pruning and reduction techniques, and/or by exploring the discrete space in parallel. From this naive folding, appropriate constraints (based on atomic bonds present, their bond lengths, and any Van der Waals or sulfide-sulfide interactions in the naive folding) are derived for an interior-point optimization process, which adjusts atomic positions and computes an energy value for this conformation. The computed energy value is then passed back to the discrete space and used to tune the scoring function for further parallel pruning.

Please contact:
Anna Bernasconi – IMC-CNR
Tel. +39 050 3152411
E-mail: bernasconi@imc.pi.cnr
or Alberto M. Segre – University of Iowa
E-mail: alberto-segre@uiowa.edu
Phylogenetic Tree Reconciliation: the Species/Gene Tree Problem
by Jean-François Dufayard, Laurent Duret and François Rechenmann
An algorithm to find gene duplications in phylogenetic trees in order to improve gene function inferences has been developed in a collaboration between the Helix team from INRIA Rhône-Alpes and the 'biométrie moléculaire, évolution et structure des génomes' team from the UMR 'biométrie, biologie évolutive de Lyon'. The algorithm and its software are applicable to realistic data, especially n-ary species trees and unrooted phylogenetic trees. The algorithm also takes branch lengths into account.
With appropriate algorithms, it is possible to deduce species history by studying gene sequences. Genes are subject to mutations during the evolution process, and hence the corresponding (homologous) sequences in different species differ from each other. A tree relating gene and species history can be built from the comparison of these sequences: a phylogenetic tree. Sometimes a phylogenetic tree disagrees with the species tree (constructed for example from anatomical and paleontological considerations). These differences can be explained by a gene being duplicated in a genome, and each copy having its own history. Consequently, a node in a phylogenetic tree can be the division of an ancestral species into two others, as well as a gene duplication. More precisely, in a family of homologous genes, paralogous genes have to be distinguished from orthologous genes. Two genes are orthologous if the divergence from their last common ancestor results from a speciation event, while they are paralogous if the divergence results from a duplication event.

It is essential to make the distinction because two paralogous genes are less likely to have preserved the same function than two orthologues. Therefore, if one wants to predict gene function by homology between different species, it is necessary to check whether genes are orthologous or paralogous to increase the accuracy of the prediction.

An algorithm has been developed which can deduce this information by comparing gene trees with the taxonomy of different species. Currently, the algorithm is applicable to gene families from vertebrates. It can be applied to realistic data: species trees need not be binary, and the tree structures are compared as well as their branch lengths. Finally, phylogenetic trees can be unrooted: the number of duplications is a good criterion for choosing a root, and with this method the algorithm is able to root phylogenetic trees.

Software has been developed to use the algorithm. It has been written in JAVA 1.2, and the graphical interface permits easy application to realistic data. An exhaustive species tree can be easily viewed and edited (tested with more than 10,000 leaves). Results can be modified and saved.

Links:
Action Helix: http://www.inrialpes.fr/helix

Please contact:
Jean-François Dufayard – INRIA Rhône-Alpes
Tel: +33 4 76 61 53 72
E-mail: Jean-Francois.Dufayard@inrialpes.fr
Figure: Reconciliation of a phylogenetic tree (lower-left corner) with the corresponding species tree (upper-left corner). Reconciliation produces a reconciled tree (right) which describes both gene and species history. It allows the location of gene duplications (white squares) to be deduced, which is the only information needed to distinguish orthologous from paralogous genes.
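The idea of locating duplications by comparing a gene tree with a species tree can be illustrated with the classical last-common-ancestor mapping for rooted binary trees. This is a swapped-in textbook technique, not the authors' algorithm, which also handles n-ary species trees, unrooted gene trees and branch lengths. The species names and trees below are invented for the example.

```python
def ancestors(species, parent):
    """Path from a species up to the root of the species tree."""
    path = [species]
    while species in parent:
        species = parent[species]
        path.append(species)
    return path


def last_common_ancestor(a, b, parent):
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("species are not in the same tree")


def reconcile(gene_tree, parent, duplications):
    """Map a gene tree node to a species tree node; record duplication nodes.

    A gene tree is either a species name (a gene sampled from that species)
    or a pair (left, right) of gene subtrees.
    """
    if isinstance(gene_tree, str):
        return gene_tree
    left, right = gene_tree
    mapped_left = reconcile(left, parent, duplications)
    mapped_right = reconcile(right, parent, duplications)
    here = last_common_ancestor(mapped_left, mapped_right, parent)
    if here in (mapped_left, mapped_right):
        duplications.append(gene_tree)   # a child maps to the same species node
    return here


# Hypothetical species tree ((human, mouse) mammals, chicken) amniotes.
SPECIES_PARENT = {"human": "mammals", "mouse": "mammals",
                  "mammals": "amniotes", "chicken": "amniotes"}
example_gene_tree = (("human", "mouse"), ("human", "chicken"))
example_duplications = []
example_root = reconcile(example_gene_tree, SPECIES_PARENT, example_duplications)
```

In this example the root of the gene tree maps to the same species node as its right child, so it is flagged as a duplication. The two human genes are therefore paralogous, while the human and mouse genes in the left subtree are orthologous.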
Identification of Drug Target Proteins
by Alexander Zien, Robert Küffner, Theo Mevissen, Ralf Zimmer and Thomas Lengauer
As ever more knowledge is accumulated on various aspects of the molecular machinery underlying the biological processes within organisms, including humans, the question of how to exploit this knowledge to combat diseases becomes increasingly urgent. At GMD, scientists work to utilize protein structures, expression profiles as well as metabolic and regulatory networks in the search for target proteins for pharmaceutical applications.
Huge amounts of heterogeneous data are the given protein sequence to a known Nowadays, gene expression levels are
pouring out of the biological labs into fold by aligning the sequence onto the most frequently obtained by DNA chips,
molecular biology databases. Most known protein structure. Thus, threading micro-arrays or SAGE (serial analysis of
popular are the sequencing projects that utilizes the available knowledge directly gene expression). These technologies are
decipher complete genomes and uncover in the form of the structures that are designed for high throughput and are
their protein complements. Many more already experimentally resolved. Another already capable of monitoring the
projects are under way, aiming e.g. at advantage of threading in comparison to complete gene inventory of small
resolving the yet unknown protein folds ab initio methods is the low demand of organisms (or a large part of all human
or at collecting human single nucleotide computing time. This is especially true genes). We apply statistics and machine
polymorphisms. In a less coordinated for 123D, a threader developed at GMD learning techniques in order to normalize
way, many labs measure gene expression that models pairwise amino acid contacts the resulting raw measurement data, to
levels inside cells in numerous cell states. in a way that allows alignment identify differentially regulated genes and
Last but not least there is an enormous computation by fast dynamic to clusters of cell states. Subsequently,
amount of unstructured information programming. The objective function for we apply statistics and machine learning
hidden in the plethora of scientific the optimization includes potentials techniques in order to identify
publications, as is documented in accounting for chemical environments differentially regulated genes, clusters of
PubMed, for instance. Each of these within the protein structure, and cell states etc.
sources of data provides valuable clues membership in secondary structure
to researchers that are interested in elements as well as amino acid Pathway Modeling
molecular biological problems. The substitution scores and gap penalties, all While a large part of the current effort is
project TargId at GMD SCAI focuses on carefully balanced by a systematic focused on inferring high-level structures
methods to address the arguably most procedure. A second threading program from gene expression data, much is
urgent problem: the elucidation of the origins and mechanisms of human diseases, culminating in the identification of potential drug target proteins.

TargId responds to the need for bioinformatics support for this task. The goal of the project is to develop methods that extract useful knowledge from the raw data and help to focus on the relevant items of data. The most sophisticated aspect is the generation of new insights through the combination of information from different sources. Currently, our TargId methodology builds on three main pillars: protein structure prediction, expression data analysis and metabolic/regulatory pathway modeling.

Protein Structure Prediction
Knowledge on the three-dimensional structure (fold) of a protein provides clues on its function and aids in the search for inhibitors and other drugs. Threading is an approach to structure prediction which essentially assesses the compatibility of

programmed at GMD, called RDP, utilizes more computation time than 123D in order to optimize full pair interaction contact potentials to yield refined alignments.

Expression Data Analysis
Data on expressed genes comes in several different flavors. The historically first method is the generation of ESTs (expressed sequence tags), i.e. low-quality sequence segments of mRNA, either proportional to its cellular abundance or enriched for rare messages. While ESTs are superseded by more modern methods for the purpose of expression level measurements, they are still valuable for finding new genes and resolving gene structures in the genomic DNA. Thus, we have implemented a variant of 123D that threads ESTs directly onto protein structures, thereby translating nucleotide codons into amino acids on the fly. This program, called EST123D, is useful for proteins that are not yet characterized other than by ESTs.

already known on the underlying genetic networks. Several databases that are available on the internet document metabolic relations in machine-readable form. The situation is worse for regulatory pathways; most of this knowledge is still hidden in the literature. Consequently, we have implemented methods that extract additional protein relations from article abstracts and model them as Petri nets. The resulting graphs can be restricted to species, tissues, diseases or other areas of interest. Tools are under development for viewing and editing using standard graph and Petri net packages. The generated networks can provide overviews that cross the boundaries of competence fields of human experts.

Further means are necessary to allow for more detailed analyses. We can automatically extract pathways from networks that are far too large and complicated to lend themselves to easy interpretation. Pathways are biologically meaningful subgraphs, e.g. signaling
ERCIM News No. 43, October 2000 - SPECIAL THEME: BIOINFORMATICS
cascades or metabolic pathways that account for supply and consumption of any intermediate metabolites. Another method conceived in TargId, called DMD (differential metabolic display), allows for comparing different systems (organisms, tissues, etc.) on the level of complete pathways rather than mere interactions.

Figure: In the TargId project, new bioinformatics methods combine heterogeneous information in the search for drug target proteins. (Diagram labels: metabolic/regulatory network, extracted pathways, structure/function prediction, gene expression data, pathway scoring, predicted target protein.)

Bringing it All Together ...
Each of the methods described above can provide valuable clues pointing to target proteins. But the crux lies in their clever combination, interconnecting data from different sources. In recent work, we have shown that in real life situations clustering alone may not be able to reconstruct pathways from gene expression data. Instead of searching for meaning in clusters, we invented an approach that proceeds inversely: First, a set of pathways is extracted from a protein/gene network, using the methods described above. Then, these pathways are scored with respect to gene expression data. The restriction to pathways prevents us from considering unreasonable groupings of proteins, while it still allows for incorporating and testing hypotheses. E.g., pathways can be constructed from interactions that are observed in different tissues or species. The expression data provide an orthogonal view on these interactions and can thus be used to validate the hypotheses.

Structure prediction can aid in this process at several stages. First, uncharacterized proteins can tentatively be embedded into known networks based on predicted structure and function. Second, structural information can be integrated into the pathway scoring function. Finally, when a target protein is identified, its structure will be of utmost interest for further investigations.

It can be imagined that target finding can gain from broadening the basis for the search to also include, e.g., phylogenetic profiles, post-translational modifications, genome organization or polymorphisms. As these fields are still young and in need of further progress, it is clear that holistic target finding is only in its infancy.

Links:
http://cartan.gmd.de/TargId/

Please contact:
Alexander Zien or Ralf Zimmer – GMD
Tel: + 49 2241 14-2563 or -2818
E-mail: Alexander.Zien@gmd.de or Ralf.Zimmer@gmd.de
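The inverse approach described in this article (extract pathways from a network first, then score them against expression data) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy network, the expression profiles, and in particular the scoring function (mean pairwise Pearson correlation), which the article does not specify.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy network of directed protein/gene interactions.
NETWORK = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

# Hypothetical expression profiles, one value per experiment.
EXPRESSION = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [4.0, 3.0, 2.0, 1.0],
    "D": [1.5, 2.5, 3.5, 4.5],
}

def simple_paths(network, start, end, path=()):
    """Enumerate all cycle-free paths from start to end (candidate pathways)."""
    path = path + (start,)
    if start == end:
        yield path
        return
    for nxt in network[start]:
        if nxt not in path:
            yield from simple_paths(network, nxt, end, path)

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pathway_score(path, expression):
    """Assumed score: mean pairwise correlation of the pathway members."""
    pairs = list(combinations(path, 2))
    return sum(pearson(expression[p], expression[q]) for p, q in pairs) / len(pairs)

# Rank the candidate pathways from A to D by their expression coherence.
ranked = sorted(
    simple_paths(NETWORK, "A", "D"),
    key=lambda p: pathway_score(p, EXPRESSION),
    reverse=True,
)
for path in ranked:
    print(" -> ".join(path), round(pathway_score(path, EXPRESSION), 3))
```

Because only paths that exist in the network are scored, groupings of proteins with no supporting interactions are never considered, which is the point the text makes about restricting the search to pathways.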
Modeling and Simulation of Genetic Regulatory Networks
by Hidde de Jong, Michel Page, Céline Hernandez, Hans Geiselmann and Sébastien Maza
In order to understand the functioning of an organism, the network of interactions between genes, mRNAs, proteins, and other molecules needs to be elucidated. Since 1999, researchers in the bioinformatics group at INRIA Rhône-Alpes have been developing a computer tool for the modeling and simulation of genetic regulatory networks in collaboration with molecular biologists.

The sequencing of the entire genome of prokaryotic and eukaryotic organisms has been completed in the past few years, culminating in the presentation of a working draft of the human genome last June. The analysis of these huge amounts of data involves such tasks as the prediction of folding structures of proteins and the identification of genes and regulatory signals. It is clear, however, that the structural analysis of sequence data needs to be complemented with a functional analysis to elucidate the role of genes in controlling fundamental biological processes.

One of the central problems to be addressed is the analysis of genetic regulatory systems controlling the spatiotemporal expression of genes in an organism. The structure of these regulatory systems can be represented as a network of interactions between genes, proteins, metabolites, and other small molecules. The study of genetic regulatory networks will contribute to our understanding of complex processes like the development of a multicellular organism.

In addition to new experimental tools permitting the expression level to be rapidly measured in a massively parallel way, computer tools for the modeling, visualization, and simulation of genetic regulatory systems will be indispensable. Most systems of interest involve many genes connected through cascades and positive and negative feedback loops, so that an intuitive understanding of their dynamics is hard to obtain. As a consequence of the lack of quantitative information on regulatory interactions, traditional modeling and simulation techniques are usually difficult to apply.

To counter this problem, we have developed a method for the qualitative simulation of regulatory systems based on ideas from mathematical biology and artificial intelligence.

The method describes genetic regulatory systems by piecewise-linear differential equations with favourable mathematical properties. The phase space is subdivided into volumes in which the equations reduce to simple, linear and orthogonal differential equations imposing strong constraints on the local behavior of the system. By analyzing the possible transitions between volumes, an indication of the global behavior of the system can be obtained. In particular, the method determines steady-state volumes and volume cycles that are reachable from an initial volume. The steady-state volumes and volume cycles correspond to functional states of the regulatory system, for instance a response to a physiological perturbation of the organism (a change in temperature or nutrient level).

The above method has been implemented in Java 1.2 in a program called GNA (Genetic Network Analyzer). GNA reads and parses input files with the equations and inequalities specifying the model of the system as well as the initial volume. An inequality reasoner iteratively generates the volumes that are reachable from the initial volume through one or more transitions. The output of the program consists of the graph of all reachable volumes connected by transitions. A graphical interface facilitating the interaction of the user with the program is under development. At present, a visualization module has been realized by which a network of interactions between genes can be displayed, as well as the volume transition graph resulting from the simulation. In addition, the user can focus upon particular paths in the graphs to study the qualitative temporal evolution of gene product concentrations in more detail (see figures).

Figure: Three stages in the simulation process as seen through the GNA user interface. (a) The network of genes and regulatory interactions that is transformed into a mathematical model. (b) The volume transition graph resulting from the simulation. (c) A closer look at the path in the volume transition graph selected in (b). The graph shows the qualitative temporal evolution of two protein concentrations.

GNA has been tested using genetic regulatory networks described in the literature, such as the example of lambda phage growth control in the bacterium Escherichia coli. Simulation experiments with random regulatory networks have shown that, with the current implementation, our method remains tractable for systems of up to 18 genes involved in complex feedback loops.

We plan GNA to evolve into an environment for the computer-supported analysis of genetic regulatory networks, covering a range of activities in the design and testing of models. These activities, such as the validation of hypothesized models of regulatory networks by means of experimental data, will be accessible through a user-friendly graphical interface. In parallel, we will apply the method to the analysis of bacterial regulatory systems in collaboration with biologists at the Université Joseph Fourier in Grenoble.

Links:
HELIX project: http://www.inrialpes.fr/helix/

Please contact:
Hidde de Jong – INRIA Rhône-Alpes
Tel: +33 4 76 61 53 35
E-mail: Hidde.de-Jong@inrialpes.fr
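The idea of generating a volume transition graph from an initial volume can be sketched in a few lines of Python. This is a simplified illustration, not GNA's inequality reasoner: it assumes a hypothetical two-gene mutual-inhibition network with made-up parameters, uses numeric values where GNA works with inequalities, ignores behavior on the threshold planes themselves, and does not search for volume cycles. In each volume the piecewise-linear equation dx/dt = synthesis - gamma * x drives x towards the focal value synthesis / gamma, and a transition to a neighboring volume is possible whenever that focal value lies beyond a threshold.

```python
from collections import deque

# Hypothetical model: genes a and b repress each other (all values assumed).
GENES = ["a", "b"]
THRESHOLDS = {"a": [1.0], "b": [1.0]}  # one threshold splits each axis in two
KAPPA = {"a": 2.0, "b": 2.0}           # synthesis rate when not repressed
GAMMA = {"a": 1.0, "b": 1.0}           # degradation rate constants

def synthesis_rate(gene, volume):
    """Step-function regulation: a gene is expressed only while its repressor is low."""
    repressor = "b" if gene == "a" else "a"
    repressor_is_low = volume[GENES.index(repressor)] == 0
    return KAPPA[gene] if repressor_is_low else 0.0

def interval_index(gene, value):
    """Index of the threshold interval that contains a concentration value."""
    return sum(value > t for t in THRESHOLDS[gene])

def successors(volume):
    """Neighboring volumes towards which trajectories in this volume can move."""
    focal = tuple(
        interval_index(g, synthesis_rate(g, volume) / GAMMA[g]) for g in GENES
    )
    result = []
    for i, (here, target) in enumerate(zip(volume, focal)):
        if target != here:
            step = 1 if target > here else -1
            result.append(volume[:i] + (here + step,) + volume[i + 1:])
    return result  # empty list: the focal point lies inside the volume

def transition_graph(initial):
    """Breadth-first generation of all volumes reachable from the initial one."""
    graph = {}
    queue = deque([initial])
    while queue:
        volume = queue.popleft()
        if volume not in graph:
            graph[volume] = successors(volume)
            queue.extend(graph[volume])
    return graph

graph = transition_graph((0, 0))
steady = sorted(v for v, nxt in graph.items() if not nxt)
print(graph)
print("steady-state volumes:", steady)
```

Starting with both proteins at low concentration, the sketch reaches two steady-state volumes, one with a high and b low and one with b high and a low, which is the bistable behavior expected of a mutual-inhibition circuit.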
Bioinformatics for Genome Analysis in Farm Animals
by Andy S. Law and Alan L. Archibald
The Bioinformatics Group at the Roslin Institute develops tools and resources for farm animal genome analysis. These encompass the databases, analytical and display tools required for mapping complex genomes. The World Wide Web is used to deliver the resources to users.

The Bioinformatics Group at the Roslin Institute aims to provide access to appropriate bioinformatics tools and resources for farm animal genome analysis. Genome research in farm animals is largely concerned with mapping genes that influence economically important traits. As yet, there are no large-scale genome sequencing activities. The requirements are for systems to support genetic (linkage), quantitative trait locus (QTL), radiation hybrid and physical mapping and to allow data sharing between research groups distributed world-wide.

resSpecies – a Resource for Linkage and QTL Mapping
Genetic linkage maps are constructed by following the co-segregation of marker alleles through multi-generation pedigrees. In quantitative trait locus (QTL) mapping, the performance of the animals is also recorded. Both QTL and linkage mapping studies require databases to store and share the experimental observations. Data sharing between research groups is particularly valuable in linkage mapping. Only by pooling data from the collaborating groups can comprehensive maps be built.

We developed resSpecies to meet this need. It uses a relational database management system (RDBMS - INGRES) with a web-based interface implemented using Perl and Webintool (Hu et al. 1996. WebinTool: A generic Web to database interface building tool. Proceedings of the 7th International Conference and Workshop on Database and Expert Systems (DEXA 96), Zurich, September 9-13, 1996 pp 285-290). This makes international collaborations simple to effect.

The relational design ensures that complicated pedigrees can be represented relatively simply. Populations are defined as groups of individuals. Within resSpecies, access is granted to individual contributors/collaborators on each population separately. The database stores details of markers and alleles. Genotypes may be submitted through a simple web interface that infers missing genotypes, checks for Mendelian inheritance and rejects data that contains inheritance errors. Using a series of simple query forms, data can be extracted in the correct format expected by a number of popular genetic analysis algorithms (eg crimap). This eliminates the possibility of cryptic typographical errors occurring and ensures that the most up-to-date data is available at all times.

resSpecies is used to support Roslin’s internal programmes and several international collaborative linkage and QTL mapping projects.

ARKdb – a Generic Genome Database
Scientists engaged in genome mapping research also need access to contemporary summaries of maps and other genome-related data.
We have developed a relational (INGRES) genome database model (ARKdb) to handle these data, along with web-based tools for data entry and display. The information stored in the ARKdb databases includes linkage and cytogenetic map assignments, polymorphic marker details, PCR primers, and two point linkage data. Each observation is attributed to a reference source. Hot links are provided to other data sources eg sequence databases and Medline (Pubmed).

The ARKdb database model has been implemented for data from pigs, chickens, sheep, cattle, horses, deer, turkeys, cats, salmon and tilapia. The full cluster of ARKdb databases is mounted on the genome server at Roslin with subsets at Texas A+M and Iowa State Universities. We have also developed The Comparative Animal Genome database (TCAGdb) to capture evidence that specific pairs of genes are homologous. We are developing automated Artificial Intelligence methods to evaluate homology data.

The Anubis Map Viewer
Visualisation is the key to understanding complex data and tools that transform raw data into graphical displays are invaluable. The Anubis map viewer was the first genome browser to be operable as a fully-fledged GUI (Graphical User Interface) over the WWW (URL http://www.ri.bbsrc.ac.uk/anubis). It is used as the map viewer for ARKdb databases and the INRA BOVMAP database. We have recently launched a prototype Java version of Anubis - Anubis4 (http://www.ri.bbsrc.ac.uk/arkdb/newanubis/).

Figure: A comparison of physical map and genetic maps of pig chromosome 8 with a genetic map of cattle chromosome 6. The maps are drawn ‘on-the-fly’ by the Anubis map viewer using data held in the ARKdb pig and cattle genome databases.

Future Activities
We are developing systems to handle the data from radiation hybrid, physical (contig) mapping, expression profiling (microarray) and expressed sequence tag (EST) experiments. Exploitation of the wealth of information from the genomes of human and model organisms is critical to farm animal genome research. Therefore, we are exploring ways of improving the links and interoperability with other information systems. Our current tools and resources primarily address the requirements for data storage, retrieval and display. In the future we need to fully integrate analytical tools with the databases and display tools.

The Roslin Bioinformatics Group has grown to eleven including software developers, programmers and database curators. In the past we have received support from the European Commission and Medical Research Council. The group is currently funded by grants from the UK’s Biotechnology and Biological Sciences Research Council.

Links:
http://www.roslin.ac.uk/bioinformatics/

Please contact:
Alan L. Archibald – Roslin Institute
Tel: +44 131 527 4200
E-mail: alan.archibald@bbsrc.ac.uk
Modelling Metabolism Knowledge using Objects and Associations
by Hélène Rivière-Rolland, Loïc Taloc, Danielle Ziébelin, François Rechenmann and Alain Viari
A knowledge base to represent metabolism data has been developed by the Helix research team at INRIA Rhône-Alpes. This base provides access to information on chemical compounds, biochemical reactions, enzymes, genes and metabolic pathways from fully sequenced micro-organisms. The model has been implemented by using an object/association technology developed at INRIA. Besides its use as a general repository, the base may have applications in metabolic simulations and pathway reconstruction in newly sequenced genomes.

The cellular metabolism can be defined as the panel of all biochemical reactions occurring in the cell. It consists of molecular synthesis (anabolism) and degradation (catabolism) necessary for cell growth and division. These reactions drive the energetic processes, the synthesis of structural and catalytic components of the cell and the elimination of cellular wastes.

A fairly large amount of metabolic data is readily available, either in the literature or in public data banks (eg the KEGG project: http://star.scl.genome.ad.jp/kegg) and this information will probably grow in the near future due to the development of new ‘large scale’ experimental technologies like DNA-arrays. Therefore, there is a need to organise this data in a rational and formalised way, ie to model our knowledge of metabolic data. The first goal is of course the storage and recovery of pertinent information. The complexity of this kind of data, and in particular the fact that some information is held in the relationships between the biological entities rather than in the entities themselves, makes their selection and recovery difficult. Moreover, our knowledge in this area is often incomplete (elements are missing or pathways may be totally unknown in a newly sequenced organism). A challenge is therefore to cope with this partial information and to develop databases that could provide some inference mechanisms to assist the discovery process. Finally, another challenge is to link these data to other relevant genomic and biochemical information like protein structure, regulation of gene expression, whole genome organisation (eg synteny) and evolution.

Following the pioneering work of P. Karp and M. Riley with the Eco-Cyc system (http://ecocyc.PangeaSystems.com/ecocyc) we attempted to develop a knowledge base of metabolic data. We wanted to experiment with a different representation model in which associations are explicitly represented as entities. To this purpose, we used the AROM system developed at INRIA (http://www.inrialpes.fr/romans/pub/arom). The main originality of AROM is the presence of two complementary entities of representation: classes and associations. As in any object-oriented system, a class represents a set of objects described by slots; but, in AROM, such a slot cannot refer to another object. This connection is made by means of associations, each of which denotes a set of tuples of objects (not necessarily only two objects, so associations are n-ary). Like objects, tuples have their own slots and, like classes, associations can be organised in hierarchies, therefore allowing for the usual inheritance and specialisation mechanisms. The explicit representation of n-ary associations turned out to be very useful for representing biological data. For instance, it makes the representation of alternative substrates of a metabolic reaction a much easier task.

Figure: Graphical representation of a simple metabolic pathway (sphingophospholipid biosynthesis): the nodes represent chemical compounds (eg N-acylsphingosine) and the edges represent the biochemical reactions which transform these compounds. Each edge is labelled by a number (E.C number) which identifies the enzyme that catalyses the reaction. The figure has been automatically generated using the data stored in the database.

After implementing the data model in AROM, we extracted the metabolic data from public sources (mostly KEGG) by using parsers and Unix shell scripts. Coherence of sequence data between data banks has been checked by using home-made sequence alignment programs and/or Blast. At the present time we are developing several graphical interfaces to this base. One will be devoted to querying the knowledge base. Another interface will be devoted to the automatic graphical representation of pathways, which are complex non-planar directed graphs (see Figure). At the present time the whole system (AROM and the interfaces) is implemented in JAVA and we plan to put it into play through a web applet-server in the near future.

Links:
Action Helix: http://www.inrialpes.fr/helix.html

Please contact:
Alain Viari – INRIA
Tel: +33 4 76 61 54 74
E-mail: alain.viari@inrialpes.fr
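The object/association idea can be made concrete with a small Python sketch. It is not AROM syntax and does not use the AROM API; the class names, slot names, compound names and EC numbers are all placeholders invented for illustration. Objects carry only plain slots, links between objects live in the tuples of an n-ary association, and each tuple carries slots of its own.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Compound:
    name: str          # plain slot; never a reference to another object

@dataclass(frozen=True)
class Enzyme:
    ec_number: str     # plain slot

@dataclass(frozen=True)
class Catalysis:
    """One tuple of a ternary association (substrate, product, enzyme)."""
    substrate: Compound
    product: Compound
    enzyme: Enzyme
    reversible: bool = False   # slot carried by the tuple itself

# Placeholder data: S1 and S2 are alternative substrates of the same reaction.
s1, s2, p, q = (Compound(n) for n in ("S1", "S2", "P", "Q"))
e1, e2 = Enzyme("EC 0.0.0.1"), Enzyme("EC 0.0.0.2")
CATALYSES = [
    Catalysis(s1, p, e1),
    Catalysis(s2, p, e1),
    Catalysis(p, q, e2, reversible=True),
]

def alternative_substrates(tuples, product, enzyme):
    """All substrates that the given enzyme converts into the given product."""
    return sorted(
        t.substrate.name for t in tuples
        if t.product == product and t.enzyme == enzyme
    )

def pathway_graph(tuples):
    """Directed graph: compound -> list of (product, EC number) edges."""
    graph = {}
    for t in tuples:
        graph.setdefault(t.substrate.name, []).append(
            (t.product.name, t.enzyme.ec_number)
        )
    return graph

print(alternative_substrates(CATALYSES, p, e1))
print(pathway_graph(CATALYSES))
```

Listing alternative substrates is a simple filter over the association tuples, and the same tuples yield the compound graph with enzyme-labelled edges that the pathway figure depicts.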
Co-operative Environments for Genomes Annotation:
from Imagene to Geno-Annot
by Claudine Médigue, Yves Vandenbrouke, François Rechenmann and Alain Viari
‘Imagene’ is a co-operative computer environment for the annotation and analysis of genomic sequences developed in collaboration between INRIA, Université Paris 6, Institut Pasteur and the ILOG company. The first version of this software was dedicated to bacterial chromosomes. Its capabilities are currently being extended to handle both prokaryotic and eukaryotic data and to link pure genomic data to ‘post-genomic’ data, particularly metabolic and gene expression data.

In the context of large-scale genomic sequencing projects the need is growing for integration of specific sequence analysis tools within data management systems. With this aim in view, we have developed the Imagene co-operative computer environment dedicated to automatic sequence annotation and analysis (http://abraxa.snv.jussieu.fr/imagene). In this system, biological knowledge produced in the course of a genome sequencing project (putative genes, regulatory signals, etc) together with the methodological knowledge, represented by an extensible set of sequence analysis methods, are uniformly represented in an object oriented model.

Imagene is the result of a five-year collaboration between INRIA, Université Paris 6, the Institut Pasteur and the ILOG company. The system has been implemented by using an object oriented model and a co-operative solving engine provided by ILOG. In Imagene, a global problem (task) is solved by successive decompositions into smaller sub-tasks. During the execution, the various sub-tasks are graphically displayed to the user. In that sense, Imagene is more transparent to the user than a traditional menu-driven package for sequence analysis since all the steps in the resolution are clearly identified. Moreover, once a task has been solved, the user can restart it at any point; the system then keeps track of the different versions of the execution. This allows several hypotheses to be maintained in parallel during the analysis. Imagene also provides a user interface to display, on the same picture, the results produced by one or several strategies (see Figure). Due to the homogeneity of the whole software, this display is fully interactive and the graphical objects are directly connected to their database counterpart.

Figure: Imagene view of a fragment of the B. subtilis chromosome: The display superimposes the output of several methods. Red boxes represent putative protein coding regions (genes); the blue boxes represent the result of a data bank similarity scan (here the Blastx program); the yellow curve represents the coding probability as evaluated by using a Markov chain. The translated protein sequence of the currently selected gene is shown in the insert.

Imagene has been used within several bacterial genome sequencing projects (Bacillus subtilis and Mycoplasma pulmonis) and has proved to be particularly useful to pinpoint sequencing errors and atypical genes. However, this first version suffers from several drawbacks. First, it was limited to the representation of prokaryotic data only; second, the development tools were commercial, thus giving rise to difficulties in its diffusion; last, it was designed to handle pure sequence data from a single genome. In order to overcome these limitations, we undertook a new project (Geno-Annot) through a collaboration between INRIA, the Institut Pasteur and the Genome-Express biotech company. As a first step, the data model was extended to eukaryotes and completely re-implemented using the AROM system developed at INRIA (http://www.inrialpes.fr/romans/pub/arom). We are now in the process of re-designing the task-engine and the graphical user interfaces in JAVA. Finally, our ultimate goal will be to integrate Geno-Annot within a more general environment (called Geno-*) in order to fully link all the pieces of genomic information together (ie sequence data, metabolism, gene expression etc). Geno-Annot is a two-year project that started in September 1999.

Links:
Action Helix: http://www.inrialpes.fr/helix.html
Imagene: http://abraxa.snv.jussieu.fr/imagene

Please contact:
Alain Viari – INRIA
Tel: +33 4 76 61 54 74
E-mail: alain.viari@inrialpes.fr
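The task decomposition that the article attributes to Imagene can be pictured with a short Python sketch. It is a hypothetical illustration only: the `Task` class, the task names and the toy sequence operations are assumptions, not the ILOG solving engine or Imagene's actual methods. A task either performs an action or delegates to its sub-tasks, and every step is recorded in a trace so that the whole resolution stays visible to the user.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Task:
    name: str
    action: Optional[Callable[[dict], dict]] = None   # set on leaf tasks only
    subtasks: List["Task"] = field(default_factory=list)

    def run(self, state, trace, depth=0):
        """Solve the task, appending one indented line per step to trace."""
        trace.append("  " * depth + self.name)
        if self.action is not None:
            return self.action(state)
        for sub in self.subtasks:
            state = sub.run(state, trace, depth + 1)
        return state

def normalise(state):
    """Toy leaf method: upper-case the sequence."""
    return {**state, "seq": state["seq"].upper()}

def locate_start_codons(state):
    """Toy leaf method: record every position where ATG occurs."""
    seq = state["seq"]
    starts = [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]
    return {**state, "starts": starts}

annotate = Task("annotate sequence", subtasks=[
    Task("normalise", action=normalise),
    Task("find genes", subtasks=[
        Task("locate start codons", action=locate_start_codons),
    ]),
])

trace = []
result = annotate.run({"seq": "ccatggatgaa"}, trace)
print("\n".join(trace))
print(result["starts"])
```

Because each step returns a new state rather than modifying the old one, intermediate states could be kept and a task restarted from any of them, which is one simple way to hold several versions of an execution side by side.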
Arevir: Analysis of HIV Resistance Mutations
by Niko Beerenwinkel, Joachim Selbig, Rolf Kaiser and Daniel Hoffmann
To develop tools that assist medical practitioners in finding an optimal therapy for HIV-infected patients – this is the aim of a collaboration funded by Deutsche Forschungsgemeinschaft that has been started this year by researchers at GMD, the University of Cologne, CAESAR, the Center of Advanced European Studies and Research, Bonn, and a number of cooperating university hospitals in Germany.
The Human Immunodeficiency Virus therapy changes after resistance testing. to understand these connections and that
(HIV) causes the Acquired There are several possible reasons for contribute directly to therapy
Immunodeficiency Syndrome (AIDS). therapy failure in this situation: the optimization. In a first step a database is
Currently, there are two types of drugs in occurrence of an HIV-strain resistant to set up in collaboration with project