Knowledge Discovery and Data-Mining in Biological Databases


Knowledge Discovery and Data-Mining in
Biological Databases
Vladimir Brusic
and John Zeleznikow
The Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia.
School of Computer Science and Computer Engineering, La Trobe University, Bundoora,
Victoria, Australia.
Tutorial Notes for ISMB'98
The Sixth International Conference on Intelligent Systems for
Molecular Biology
Montr al, Qu bec, Canada
Sunday, June 28th - Wednesday, July 1st 1998
1. Overview
2. Introduction to KDD
2.1. Background
2.1.1. What is Knowledge?
2.1.2. What is KDD?
2.1.3. Why do we need KDD?
2.1.4. KDD process
2.1.5. Data-mining
2.2. KDD: An Interdisciplinary Topic
2.3. Basic Definitions
2.4. Data-Mining Step of the KDD Process
2.5. Data-Mining Methods
2.5.1. Practical goals of data-mining
2.5.2. Classification
2.5.3. Regression
2.5.4. Clustering
2.5.5. Summarisation
2.5.6. Dependency modelling
2.5.7. Change and deviation detection
2.6. Components of Data-Mining Algorithms
2.6.1. Model representation
2.6.2. Model-evaluation criteria
2.6.3. Search method
2.7. Data-Mining Tools: An Overview
2.7.1. Decision trees and rules
2.7.2. Nonlinear regression and classification methods
2.7.3. Example-based methods
2.7.4. Probabilistic graphic dependency models
2.7.5. Relational learning models
2.8. Comparative Notes on Data-Mining Methods
3. Domain Concepts from Biological Data and Databases
3.1. Bioinformatics
3.2. What do we need to know about biological data?
3.2.1. Complexity underlying biological data
3.2.2. Fuzziness of biological data
3.2.3. Biases and misconceptions
3.2.4. Noise and errors
3.2.5. How to design the KDD process?
3.3. Database Notes
3.3.1. Database development trends
3.3.2. Intelligent information systems
3.4. Database-Related Issues in Biology
3.4.1. Integration of heterogeneous databases
3.4.2. Flexible access to biological databases
4. KDD and Data-Mining Developments in Biology
4.1. Annotation of masses of data
4.2. Structural and Functional Genomics
4.3. Protein structure prediction and modelling
4.4. Analysis of biological effects
4.5. Identification of distantly related proteins
4.6. Practical applications
5. Case Studies: Application of KDD in Immunology
5.1. Background
5.1.1. Biology
5.1.2. MHC/peptide binding problem
5.1.3. Peptide processing
5.1.4. Models
5.1.5. Data and analysis
5.1.6. Predictions
5.1.7. Data errors
5.2. Data-Mining Tools
5.2.1. Selection of the search method
5.2.2. Evaluation of performances of predictive models
5.3. PERUN
5.4. Study of Human TAP Transporter
6. Summary Notes
7. Authors' Affiliations
8. References
1. Overview
Biological databases continue to grow rapidly. This growth is reflected in increases in both
the size and complexity of individual databases as well as in the proliferation of new
databases. A huge body of data is thus available for the extraction of high-level
information, including the development of new concepts, concept interrelationships and
interesting patterns hidden in the databases. Knowledge Discovery in Databases (KDD) is
an emerging field combining techniques from Databases, Statistics and Artificial
Intelligence, which is concerned with the theoretical and practical issues of extracting high
level information (or knowledge) from volumes of low-level data. Examples of high-level
information include forms that are more compact (e.g. short reports), more
abstract (e.g. descriptive models of the process that generated the data) or more useful (e.g.
predictive models for estimating the values of future cases) than the low-level data. The
term KDD refers to the multi-step process of extracting knowledge from databases which
involves a) data preparation, b) searching for patterns, c) knowledge evaluation, and d)
refinement. Fayyad et al., (1996) have specified the steps of the KDD process, namely:
learning the application domain, creating a target data set, data cleaning and pre-processing,
data reduction and projection, choosing the function of data-mining, choosing the data-
mining algorithms, data-mining, interpretation of results, and using discovered knowledge.
The KDD process is interactive in that human intervention might be required at any point
(eg. decisions driven by the knowledge external to the data set). It is also iterative and can
contain multiple loops between any two steps. At the core of KDD is data-mining - the
application of specific tools for pattern discovery and extraction. The practical aspects of
data-mining include dealing with issues such as data storage and access, scalability of
massive data sets, presentation of results and human-machine interaction.
Data is a set of facts stored in a database, and a pattern is an expression describing a subset of
the data or a model applicable to that subset. KDD is the non-trivial process of identifying
valid, novel, potentially useful, and ultimately understandable patterns in data. The KDD
process is non-trivial, implying that some form of search or inference is involved, not merely
a straightforward computation of pre-defined quantities, e.g. computing the average value of
a set of numbers. Using standard algorithms such as BLAST or FASTA in comparing a
given biological sequence with database entries is not the same as performing a KDD,
although these algorithms may be used in particular steps of the KDD process. The
discovered patterns should be valid on new data with some degree of certainty. Various
measures of validity are available such as estimated prediction accuracy on new data, utility
or gain (eg. in dollar value, speed-up etc.). The estimation of novelty, usefulness or
understandability is much more subjective and depends on the purpose of the KDD. The
notion of interestingness (Silberschatz and Tuzhilin, 1997) is usually taken as an overall
measure of a pattern value, combining validity, novelty, usefulness and simplicity.
Interestingness functions can be defined explicitly or can be manifested implicitly through
an ordering or ranking of discovered patterns or models by the KDD system.
Studies of biological data involve accessing multiple databases using a variety of query
tools. The amounts of biological data are growing faster than the capability to analyse them.
KDD offers the capacity to automate complex search and data analysis tasks. We can
distinguish two types of goals of KDD systems: verification and discovery. With
verification, the system is limited to verifying the user's hypothesis. With discovery, the
system autonomously finds new patterns. Discovery can be further subdivided into
prediction and description (explanation) goals. Examples of KDD tasks in biology include
annotation, structure modeling, determination of function or functional sites, study of
structure-activity relationships and the analysis of disease phenotypes. The issues relevant
for KDD in biological databases include heterogeneity of the data sources, complexity of
the biological data and underlying knowledge, error/noise handling and validation of
results. Biological sources are highly heterogeneous, geographically dispersed, constantly
evolving, and often high in volume. KDD in molecular biology commonly includes mining
multiple heterogeneous databases. Flexible access to multiple heterogeneous biological
sources is facilitated through systems such as CORBA (Coppieters et al., 1997)
or Kleisli (Davidson et al., 1997),
which provide standards and tools for the integration of application programs and
multiple distributed heterogeneous data sources. Biological databases represent data from a
highly complex domain where a large body of the background knowledge is available
elsewhere. The integration of the background knowledge can be achieved through human
intervention at different stages of the KDD process. A more systematic approach involves
the design of knowledge bases, in which deeper knowledge of a domain is encoded and
can be utilised directly. An example of a knowledge base is the RIBOWEB system (Chen et
al., 1997). Biological data are inherently noisy, containing errors and biases. Filtering
errors and de-biasing data helps improve the results of KDD. This filtering can be
performed at every step of the KDD process and is often based on human decision or can
be internal to the data-mining algorithm. The validation is a critical issue for the data-mining
as well as the overall KDD process. Verification tasks themselves are a form of validation
and involve estimating the quality of the data fit, typically through statistical tests. Discovery
(in particular prediction) tasks require careful validation. Theoretical validation methods
(Weiss and Kulikowski, 1991) include partitioning and testing data sets (eg. cross-
validation, bootstrapping, etc.) while experimental validation involves generation of new
experimental data for validation of the results of KDD.
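As an illustration of the partition-and-test idea mentioned above (not a tool from the tutorial itself), the sketch below estimates prediction accuracy by k-fold cross-validation; the toy data and the deliberately trivial majority-class predictor are invented for the example:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(points, labels, fit_predict, k=5):
    """Average test accuracy over k train/test partitions.

    fit_predict(train_pts, train_lbls, test_pts) -> predicted labels.
    """
    folds = k_fold_indices(len(points), k)
    correct = total = 0
    for test in folds:
        # Train on every fold except the held-out one.
        train = [i for f in folds if f is not test for i in f]
        preds = fit_predict([points[i] for i in train],
                            [labels[i] for i in train],
                            [points[i] for i in test])
        correct += sum(p == labels[i] for p, i in zip(preds, test))
        total += len(test)
    return correct / total

# A deliberately trivial model: always predict the commonest training label.
def majority(train_pts, train_lbls, test_pts):
    guess = max(set(train_lbls), key=train_lbls.count)
    return [guess] * len(test_pts)

acc = cross_validate(list(range(10)), [0] * 8 + [1] * 2, majority, k=5)
```

On this toy set the majority predictor scores 0.8, since the class it always guesses covers 8 of the 10 items; a real validation would plug a genuine model into the same harness.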
Both general requirements of the KDD process and the specific requirements of the
application domain need to be considered in the design of a KDD process in biology. Data-
mining tasks have been defined in study of biological sequences, examples including the
finding of genes in DNA sequences (eg. Krogh et al., 1994), and regulatory elements in
genomes (eg. Brazma et al., 1997), and the knowledge discovery on transmembrane
domain sequences and signal peptide sequences (Shoudai et al., 1995). Numerous tools
suitable for data-mining in biology are available, yet the selection of an appropriate tool is a
non-trivial task. The KDD process provides for the selection of appropriate data-mining
methods by taking into account both domain characteristics and general KDD process
requirements.
This tutorial consists of four parts: a) introduction to KDD, b) discussion of the domain
concepts from biological data and databases, c) data-mining techniques in biology, and d) a
case study: application of KDD in immunology. The goal of the tutorial is to present KDD
methodology as an approach which is complementary to laboratory experiments and which
can accelerate the process of discovery in biology. This is achieved by both minimisation of
the number of necessary experiments and by improved capacity to interpret biological data.
The tutorial examples include those demonstrating successful applications of KDD to the
experiment planning (Brusic et al., 1998a), prediction of biological function and
description - postulating hypotheses (Brusic et al., 1998b). The tutorial also contains
pointers to the relevant literature.
The target audience includes biologists, medical researchers and computer scientists
interested in biological discovery.
2. Introduction to KDD
2.1. Background
This section provides basic definitions of concepts from KDD, which will be discussed in
more detail in later sections.
2.1.1. What is Knowledge?
Knowledge is a form of information, alongside raw data, interpreted data and expertise.
Conventional databases represent simple data types, such as numbers, strings and Boolean
values. However, we have need for more complex information such as processes,
procedures, actions, causality, time, motivations, goals, common sense reasoning (Chapter
9, Firebaugh, 1989) and, we add, structure and organisation. The term knowledge describes
this broader category of information.
2.1.2. What is KDD?
At an abstract level, the field of Knowledge Discovery in Databases (KDD) is
concerned with the development of methods and techniques for making sense of data.
The basic problem addressed by this process is one of mapping low-level data into other
forms that might be: a) more compact, such as a short report; b) more abstract, such as
a descriptive approximation or model of the process that generated the data; or c) more
useful, such as a predictive model for estimating the value of future cases. KDD is useful
where low-level data is difficult to understand and digest easily because it is either too
voluminous (Fayyad et al., 1996) or too complex. If data are derived from a particularly
complex domain the KDD process is typically performed on small data sets, relative to the
complexity of the process that generated the data. At the core of the KDD process is the
application of specific data-mining methods for pattern discovery and extraction.
2.1.3. Why do we need KDD?
As Fayyad et al. (1996) state, the traditional method of turning data into knowledge relies
on manual analysis and interpretation. Such an approach might well use deductive
databases, where the rules would be learned manually from interviewing experts (see
Chapter 8, Zeleznikow and Hunter 1994). The classical approach to data analysis relies
fundamentally on one or more analysts becoming intimately familiar with the data and
serving as an interface between the data and the users and products. Manual probing of a
data set is slow, expensive and highly subjective. With data volumes growing
dramatically, manual data analysis is becoming impractical in many domains. Databases are
increasing in size in two ways: the number N of records or objects in the database, and the
number d of fields or attributes per object. In the domain of astronomy, databases
containing on the order of N = 10^9 objects are becoming common. In medical diagnostic
applications there are databases containing even d = 10^2 fields. The situation is even more
complicated in biological databases, where related data are dispersed across heterogeneous
and geographically scattered databases. In a database containing millions of records, with
tens or hundreds of fields, some form of automated analysis is essential.
Historically, the notion of finding useful patterns in data has been given a variety of names,
including: data-mining, knowledge extraction, information discovery, information
harvesting, data archaeology or data pattern processing. The term data-mining has been
primarily used by statisticians, data analysts and the Management Information Systems
communities. The phrase knowledge discovery in databases (Piatetsky-Shapiro,
1991) was developed to emphasise that knowledge is the end product of a data-driven
discovery. It has been popularised in the Artificial Intelligence and Machine Learning
communities.
2.1.4. KDD process
The term KDD refers to the multi-step process of extracting knowledge from databases
which involves a) data preparation, b) searching for patterns, c) knowledge evaluation, and
d) refinement. According to Fayyad et al., (1996), KDD refers to the overall process of
discovering useful knowledge from databases, and data-mining refers to a particular step in
this process. They defined knowledge discovery in databases as the non-trivial process of
identifying valid, novel, potentially useful, and understandable patterns in data. The KDD
process involves ten steps, of which the first nine were defined in Fayyad et al. (1996) and
the last step by Zeleznikow and Stranieri (personal communication).
1. Learning the application domain: this includes developing relevant prior
knowledge and identifying the goal and the initial purpose of the KDD process from
the user's viewpoint;
2. Creating a target data set: selecting a data set or focusing on a set
of variables or data samples on which the discovery is to be performed;
3. Data cleaning and pre-processing: includes operations such as removing
noise or outliers if appropriate, collecting the necessary information to model or
account for noise, and deciding on strategies for handling missing data fields;
4. Data reduction and projection: includes finding useful features to represent
the data. With dimensionality reduction or transformation methods, the effective
number of variables under consideration can be reduced, or invariant
representations for the data can be found;
5. Choosing the function of data-mining: deciding the purpose of
the model derived by the data-mining algorithm: summarisation, classification,
regression or clustering;
6. Choosing the data-mining algorithms: selecting the methods to be
used for searching for patterns in the data. This includes deciding
which models and parameters might be appropriate (for example, models of
categorical data differ from models of vectors over the reals) and matching
a particular data-mining method with the overall criteria of the KDD process (for
example, the end user might be more interested in understanding the model than in
its predictive capabilities);
7. Data-mining: searching for patterns of interest in a particular
representational form or a set of such representations, including classification rules
or trees, regression, clustering, sequence modelling, dependency and linear analysis;
8. Interpretation: involves possible further iteration of any of steps (1) through
(7). This step can also involve visualisation of the extracted patterns and models, or
visualisation of the data given the extracted models;
9. Using discovered knowledge: acting directly on the
discovered knowledge, incorporating the knowledge into another system for further
action, or documenting and reporting the knowledge. It also includes checking for
and resolving potential conflicts with previously believed (or extracted) knowledge;
10. Evaluation of KDD purpose: newly discovered knowledge is often used to
formulate new hypotheses, and new questions may be raised using the enlarged
knowledge base. This additional KDD step was defined by Zeleznikow and
Stranieri (personal communication). In this step the KDD process is evaluated for
possible further use, in both refinement and expansion of the purpose of the KDD
process relative to the previous KDD cycle.
The diagrammatic representation of the KDD process is given in Fig. 1.
Figure 1. Steps of the KDD process. [Diagram: a single KDD cycle runs from
developing an understanding of the application domain and relevant prior knowledge
and identifying the goal of KDD, through creating a target data set, data cleaning and
pre-processing, data reduction and projection, matching the goals of KDD (step 1) to a
particular data-mining method, choosing the data-mining algorithm(s) and selecting
methods for searching data patterns, interpretation of discovered patterns, and using
discovered knowledge, to evaluation of KDD purpose (refinement and expansion),
which feeds back into multiple KDD cycles.]
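Much simplified, a single KDD cycle can be sketched in code; the records, field names and the one-threshold "classifier" below are invented purely to make the steps concrete:

```python
def kdd_cycle(raw_records):
    """A toy single KDD cycle: clean, project, mine, report (illustrative)."""
    # Steps 2-3: create the target data set and clean it
    # (here: drop records with missing values).
    target = [r for r in raw_records if None not in r.values()]
    # Step 4: project onto the features judged relevant.
    data = [(r["x"], r["label"]) for r in target]
    # Steps 5-7: data-mining; a one-threshold classifier fitted by search.
    best_t, best_acc = None, -1.0
    for t in sorted({x for x, _ in data}):
        acc = sum((x >= t) == lbl for x, lbl in data) / len(data)
        if acc > best_acc:
            best_t, best_acc = t, acc
    # Steps 8-9: interpret and report the discovered pattern.
    return {"rule": f"label = (x >= {best_t})", "accuracy": best_acc}

records = [
    {"x": 1.0, "label": False}, {"x": 2.0, "label": False},
    {"x": 3.0, "label": None},   # noisy record, removed during cleaning
    {"x": 4.0, "label": True},   {"x": 5.0, "label": True},
]
report = kdd_cycle(records)
```

The returned rule and its training accuracy stand in for the interpretation and reporting steps; a real cycle would then be evaluated (step 10) and possibly repeated with a refined purpose.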
2.1.5. Data-mining
Data-mining is a problem-solving methodology that finds a formal description, possibly
of a complex nature, of patterns and regularities in a set of data. Decker and Focardi
(1995) consider various domains which are suitable for data-mining, including medicine
and business. They state that in practical applications, data-mining is based on two
assumptions. First, the functions that one wants to generalise can be approximated through
some relatively simple computational model with a certain level of precision. Second, the
sample data set contains sufficient information required for performing the generalisation.
Fayyad et al., (1996) see data-mining as the application of specific algorithms for extracting
patterns from data. The distinction between the KDD process and the data-mining step is
important. The additional steps in the KDD process, such as data preparation, data
selection, data cleaning, incorporation of appropriate prior knowledge, and proper
interpretation of the results of mining, are essential to ensure that useful knowledge is
derived from the data. Blind application of data-mining, known as data dredging,
can easily lead to the discovery of meaningless and invalid patterns.
2.2. KDD: An Interdisciplinary Topic
KDD involves distinct research fields including machine learning, pattern recognition,
databases, statistics, artificial intelligence, knowledge acquisition for expert systems, data
visualisation and high-performance computing. The unifying goal is extracting high-level
knowledge from low-level data in the context of large data sets. The data-mining
component of KDD relies on techniques from machine learning, pattern recognition and
statistics to find patterns from data. KDD focuses on the overall process of knowledge
discovery from data, including:
a) how data is stored and accessed;
b) how the process can be scaled to massive data sets and still run efficiently;
c) how results can be visualised;
d) how the overall man-machine interaction can be usefully modelled and supported;
e) how useful patterns can be found in data.
KDD places a special emphasis on finding understandable patterns that can be interpreted as
useful or interesting knowledge. Thus, for example, neural networks, although a powerful
modeling tool, are relatively difficult to understand compared to decision trees. KDD also
emphasises scaling and robustness properties of modeling algorithms for large or noisy
data sets.
Knowledge discovery from data is fundamentally a statistical endeavour. Statistics provides
a language and framework for quantifying the uncertainty that results when one tries to
infer general patterns from a particular sample of an overall population. The term data-
mining has had negative connotations in statistics since the 1960s, when computer-based
data analysis techniques were first introduced. The concern arose because if one searches
long enough in any data set Ñ even randomly generated data Ñ one can find patterns that
appear to be statistically significant, but, in fact, are not. Data-mining is a legitimate activity
as long as it is performed accurately Ñ which requires an appropriate consideration of the
statistical aspects of the given problem. KDD aims to provide tools to automate Ñ to the
degree possible Ñ the entire process of data analysis and the statistician's art of hypothesis
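This point can be demonstrated in a few lines: screening many randomly generated attributes against a random target will turn up agreements that look significant in isolation. The sketch below is purely illustrative; every value in it is noise:

```python
import random

def spurious_patterns(n_samples=30, n_attributes=1000, threshold=0.7, seed=1):
    """Count random binary attributes whose agreement with a random target
    looks impressive, even though every value is pure noise."""
    rng = random.Random(seed)
    target = [rng.randint(0, 1) for _ in range(n_samples)]
    hits = 0
    for _ in range(n_attributes):
        attr = [rng.randint(0, 1) for _ in range(n_samples)]
        agreement = sum(a == t for a, t in zip(attr, target)) / n_samples
        # A 70% agreement (or disagreement) on 30 samples looks striking
        # in isolation, yet arises by chance when 1000 candidates are tried.
        if agreement >= threshold or agreement <= 1 - threshold:
            hits += 1
    return hits

found = spurious_patterns()
```

With these settings some noise attributes will typically clear the bar, which is exactly why discovered patterns must be validated on data not used in the search.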
The problem of effective data manipulation when data cannot fit in the main memory is of
fundamental importance to KDD. Database techniques for gaining efficient data access,
grouping and ordering operations when accessing data, and optimising queries constitute
the basics for scaling algorithms to larger data sets. Most data-mining algorithms from
statistics, pattern recognition, and machine learning assume the data are in main memory and
pay no attention to how the algorithm breaks down if only limited views of the data are
possible.
Data warehousing refers to the process of collecting and cleaning transactional data to make
it available for on-line analysis and decision support. Data warehousing helps set the stage
for KDD in two important ways: data cleaning and data access. Data cleaning: as
organisations are forced to think about a unified logical view of the wide variety of data and
databases they possess, they have to address the issues of mapping data to a single naming
convention, uniformly representing and handling missing data, and handling noise and
errors when possible. Data access: uniform and well-defined methods must be created for
accessing the data and providing access paths to data that were historically difficult to obtain,
for example data stored off-line.
After storing and accessing data, we are ready to perform KDD! A popular approach for
analysis of data warehouses is called on-line analytical processing (OLAP) (Codd, 1993).
OLAP tools focus on providing multidimensional data analysis, which is superior to SQL
in computing summaries and breakdowns along many dimensions. OLAP tools are targeted
toward simplifying and supporting interactive data analysis, but the goal of KDD tools is to
automate as much of the process as is possible. Thus, KDD is a step beyond what is
currently supported by most standard database systems.
2.3. Basic Definitions
As stated previously, KDD is the non-trivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data. Data is a set of facts (stored in a
database) and a pattern is an expression in some language describing a subset of the data or a
model applicable to that subset. Extracting a pattern also designates:
a) fitting a model to data;
b) finding structure from data;
c) making any high-level description of a set of data.
The KDD process is interactive and iterative (Brachman and Anand, 1996). The term
process implies that KDD comprises many steps, which involve: a) data preparation, b)
search for patterns, c) knowledge evaluation, and d) refinement; all repeated in multiple
iterations. By non-trivial, we mean that some search or inference is involved; that is, it is
not a straightforward computation of pre-defined quantities, such as computing the average
value of a set of numbers. The use of standard algorithms such as BLAST or FASTA in
comparing a given biological sequence with database entries thus does not equate to
performing KDD, although these algorithms may be used in particular steps of the KDD
process.
The discovered patterns should be valid on new data with some degree of certainty.
Patterns should also be novel (at least to the system) and potentially useful. The
patterns should also be understandable, certainly after some post-processing. We can
often define measures of certainty, such as estimated prediction accuracy on new data,
or utility, such as gain (perhaps in dollars because of better predictions) or speed-up in
response time of a system. Notions such as novelty and understandability are much more
subjective. In certain contexts, understandability can be estimated by simplicity, such as
the number of bits needed to describe a pattern. The notion of interestingness (Silberschatz and
Tuzhilin, 1997) is usually taken as an overall measure of pattern value, combining validity,
novelty, usefulness and simplicity. Interestingness functions can be defined explicitly or
can be manifested implicitly through an ordering placed by the KDD system on the
discovered patterns or models. A pattern is considered to be knowledge if it exceeds some
interestingness threshold. Knowledge in this definition is purely user oriented and domain
specific and is determined by whatever functions and thresholds the user chooses.
Data-mining is a step in the KDD process that consists of applying data analysis and
discovery algorithms that, under acceptable computational efficiency limitations, produce a
particular enumeration of patterns (or models) over the data. The space of patterns is often
infinite and the enumeration of patterns involves some form of search in this space.
Practical computational constraints place severe limits on the sub-space that can be explored
by a data-mining algorithm.
The KDD process involves using the database along with any required selection, pre-
processing, sub-sampling, and transformations of it; applying data-mining methods
(algorithms) to enumerate patterns from it; and evaluating the products of data-mining to
identify the subset of the enumerated patterns deemed knowledge. The data-mining
component of the KDD process is concerned with the algorithmic means by which patterns
are extracted and enumerated from data.
The overall KDD process includes the evaluation and possible interpretation of the mined
patterns to determine which patterns can be considered new knowledge.
2.4. Data-Mining Step of the KDD Process
The data-mining component of the KDD process often involves repeated iterative
application of particular data-mining methods. The knowledge discovery goals are defined
by the intended use of the system. We can distinguish two types of goals: verification and
discovery. With verification, the system is limited to verifying the user's hypothesis. With
discovery, the system autonomously finds new patterns. We further subdivide the
discovery goal into prediction, where the system finds patterns for predicting the future
behaviour of some entities, and description where the system finds patterns for
presentation to a user in a human-understandable form.
Data-mining involves fitting models to or determining patterns from observed data. The
fitted models play the role of inferred knowledge. Knowledge can be deduced from axioms
and inference rules contained in deductive databases. Whether the models reflect useful or
interesting knowledge is part of the overall, interactive KDD process where subjective
human judgement is typically required. Two primary mathematical formalisms are used in
model fitting: statistical and logical. The statistical approach assumes non-deterministic
effects in the model, whereas a logical model is purely deterministic. We shall focus upon
the statistical approach to data-mining, which tends to be the most widely used basis for
practical data-mining applications given the typical presence of uncertainty in real-world
data-generating processes.
Most data-mining methods are based on tried and tested techniques from machine learning,
pattern recognition and statistics: classification, clustering, regression etc. The actual
underlying model representation being used by a particular data-mining method typically
derives from a composition of a small number of well-known options: a) polynomials, b)
splines, c) kernel and basis functions, and d) threshold-Boolean functions. Algorithms
differ primarily in the goodness-of-fit criterion used to evaluate model fit or in the search
method used to find a good fit.
2.5. Data-Mining Methods
In this section we identify two practical goals of data-mining: prediction and description.
These goals can be achieved by using various general data-mining methods which are
described below.
2.5.1. Practical goals of data-mining
The two high-level practical goals of data-mining are prediction and description. Prediction
involves using some variables or fields in the database to predict unknown or future values
of other variables of interest, and description focuses on finding human-interpretable
patterns describing the data. Although the boundaries between prediction and description
are not sharp, the distinction is useful for understanding the overall discovery goal. The
relative importance of prediction and description on particular data-mining applications can
vary considerably. The goals of prediction and description can be achieved using a variety
of particular data-mining methods.
2.5.2. Classification
Classification is learning a function that maps (classifies) a data item into one of several
predefined classes (Weiss and Kulikowski 1991). Examples of classification methods used
as part of knowledge discovery applications include the classifying of trends in financial
markets and the automated identification of objects of interest in large image databases.
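As a concrete (if minimal) sketch of such a learned mapping, a 1-nearest-neighbour classifier assigns each new item the class of its closest training example; the coordinates and class labels below are invented for illustration:

```python
def nearest_neighbour(train, query):
    """Assign `query` the class of its closest training example.

    `train` is a list of ((x, y), label) pairs; squared Euclidean distance.
    """
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    _, label = min(train, key=lambda item: dist2(item[0], query))
    return label

# Hypothetical two-class data: class "a" near the origin, "b" near (5, 5).
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b")]
```

Calling `nearest_neighbour(train, (0.3, 0.1))` returns "a", since the query falls in the first cluster; the same function form underlies the image-database identification example above, only at far larger scale.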
2.5.3. Regression
Regression is learning a function that maps a data item to a real-valued prediction variable.
Examples include predicting the amount of biomass present in a forest given remotely
sensed microwave measurements, estimating the probability that a patient will survive given
the results of a set of diagnostic tests, predicting consumer demand for a new product as a
function of advertising expenditure and predicting time series where the input variables can
be time-lagged versions of the prediction variable.
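The simplest instance of such a learned real-valued mapping is an ordinary least-squares line fit, sketched below on invented data:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b to paired observations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope from the covariance of x and y over the variance of x.
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return a, b

a, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])  # exactly y = 2x + 1
```

A biomass or time-series predictor of the kind described above differs mainly in the number of input variables and in the richness of the function family, not in this basic fitting idea.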
2.5.4. Clustering
Clustering is a common descriptive task where one seeks to identify a finite set of
categories or clusters to describe the data. The categories can be mutually exclusive and
exhaustive or consist of a richer representation, such as hierarchical or overlapping
categories. Examples of clustering applications in a knowledge discovery context include
discovering homogeneous sub-populations for consumers in marketing databases and
identifying subcategories of spectra from infrared sky measurements. Closely related to
clustering is the task of probability density estimation, which consists of techniques for
estimating from data the joint multivariate probability density function of all the variables or
fields in the database.
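The clustering task can be sketched with a minimal k-means loop on synthetic one-dimensional data; the two "sub-populations" below are invented for illustration:

```python
import random

random.seed(0)
# Two synthetic 1-D sub-populations, one around 0 and one around 10.
data = [random.gauss(0, 1) for _ in range(50)] + [random.gauss(10, 1) for _ in range(50)]

# Minimal k-means with k = 2 clusters.
centres = [min(data), max(data)]
for _ in range(10):
    clusters = [[], []]
    for point in data:
        # Assign each point to its nearest centre.
        nearest = 0 if abs(point - centres[0]) <= abs(point - centres[1]) else 1
        clusters[nearest].append(point)
    # Move each centre to the mean of its assigned points.
    centres = [sum(c) / len(c) for c in clusters]

print([round(c, 1) for c in centres])  # one centre near 0, one near 10
```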
2.5.5. Summarisation
Summarisation involves methods for finding a compact description for a subset of data. A
simple example would be tabulating the mean and standard deviation for all fields. More
sophisticated methods involve the derivation of summary rules, multivariate visualisation
techniques, and the discovery of functional relationships between variables. Summarisation
techniques are often applied to interactive exploratory data analysis and automated report generation.
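The simple tabulation example above, reporting the mean and standard deviation for all fields, can be sketched as follows (field names and values are invented):

```python
from statistics import mean, pstdev

# Hypothetical table: one list of values per numeric field.
table = {"length": [4.0, 6.0, 5.0], "weight": [2.5, 3.5, 3.0]}

# Tabulate the mean and (population) standard deviation for every field.
summary = {name: (mean(vals), pstdev(vals)) for name, vals in table.items()}
for name, (m, s) in summary.items():
    print(f"{name}: mean={m:.2f} std={s:.2f}")
```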
2.5.6. Dependency modelling
Dependency modelling consists of finding a model that describes significant dependencies
between variables. Dependency models exist at two levels: a) the structural level of the
model specifies, often in graphic form, which variables are locally dependent on each
other; and b) the quantitative level of the model specifies the strengths of the dependencies
using some numeric scale. Probabilistic dependency networks (PDNs) use conditional
independence to specify the structural aspect of the model and probabilities or correlations
to specify the strengths of the dependencies. PDNs are finding applications in the
development of probabilistic medical expert systems from databases, information retrieval,
and the modelling of the human genome.
2.5.7. Change and deviation detection
Change and deviation detection focuses on discovering the most significant changes in the
data from previously measured or normative values.
2.6. Components of Data-Mining Algorithms
We can identify three primary components in any data-mining algorithm: a) model
representation, b) model evaluation, and c) search.
2.6.1. Model representation
Model representation is the language used to describe discoverable patterns. If the
representation is too limited, then no amount of training time or examples can produce an
accurate model for the data. It is important that a data analyst fully comprehend the
representational assumptions that might be inherent in a particular method. It is equally
important that an algorithm designer clearly state which representational assumptions are
being made by a particular algorithm. Increased representational power for models
increases the danger of overfitting the training data, resulting in reduced prediction accuracy
on unseen data.
2.6.2. Model-evaluation criteria
The criteria for model evaluation are quantitative statements (fit functions) of how well a
particular pattern (a model and its parameters) meets the goals of the KDD process.
Predictive models are often judged by the empirical prediction accuracy on some test set.
Descriptive models can be evaluated along the dimensions of predictive accuracy, novelty,
utility, and understandability of the fitted model.
2.6.3. Search method
Search consists of two components: parameter search and model search. When the model
representation (or family of representations) and the model-evaluation criteria are
determined and fixed, the data-mining problem is reduced to an optimisation task:
namely finding the parameters and models from the selected family that optimise the
evaluation criteria. In parameter search, the algorithm must search for the parameters that
optimise the model-evaluation criteria given observed data and a fixed model representation.
Model search occurs as a loop over the parameter-search method: the model representation
is changed so that a family of models is considered.
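The interplay of the two search levels can be sketched on invented data: the outer loop enumerates a small family of model representations (a constant model and a linear model), while a closed-form least-squares fit performs the parameter search inside each, and held-out error serves as the model-evaluation criterion:

```python
# Data generated by a roughly linear process (hypothetical numbers).
train = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.2)]
test = [(5, 10.1), (6, 11.8)]

def fit_constant(data):
    # Parameter search for the constant model y = c: the mean minimises squared error.
    c = sum(y for _, y in data) / len(data)
    return lambda x: c

def fit_linear(data):
    # Parameter search for y = a*x + b via closed-form least squares.
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    a = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x, _ in data)
    return lambda x, a=a, b=my - a * mx: a * x + b

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Model search: the outer loop over the family wraps the parameter search.
best_name, best_model = min(
    (("constant", fit_constant(train)), ("linear", fit_linear(train))),
    key=lambda pair: mse(pair[1], test),
)
print(best_name)  # the linear model should win on held-out data
```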
2.7. Data-Mining Tools: An Overview
Whilst not claiming to conduct a wide-ranging survey of data-mining methods, here we
briefly describe several popular techniques: a) decision trees and rules, b) non-linear
regression and classification methods, c) example-based methods, d) probabilistic
graphic dependency models, and e) relational learning models. Data-mining methods and
pattern recognition techniques for specific use with biological databases are covered in
detail in the ISMB-98 tutorial series (Baldi and Brunak, 1998; Lawrence, 1998; Brazma et
al., 1998; Büsching and Schleiermacher, 1998). Here we provide an overview of
data-mining methods intended to help understand their general properties and facilitate the
selection of a "right" method when required.
2.7.1. Decision trees and rules
Decision trees and rules that use univariate (typically two-way) splits have a simple
representational form, making the inferred model relatively easy for the user to
comprehend. However, the restriction to a particular tree or rule representation can
significantly restrict the functional form and approximation power of the model. An
example of a decision tree is given in Fig. 2.
Decision-tree and rule-induction algorithms depend on likelihood-based model-evaluation
methods, with varying degrees of sophistication in terms of penalising model complexity.
Greedy search methods, which involve growing and pruning rule and tree structures, are
typically used to explore the superexponential space of possible models. Trees and rules
are primarily used for predictive modelling, as occurred in the IKBALS (Zeleznikow et al.,
1994) and Split-Up (Stranieri et al., 1998) projects, both for classification and regression,
although they can also be applied to summary descriptive modelling.
Figure 2. An example of a decision tree for the determination of protein coding
regions. This decision tree contains five test nodes. Adapted from (Salzberg, 1995),
where a detailed description of the feature measures can be found.
Rule induction is the process in which a computer program generates rules from sample
cases or data. A rule induction system is given examples of a problem where the outcome
is known. These examples are called the 'training set'. When it has been given several
examples, the rule induction system is then asked to induce rules that fit the example set.
The rules can be used to assess other cases where the outcome is not known.
[Figure 2 diagram: test nodes Fourier-3 < 147.0, Fourier-3 < 199.0, Diamino acid usage
< 2.911, Hexamer-2 < -2.583 and Hexamer-1 < -2.708, with leaves labelled Coding and
Noncoding.]
At the basis of a rule induction system is an algorithm which is used to induce the rules
from the examples. An example is the ID3 algorithm (Quinlan, 1986). However, all
induction algorithms work in a similar fashion and are all based in some way on statistical
analysis of the training set. Advantages of machine induction include:
a) The ability to deduce new knowledge. It may be possible to list all the factors
influencing a decision without understanding their impacts; and
b) Once rules have been generated, they can be reviewed and modified by the domain
expert.
There are, however, some difficulties in implementing rule induction systems. These include:
a) Some induction programs or some training sets may generate rules that are not easy
for humans to understand;
b) Rule induction programs do not select the attributes. Hence, if the domain expert
chooses inappropriate attributes in the creation of the training set, then the rules
induced are likely to be of little value;
c) The method is only useful for rule based, classification type problems;
d) The number of attributes must be fairly small; and
e) The training set should not include cases that are exceptions to the underlying rules.
In biology, this requirement is difficult to fulfil.
An example of a tree-based application in biology is the BONSAI Garden system (Shoudai et
al., 1995). The BONSAI system uses positive and negative examples to produce decision trees
and has been used to discover knowledge on transmembrane domain sequences and signal
peptide sequences by computer experiments. Decision trees have also been used for
determination of protein coding regions (Salzberg, 1995).
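As an illustration of the greedy, ID3-style search described above, the following sketch picks the single threshold split with the highest information gain on an invented training set; the feature and labels loosely echo the coding-region example, but the numbers are made up:

```python
from math import log2

# Hypothetical training set: (feature value, class) pairs, e.g. a sequence
# score against a "coding" / "noncoding" label.
examples = [(0.9, "noncoding"), (1.1, "noncoding"), (1.4, "noncoding"),
            (2.6, "coding"), (2.9, "coding"), (3.3, "coding")]

def entropy(labels):
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum(n / len(labels) * log2(n / len(labels)) for n in counts.values())

def best_split(data):
    # Greedy search: try a threshold between every adjacent pair of values
    # and keep the one with the highest information gain (ID3-style).
    values = sorted(v for v, _ in data)
    base = entropy([c for _, c in data])
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [c for v, c in data if v < t]
        right = [c for v, c in data if v >= t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(data)
        if best is None or gain > best[1]:
            best = (t, gain)
    return best

threshold, gain = best_split(examples)
print(round(threshold, 1))  # a split near 2.0 separates the classes cleanly
```

A full tree inducer repeats this split search recursively on each branch, and a rule inducer reads paths from the resulting tree.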
2.7.2. Nonlinear regression and classification methods
These methods consist of a family of techniques for prediction that fit linear and non-linear
combinations of basis functions (sigmoids, splines, polynomials) to combinations of
the input variables. Examples include: a) feed-forward neural networks, b) adaptive spline
methods, and c) projection pursuit regression.
In terms of model evaluation, although neural networks of the appropriate size can
universally approximate any smooth function to any desired degree of accuracy, relatively
little is known about the representation properties of fixed-size networks estimated from
finite data sets. Also, the standard squared error and cross-entropy loss functions used to
train neural networks can be viewed as log-likelihood functions for regression and
classification respectively.
A neural network receives its name from the fact that it resembles a nervous system in the
brain. It consists of many self-adjusting processing elements cooperating in a densely
interconnected network (see Fig. 7c). Each processing element generates a single output
signal which is transmitted to the other processing elements. The output signal of a
processing element depends on the inputs to the processing element: each input is gated by
a weighting factor that determines the amount of influence that the input will have on the
output. The strength of the weighting factors is adjusted autonomously by the processing
element as data is processed. Back-propagation is a parameter-search method that performs
gradient descent in parameter (weight) space to find a local maximum of the likelihood
function starting from random initial conditions. Non-linear regression methods, although
powerful in representational power, can be difficult to interpret. There are many examples
of neural network applications in biology, starting from the early 1980s (eg. see Stormo et al.,
1982). We will later discuss in detail the PERUN system (Brusic et al., 1998a), which
utilises an evolutionary algorithm and artificial neural networks for the prediction of
immunologically interesting peptides.
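The weight-adjustment idea can be sketched with a single sigmoid unit trained by gradient descent; this is a deliberately minimal stand-in for a full back-propagation network, and the data, learning rate and epoch count are invented:

```python
import math

# A single sigmoid processing element trained by gradient descent: a minimal
# sketch of the weight-adjustment idea behind back-propagation (real networks
# stack many such units into densely interconnected layers).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # logical AND

w = [0.0, 0.0]
b = 0.0
rate = 0.5

def forward(x):
    # Each input is gated by a weighting factor before the sigmoid squashing.
    return 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))

for _ in range(2000):
    for x, target in data:
        out = forward(x)
        err = out - target          # gradient of cross-entropy loss at the unit's input
        w[0] -= rate * err * x[0]   # descend the gradient for each weight
        w[1] -= rate * err * x[1]
        b -= rate * err

print([round(forward(x)) for x, _ in data])  # should recover [0, 0, 0, 1]
```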
2.7.3. Example-based methods
The representation in example-based methods is simple: use representative examples from
the database to approximate a model; that is, predictions on new examples are derived
from the properties of similar examples in the model whose attribute values are known.
Techniques include: a) nearest-neighbour classification, b) regression analysis, and
c) case based reasoning. The use of example-based methods in biology has been quite
limited, restricted by the complexity of the domain. An example from biology is the use of
case based reasoning in the prediction of protein secondary structure (see Leng et al.).
Case based reasoning is the catch-all term for a number of techniques of representing and
reasoning with prior experience to analyse or solve a new problem. It may include
explanations of why previous experiences are or are not similar to the present problem,
and includes techniques of adapting past solutions to meet the requirements of the present
problem. The diagram in Fig. 3 indicates the case based reasoning cycle.
Figure 3.
Case based reasoning cycle. Adapted from (Kolodner 1993).
Case based reasoners can:
a) arrive at conclusions based on a number of cases, rather than the entire body of
possibly contradictory and complex rules;
b) interpret open textured concepts by using analogy;
c) in sharp contrast to rule based systems, become more accurate as more information
is stored in the case based reasoner; and
d) improve the knowledge acquisition process, since the notion of a precedent or past
experience is an intuitive one for knowledge engineers and domain experts.
In attempting to provide some standards for discussing case based reasoning, Ashley
(1991) has identified five case based reasoning paradigms:
a) Statistically oriented paradigm. In this paradigm, cases are used as data points for
statistical generalisation. The case based reasoner computes conditional
probabilities that a problem should be treated similarly to previously given cases.
The MBRTALK system (Stanfill and Waltz, 1986), which pronounces novel words,
uses this paradigm.
b) Model based paradigm. This paradigm assumes that there is a strong causal model
of the domain task. It generally involves selecting among partially matched cases,
in which symbolic reasoning is used to determine the difference between the given
problem and the retrieved cases. This symbolic reasoning is performed in the
context of the given domain model. CASEY (Koton, 1988) and FLORENCE
(Bradburn et al., 1993) are applications in the domain of health care diagnosis
that use this paradigm.
c) Planning or design oriented paradigm. In the planning oriented paradigm, cases are
instantiated. They record a past problem's solution and are used as templates to
map the solution on to a new problem. The retrieved plan must be adapted to solve
the new case, a task that may require solving new problems. PERSUADER
(Sycara, 1990) and CHEF (Hammond, 1989) are examples of case based reasoners
that use this paradigm.
d) Exemplar based paradigm. In the exemplar based paradigm, cases are exemplars of
concepts. Problem solving involves classification. Often, the concepts cannot be
satisfactorily defined by a set of necessary conditions, and require a domain expert
to explain why a case exemplifies a concept. PROTOS (Bareiss et al., 1988) is an
example of a case based reasoner that uses this paradigm.
e) Precedent based paradigm. In the precedent based paradigm, cases are precedents
employed in arguments by analogy to justify assertions about how to decide a
problem. Since analogies may lead to various precedents which have differing
outcomes, the reasoner must justify its conclusion as well as giving the competing
precedents. It does so by taking one side's viewpoints and discrediting analogies to
opposing precedents. An example of this approach is GREBE (Branting, 1991).
Case based reasoners can be built much more quickly than rule based reasoners and are
much easier to maintain. The addition of a new rule to a rule based system can require the
modification of several other rules. Addition of cases to a case library rarely involves
modification of the library. Model based reasoning is based on knowledge of the structure
and behaviour of the devices the system is designed to understand.
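A nearest-neighbour classifier, the simplest of the example-based methods above, can be sketched as follows; the "case library" of feature vectors and the queries are invented:

```python
from math import dist

# Hypothetical case library: feature vectors with known outcomes.
cases = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((4.0, 4.2), "B"), ((3.8, 4.0), "B"), ((4.1, 3.9), "B")]

def knn_predict(query, k=3):
    # Retrieve the k most similar stored cases (Euclidean distance as the
    # similarity metric) and reuse their majority outcome.
    nearest = sorted(cases, key=lambda case: dist(case[0], query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

print(knn_predict((1.1, 0.9)))  # resembles the "A" cases
print(knn_predict((3.9, 4.1)))  # resembles the "B" cases
```

Note that the model is implicit in the stored cases: prediction works entirely by retrieval, and the choice of distance metric does the real work.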
2.7.4. Probabilistic graphic dependency models
Graphic models specify probabilistic dependencies using a graph structure. The model
specifies which variables are directly dependent on each other. Typically, these models are
used with categorical or discrete-valued variables, but extensions to special cases, such as
Gaussian densities, for real-valued variables are also possible. These models were initially
developed within the framework of probabilistic expert systems; the structure of the model
and the parameters (the conditional probabilities attached to the links of the graph) were
elicited from experts.
There has been much recent work Ñ in both Artificial Intelligence and Statistics Ñ on
methods whereby both the structure and the parameters of graphic models can be learned
directly from databases. Bayesian methods provide a formalism for reasoning about partial
beliefs under conditions of uncertainty. In this formalism, propositions are given numerical
values, signifying the degree of belief accorded to them. Bayes' theorem is an important
result in probability theory, which deals with conditional probability. It is useful in dealing
with uncertainty, as well as the use of Bayesian inference networks for information
retrieval. Bayes' theorem states that:
Pr(Ai|J) = Pr(J|Ai) Pr(Ai) / [ Σk Pr(J|Ak) Pr(Ak) ]
and thus allows one to evaluate certain conditional probabilities given other conditional
probabilities.
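A small numerical sketch of Bayes' theorem, with invented prior and likelihood values for a two-hypothesis diagnostic setting:

```python
# Hypothetical diagnostic setting: mutually exclusive hypotheses A1, A2
# (e.g. disease present / absent) and an observed test result J.
prior = {"A1": 0.01, "A2": 0.99}          # Pr(Ai)
likelihood = {"A1": 0.95, "A2": 0.05}     # Pr(J | Ai)

# Bayes' theorem: Pr(Ai|J) = Pr(J|Ai) Pr(Ai) / sum_k Pr(J|Ak) Pr(Ak)
evidence = sum(likelihood[a] * prior[a] for a in prior)
posterior = {a: likelihood[a] * prior[a] / evidence for a in prior}

print(round(posterior["A1"], 3))  # a positive test raises the 1% prior to ~16%
```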
Model-evaluation criteria are typically Bayesian in form, and parameter estimation can be a
mixture of closed-form estimates and iterative methods depending on whether a variable is
directly observed or hidden. Model search can consist of greedy hill-climbing methods over
various graph structures. Prior knowledge, such as a partial ordering of the variables based
on causal relations, can be useful in terms of reducing the model search space.
Although still primarily in the research phase, graphic model induction methods are of
particular interest to KDD because the graphic form of the model lends itself easily to
human interpretation.
Bayesian concepts have been extensively used in biological sequence analysis. Examples
include determination of evolutionary distances in aligned sequences (Agarwal and States,
1996) and finding regulatory regions in DNA (Crowley et al., 1997). Bayesian inference
algorithms are covered in ISMB tutorial series (Lawrence, 1998).
2.7.5. Relational learning models
Although decision trees and rules have a representation restricted to propositional calculus,
relational learning (better known as inductive logic programming) uses the more flexible
pattern language of first-order predicate calculus. A relational learner can
easily find formulae such as X = Y. Most recent research on model-evaluation methods for
relational learning is logical in nature. Increased representational power of relational models
comes at the price of significant computational demands in terms of search.
2.8. Comparative Notes on Data-Mining Methods
Logic and rule based systems are easy to build, and development shells are available which
can speed the process of building commercial decision support systems; however, they are
limited in reasoning ability and require interactive input from human experts. We advocate
the use of combined systems, which can perform analogical, inductive and deductive
reasoning. The logic of exploratory data analysis has been studied extensively; for an
initial reference see (Yu, 1994).
A disadvantage of example-based methods, compared with tree-based methods, is that a
well-defined distance metric for evaluating the distance between data points is required. Model
evaluation is typically based on cross-validation estimates of a prediction error (Weiss and
Kulikowski 1991): parameters of the model to be estimated can include the number of
neighbours to use for prediction and the distance metric itself.
Non-linear regression methods are relatively easy to build and maintain, and can tolerate
noisy data, however they require relatively large data sets and it is often difficult to extract
relevant rules from the model.
Example-based methods are often asymptotically powerful in terms of approximation
properties but, conversely, can be difficult to interpret because the model is implicit in the
data and not explicitly formulated, as is also the case with neural networks. Case based
reasoning, on the other hand, offers the following natural techniques for realising expert
systems goals:
a) compiling past solutions,
b) avoiding past mistakes,
c) interpreting rules,
d) supplementing weak domain models,
e) facilitating explanation, and
f) supporting knowledge acquisition and learning.
Human knowledge acquisition often involves the use of experiences and cases; case based
reasoning often accurately models the manner in which humans reason.
Compared to both rule based systems and non-linear regression methods, case based
reasoners have disadvantages in that they are:
a) hard to build,
b) complicated to maintain, and
c) more likely to be research prototypes than available as commercially useful systems.
The advantage of probabilistic methods is that they utilise a well-defined theoretical
background: Bayesian concepts. The disadvantage of probabilistic methods is that they
require correctly assigned probabilities, which are often not clearly assignable in particular
biological cases. This requires good understanding of the nature of data, which is not a
requirement in non-linear regression methods.
Each data-mining technique typically suits some problems better than others. For example,
decision tree classifiers can be useful for finding structure in high-dimensional spaces and
in problems with mixed continuous and categorical data, because tree methods do not
require distance metrics. However, classification trees might not be suitable for problems
where the true decision boundaries between classes are described by a polynomial, for
example one of second order. There is no universal data-mining method and
choosing a particular algorithm for a particular application is an art rather
than a science. In practice, a large portion of the application effort should
go into properly formulating the problem rather than into optimising the
algorithmic details of a particular data-mining method.
3. Domain Concepts from Biological Data and Databases
3.1. Bioinformatics
Biological databases continue to grow rapidly. This is reflected in increases in both the size
and complexity of individual databases as well as in the proliferation of new databases. We
have ever-increasing requirements for both speed and sophistication of data analysis to
maintain the ability to effectively use the available data. Bioinformatics is a field emerging
at the overlap between biology and computer science. Biological science provides deep
understanding of this complex domain, while computer science provides effective means to
store and analyse volumes of complex data. Combining the two fields gives the potential
for great strides in understanding biological systems and increasing the effectiveness of
biological research. The difficulties in effective use of bioinformatic tools arise at both
ends: an average biologist has a limited understanding of sophisticated data analysis
methods, of their applicability and limitations, while an average computer scientist lacks
understanding of the depth and complexity of biological data. Bioinformaticians need to
develop an overlap of understanding between the two fields.
The KDD process provides a framework for efficient use of bioinformatics resources in both
defining meaningful biological questions and obtaining acceptable answers.
3.2. What do we need to know about biological data?
The four most important data-related considerations for the analysis of biological systems
are understanding of: a) the complexity and hierarchical nature of processes that generate
biological data, b) fuzziness of biological data, c) biases and potential misconceptions
arising from domain history, reasoning with limited knowledge, a changing domain and
methodological artefacts, and d) the effects of noise and errors. These points are illustrated
in our case study example. Despite a broad awareness, biological-data-specific
issues have not been reported extensively in the bioinformatics literature. This awareness is
exemplified in the words of Altschul et al. (1994): "Surprisingly strong biases exist in
protein and nucleic acid sequences and sequence databases. Many of these reflect
fundamental mosaic sequence properties that are of considerable biological interest in
themselves, such as segments of low compositional complexity or short-period repeats.
Databases also contain some very large families of related domains, motifs or repeated
sequences, in some cases with hundreds of members. In other cases there has been a
historical bias in the molecules that have been chosen for sequencing. In practice, unless
special measures are taken, these biases commonly confound database search methods and
interfere with the discovery of interesting new sequence similarities."
3.2.1. Complexity underlying biological data
Biological data are sets of facts stored in databases which represent measurements or
observations of complex biological systems. The underlying biological processes are
highly interconnected and hierarchical; this complexity is usually not encoded in the data
structure, but is a part of "background" knowledge. Knowledge of the biological process
from which data are derived enables us to understand the domain features that are not
contained in the data set. Raw information thus has a meaning only in the broader context,
understanding of which is a prerequisite for asking "right" questions and subsequent
selection of the appropriate analysis tools. According to Benton (1996), the complexity of
biological data is due both to the inherent diversity and complexity of the subject matter,
and to the sociology of biology.
3.2.2. Fuzziness of biological data
Biological data are quantified using a variety of direct or indirect experimental methods.
Even in a study of a clearly delineated biological phenomenon a variety of experimental
methods are usually available. An experimental method is considered useful if a correlation
can be established between its results and a studied phenomenon. This correlation is rarely,
if ever, perfect. Distinct experimental methods in the study of the same biological
phenomenon would generally produce sets of results that overlap, but not fully. Comparing
these results involves scaling and granularity issues. Within the same experimental method,
differences of results arise from our inability to reproduce identical conditions (eg.
temperature, pH, use of different cells or cell lines, use of chemicals from different
suppliers etc.). Quantification of the results is commonly a result of a human decision or it
may vary due to calibration of equipment. A reported quantitative result is typically the
average value of several independent experiments. Quantitative biological data are fuzzy
due both to inherent fuzziness of the biological systems themselves, and to the imprecision
of the methods used to collect and evaluate data. Quantitative biological data therefore
represent approximate measurements. On the other hand, the classes to which qualitative
biological data are assigned are arbitrary, but objective in that they represent some
biological fact. Biological research is largely driven by geographically dispersed
individuals, who use unique experimental protocols and thus biological experimental data
are produced with neither standard semantics nor syntax (Benton, 1996). Understanding
the fuzzy nature of biological data is therefore crucial for the selection of appropriate data
analysis tools.
3.2.3. Biases and misconceptions
Biological data are subject to strong biases due either to their fundamental properties, the
presence of large families of related motifs, or historical reasons (Altschul et al., 1994). A set
of biological data rarely represents a random sample from the solution space. Typically,
new results are generated around previously determined data points. Some regions of the
solution space are therefore explored in depth, while some regions remain unexplored.
Historical reasons are a common cause of such biases, where a set of rules might be
defined in an attempt to describe a biological system. If these rules get accepted by a
research community, further research will get directed by applying these rules. If those
rules describe a subset of the solution space, the consequence is the refinement of the
knowledge of the subset of solutions that satisfies the rules, while the rest of the solution
space is largely ignored. Similarly, reasoning with limited knowledge can lead to over- or
under-simplification errors. A careful assessment of the relative importance of each data
point is thus necessary for the data analysis. Improvements in the technology also influence
biological data. Older data are often of lower granularity both quantitatively and
qualitatively, while newer data are often of higher precision, due to both expanded
background knowledge and improved experimental technology.
3.2.4. Noise and errors
Sources of noise in biological data include errors of experimentation, measurement,
reporting, annotation and data processing. While it is not possible to eliminate errors from
data sets, a good estimate of the level of noise within the data helps selection of the
appropriate method of data analysis. Due to the complexity of biological systems,
theoretical estimation of error levels in the data sets is difficult. It is often possible to make
a fair estimate of the error level in biological data by interviewing experimental biologists
who understand both the process that generated that data and the experimental
methodology. In the absence of a better estimate it is reasonable to assume the error level in
biological data at 5%.
3.2.5. How to design KDD process?
When sufficient data are available and the biological problem is well-defined, standard
statistical methodology should be applied. A field where this approach has been routinely
used is epidemiology (see Coggon et al., 1997). Most of biological research, particularly in
molecular biology, is conducted in domains characterised by limited background
knowledge and by data from various sources and of variable accuracy. In such cases
artificial intelligence techniques are more useful. To facilitate bioinformatic analysis of
biological systems, we have defined a Data Learning Process (DLP), comprised of a series
of steps (Fig. 4). The DLP steps are: a) develop understanding of the biological system and
methodological processes that generate data, b) develop a standardised fuzzy representation
of the data, c) relate data from various sources using this standardised representation, d)
identify potential sources of biases in data, e) assess the validity of relevant models
reported in the literature, f) estimate the amount and types of errors in the data sets, and g)
integrate knowledge acquired in previous steps in some coherent form (eg. model or
description). Iterative cycling (refinement) between any two steps a) through f) can be
performed. Performing the DLP steps requires significant inputs from both biologists and
computer scientists and must involve two-way communication.
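Step b) of the DLP, developing a standardised fuzzy representation, might be sketched as follows; the class names, cutoffs and measurement values below are purely hypothetical and simply illustrate relating data from different sources through a shared fuzzy scale:

```python
# A sketch of DLP step b): mapping heterogeneous quantitative measurements
# onto one standardised set of fuzzy classes. Thresholds and values are
# hypothetical, not taken from any real assay.
def fuzzy_class(value, cutoffs):
    """Return graded memberships in 'low' / 'medium' / 'high' classes."""
    lo, hi = cutoffs
    if value <= lo:
        return {"low": 1.0, "medium": 0.0, "high": 0.0}
    if value >= hi:
        return {"low": 0.0, "medium": 0.0, "high": 1.0}
    frac = (value - lo) / (hi - lo)   # relative position between the cutoffs
    return {"low": max(0.0, 1 - 2 * frac),
            "medium": 1 - abs(2 * frac - 1),
            "high": max(0.0, 2 * frac - 1)}

# Two laboratories reporting on different scales are related through the
# same standardised classes.
lab1 = fuzzy_class(30, cutoffs=(10, 100))     # e.g. raw assay units
lab2 = fuzzy_class(0.3, cutoffs=(0.1, 1.0))   # e.g. normalised units

print(round(lab1["low"], 2), round(lab2["low"], 2))  # same relative position
```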
Figure 4.
Data learning process.
[Diagram: DLP steps a) through g), with iterative refinement between steps, leading to a
working conceptual model or description.]
3.3. Database Notes
The term database refers to a collection of data that is managed by a database management
system. A database management system is expected to:
a) Allow users to create new databases and specify their schema (the logical structure
of the data), using a specialised language called a data-definition language;
b) Give users the ability to query and modify the data, using an appropriate language,
known as a query language or data-manipulation language;
c) Support the storage of very large amounts of data over a long period of time,
keeping it secure from accident or unauthorised use and allowing efficient access to
the data for queries and database modifications; and
d) Control access to data from many users at once, without allowing the actions of
one user to affect other users and without allowing simultaneous accesses to
accidentally corrupt the data.
Database management systems provide:
a) The ability to manage persistent data;
b) The ability to access large amounts of data efficiently;
c) Support for at least one data model (a mathematical abstraction) through which the
user can view the data. A Database Management System provides at least one
abstract model of data that allows the user to see data in understandable forms;
d) Support for certain high-level languages that allow the user to define the structure
of the data, and to access and manipulate data;
e) Transaction management: the capability to provide correct concurrent access to the
database by many users at once;
f) Security and integrity: the ability to limit access to data to authorised users, and the
ability to check the validity of the data; and
g) Resiliency: the ability to recover from system failures without losing data.
To make access to files easier, a Database Management System provides a data
manipulation language to express operations on files. The conceptual database level is an
abstraction of the real world.
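As a toy illustration of these roles, SQLite (through Python's sqlite3 module) exposes a data-definition language (CREATE TABLE), a data-manipulation language (INSERT) and a query language (SELECT); the schema and values below are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data-definition language: declare the schema (logical structure of the data).
conn.execute("CREATE TABLE sequence (id TEXT PRIMARY KEY, organism TEXT, length INTEGER)")

# Data-manipulation language: populate and modify the data.
conn.executemany("INSERT INTO sequence VALUES (?, ?, ?)",
                 [("S1", "H. sapiens", 1200), ("S2", "M. musculus", 800)])

# Query language: ask questions of the data.
rows = conn.execute("SELECT id FROM sequence WHERE length > 1000").fetchall()
print(rows)  # only the long sequence qualifies
conn.close()
```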
3.3.1. Database development trends
The trend of technology for the next few years will be dominated by:
a) Distributed, heterogeneous environments;
b) Open systems;
c) More functionality;
d) Parallel Database Management.
Application areas will include:
a) Engineering design and management;
b) Office Information Systems;
c) Decision Support Systems;
d) The Human Genome Initiative.
The next generation of databases will include:
a) Active Databases;
b) Multimedia Databases;
c) Scientific and Statistical Databases;
d) Spatial and Temporal Databases.
3.3.2. Intelligent information systems
Brodie (1993) proposed that the intelligent information systems of the twenty-first
century will need to be both distributed and heterogeneous. Papazoglou et al.
(1992) discussed in detail how to construct intelligent cooperating information systems. The three major
ingredients that any such system must have are interconnectivity, interoperability and
cooperation. Interconnectivity implies the existence of some standard communication
network to interconnect the information system nodes and the ability of these nodes to
communicate by exchanging messages. Interoperability involves the ability of two or more
systems to work together to execute well-defined and delimited tasks collectively.
Cooperation is the process in which a set of information systems utilise refined intelligence
to exchange beliefs, reason about each other and about one another's conceptions, and in
general 'discuss' how they can coordinate their activities to contribute to an information-
intensive problem. Intelligent cooperating information systems are an extension of
distributed artificial intelligence and distributed database systems.
3.4. Database-Related Issues in Biology
Hundreds of biological data repositories are publicly available, containing large quantities
of data. A comprehensive listing of biological databases is available at Infobiogen
<>. The ability to access
and analyse that data has become crucial in directing biological and medical research. The
Internet and WWW facilitate access to data sources and also provide data analysis services.
The main issues in biological databases are: a) integration of multiple data sources, and b)
flexible access to these sources.
3.4.1. Integration of heterogeneous databases
Markowitz (1995) uses a definition of a database as a data repository which provides a
view of data that is: a) centralised, b) homogeneous, and c) usable in multiple
applications. The data in a database are structured according to a schema (database
definition), which is specified in a data definition language, and are manipulated using
operations specified in a data manipulation language. A data model defines the semantics
used for data definition and data manipulation languages. Biological databases are
characterised by various degrees of heterogeneity in that they:
a) encode different views of the biological domain,
b) utilise different data formats,
c) utilise various database management systems,
d) utilise different data manipulation languages,
e) encode data of various levels of complexity, and
f) are geographically scattered.
The most popular format for distribution of biological databases is a flat file format. The
advances in understanding biological processes induce frequent changes in flat file formats
in use (Coppieters et al., 1997). Popular formats for biological databases also include
Sybase relational DBMS, Sybase/OPM (Chen et al., 1995) and ACeDB (Durbin and
Thierry-Mieg, 1991-), among others.
A comprehensive study of a particular molecular biology domain would involve analysis of
data from multiple sources which contain related data that overlap to some degree. Attempts
were made to overcome the problems arising from heterogeneity of the data sources and
access tools, including (Markowitz, 1995):
a) consolidating databases into a single homogeneous database,
b) consolidating databases via imposing a common data definition language, data
model or database management system,
c) database federations and connecting databases via WWW by maintaining hyperlinks
between component databases, which preserve their autonomy,
d) data warehouses in which data from federated databases are also loaded into a
central database (eg. Integrated Genomic Database, Ritter et al., 1994), and
e) multidatabase systems which are collections of loosely coupled databases which can
be queried using a common query language (eg. in Kleisli, Davidson et al., 1997)
or both described and queried by using a common data model (eg. in Chen et al.).
The consolidating options failed because of cost and lack of cooperation. Federated
databases allow interactive querying of multiple databases, however with a limited ability to
perform complex queries. From the KDD perspective, data warehouses and particularly
multidatabase systems are the most interesting. Multidatabase browsers which facilitate
retrieval from multiple databases and cross-referencing include SRS (Etzold et al., 1996),
Entrez (Schuler et al., 1996), DBGET (Migimatsu and Fujibuchi, 1996) and ACNUC
(<>), among others. Multidatabase browsers, however,
do not allow formulation of complex queries such as those required in the KDD process.
3.4.2. Flexible access to biological databases
KDD requirements include both flexible access to and performing of complex queries on
multidatabase systems. Those requirements facilitate the data preparation phase of a KDD
process (the data preparation phase includes steps 2, 3 and 4, Fig. 1). Flexible access to the
diverse biological sources is facilitated through systems such as CORBA
(Coppieters et al., 1997) or Kleisli (Davidson et al., 1997).
CORBA (Common Object Request Broker Architecture) defines a set of standards which constitute
a coherent framework in which independent data sources and their services can be
accessed. The standards include: a) a formal language, the interface definition language
(IDL), in which data and services are specified, and b) the object request broker (ORB)
which is necessary to realise these services. The CORBA framework has been used for
integration and interoperability of biological data resources at the European Bioinformatics
Institute (Coppieters et al., 1997). However, according to Kosky et al. (1996) the IDL is
not appropriate for defining database schemas, and attempts were made to combine
CORBA with their OPM (Object Protocol Model). CORBA-based technology has been
used for the design and implementation of a genome mapping system (Hu et al., 1998) with
emphasis on database connectivity and graphical user interfaces.
BioKleisli <> offers high-level flexible
access to human genome and other molecular biological sources. It comprises:
a) a self-describing data model for complex structured data,
b) a high-level query language for data transformation, and
c) flexible yet precise control in ad-hoc queries.
In the Kleisli environment, a typical query implementation time is reduced from weeks to days
(sometimes, hours). The architecture of the Kleisli system is given in Fig. 5.
By definition, the KDD process is non-trivial, which implies complex queries to data
sources. The utilisation of standards and tools such as those contained in CORBA or Kleisli
systems will be essential for the future development of integrated biological applications
and consequently for the design of KDD applications in biology.
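The kind of cross-source retrieval that systems such as Kleisli or a CORBA broker automate can be caricatured in a few lines: two independently formatted sources are wrapped behind a uniform record interface and joined on a shared key. The sources, record formats and field names below are invented for illustration.

```python
# Source 1: flat-file-style records, one string per entry (hypothetical format).
flat_source = [
    "AC=X001|ORG=Homo sapiens|SEQ=MKV...",
    "AC=X002|ORG=Mus musculus|SEQ=MAL...",
]

# Source 2: relational-style rows (accession, map position), also invented.
relational_source = [("X001", "6p21.3"), ("X003", "17q11")]

def wrap_flat(record):
    """Parse a flat-file record into a uniform dict."""
    return dict(field.split("=", 1) for field in record.split("|"))

def cross_source_query():
    """Join both sources on accession, as a multidatabase query would."""
    positions = dict(relational_source)
    for rec in map(wrap_flat, flat_source):
        if rec["AC"] in positions:
            yield {"accession": rec["AC"], "organism": rec["ORG"],
                   "map_position": positions[rec["AC"]]}

print(list(cross_source_query()))
# -> [{'accession': 'X001', 'organism': 'Homo sapiens', 'map_position': '6p21.3'}]
```

A real multidatabase system contributes what this sketch omits: a common query language, drivers for each source format, and optimisation of where each part of the query executes.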
Figure 5.
Architecture of the Kleisli system, which facilitates combining and
transformation of data from multiple sources. Adapted from <>.
4. KDD and Data-Mining Developments in Biology
Biological data accumulate exponentially in both volume and complexity, and the
disparity between the amount of data and the knowledge extracted from it increases
likewise. Automation of knowledge discovery is a part of the solution to this problem.
The fields in which the application of KDD shows an increasing importance include:
annotation of masses of data, structural and functional genomics, protein structure
prediction and modelling, analysis of biological effects (function, signalling patterns),
identification of distantly related proteins, and practical applications (eg. drug design).
4.1. Annotation of masses of data.
The current estimate of the doubling time in both number of entries and sequence base-pairs
in DNA databases is 14-24 months. This is largely because of the automated generation of
Expressed Sequence Tags (ESTs), which now comprise more than 2/3 of the database
entries. Only 5% of an estimated 10^5 human genes have been currently annotated. The
components of gene discovery include a) gene identification, b) gene characterisation,
and c) gene expression. A significant effort has been directed at developing tools for gene
discovery. These tools include GRAIL (Uberbacher et al., 1996) and the Gene Index
browser (et al., 1998). A review on computational gene discovery can be found in
(Rawlings and Searls, 1997). Examples of the discovery of novel genes facilitated by
data-mining from the databases have also been reported.
4.2. Structural and Functional Genomics
Genomics refers to mapping, sequencing and analysis of the complete set of genes and
chromosomes in organisms. An initial phase of a genome analysis is construction of high-
resolution genetic, physical and transcript maps of an organism - structural genomics. The
advanced stage comprises assessment of gene function by making use of the
information and reagents provided by structural genomics - functional genomics.
According to Hieter and Boguski (1997), "Computational biology will perform a critical
role in this effort: whereas structural genomics has been characterised by data
management, functional genomics will be characterised by mining data sets for
particularly valuable information. Functional genomics promises to rapidly narrow the gap
between sequence and function and to yield new insights into the behaviour of biological
systems". A framework for genomic analysis has been outlined in (Tatusov et al.).
4.3. Protein structure prediction and modelling
The structure of a protein can elucidate its function, in both general and specific terms, and
its evolutionary history (Brenner et al., 1996). Numerous methods have been developed
for protein structure analysis in the last two decades (see sections IV and V of Methods in
Enzymology, Vol. 266, 1996). Nevertheless, we still lack knowledge of the structure of
the majority of known proteins. Secondary and tertiary structures of only 33% of all
sequences in the SWISS-PROT database are currently available - see the HSSP database
(Dodge et al.).
4.4. Analysis of biological effects
Biological systems are characterised by a high degree of complexity, and the processes
involved are usually multi-step and involve multiple components. Sequence databases contain
little, if any, knowledge on biological systems and processes, but contain voluminous
low-level data. The biological effects need to be studied at a higher level of abstraction.
The relevant data are available either as expert knowledge or in the literature.
The high-level structure can be encoded in a form of a knowledge base or as a
model which can then be used to formulate and perform complex queries. An
example of a knowledge base is the system of Chen et al. (1997). Promising results in
modelling HIV infections were produced by a rule-based cellular automaton
which addresses the complexities of the immune system (Sieburg et al.).
4.5. Identification of distantly related proteins
This is a notoriously difficult field which is likely to continue to test the limits in the
development of data-mining methods. This field also provides a unifying theme for the
fields described above (sections 4.1 - 4.4). Distant relations between biological sequences
provide the main clues for identification and characterisation of novel sequences in the
databases. Common approaches include sequence similarity searches, determination of amino
acid motifs, determination of conserved domains, and matching sequence patterns. The
primary goal is the determination of sequences which display low similarity, but which are
significantly related. A discussion of issues in detection of distant similarities can be
found in (Catell et al., 1996). More sophisticated methods such as Hidden Markov Models
(eg. Krogh et al.) are gaining popularity. Sequence pattern discovery methods are
covered in the ISMB tutorial series (Brazma and Jonassen, 1998).
4.6. Practical applications
Bioinformatics is becoming an important field in drug and vaccine design.
Determination of novel compounds in the pharmaceutical and agricultural industries
includes simultaneous screening of very large numbers of samples, such as compound
collections and combinatorial libraries, termed High Throughput Screening (HTS).
The main challenge in drug discovery research is to identify rapidly novel lead
compounds. HTS produces enormous amounts of data which are generally not matched with
the ability to analyse these data, creating a bottleneck. KDD and data-mining techniques
will keep playing an increasingly important role in this field. Data-mining techniques
have been established for determination of peptide candidates for vaccines and
immunotherapeutic drugs (eg. Brusic et al., 1994a;1998a;1998b) and will be discussed in
more detail in the next section.
5. Case Studies: Application of KDD in Immunology
5.1. Background
In this section we provide background information which is intended to aid
understanding of our case studies. This information summarises the first step of
the KDD process, 'Learning the application domain'.
5.1.1. Biology
T cells of the immune system in vertebrates recognise short antigenic peptides derived from
the degradation of proteins. These peptides are presented on the surface of antigen
presenting cells to the T cells by MHC (major histocompatibility complex) molecules
(reviewed in Rammensee et al., 1993; Cresswell). MHC molecules bind peptides
produced mainly by intracellular (MHC class I) or extracellular (MHC class II) degradation
of proteins. A cancer cell or a cell infected by a virus, for example, presents a subset of
peptides that are different from those presented by a healthy cell. In a healthy organism,
cells displaying 'foreign' antigenic peptides are destroyed by the immune system.
Antigenic peptides serve therefore as recognition labels for the immune system and are
keys in the mechanism of triggering and regulation of the immune response. Peptides
that mediate an immune reaction are termed T-cell epitopes. The ability to determine T-cell
epitopes is therefore important for our understanding of how the immune system functions,
and opens ways towards the design of drugs and vaccines.
5.1.2. MHC/peptide binding problem
MHC molecules play a central role in immune interactions at the molecular level. Binding
of a peptide to a MHC molecule is mediated through bonds between the binding groove of the
MHC molecule and the peptide backbone, as well as through interaction between side chains
of amino acids that form a peptide and specific pockets within the groove (Bjorkman et al.,
1987; Brown et al., 1993). Peptide/MHC binding is thus influenced by the overall structure
of the peptide and by the side chains of the individual amino acids. Individual amino acids
in particular positions within peptides may have a positive, neutral or negative contribution to
binding. These contributions have been exemplified in binding motifs (Rammensee
et al., 1995). Binding motifs provide a qualitative description of the contribution to binding
of each amino acid (of the possible 20) at a particular position within MHC-binding
peptides. Many variants of MHC molecules are known in humans (see the Histo database).
Different MHC molecules bind peptide sets that may be distinct or may overlap to various
degrees. Class I molecules are expressed on the majority of the cells in the organism and
display peptides derived from degradation of intracellular proteins. These peptides are
mainly 9 amino acids long, ranging from 8 to 11, with a very few exceptions. The whole
length of class I peptides is accommodated within the binding cleft. Class I MHC molecules
complexed by a peptide are recognised by cytotoxic T cells, which kill cells displaying
foreign peptides. Class II MHC molecules are expressed on antigen presenting cells, a
subset of cells that specifically interact with T cells. T cells that recognise class II MHC
molecule/peptide complexes perform regulatory function within the immune response. The
binding cleft of class II MHC molecules accommodates a 9-mer binding core, with peptide
ends extending out of the groove. The majority of peptides that bind class II MHC
molecules are 10-30 amino acids long (Chicz et al., 1993).
5.1.3. Peptide processing
Class I antigenic peptides are generated by the degradation of cytosolic proteins and, in
order to be presented to T cells, need to gain entry to the endoplasmic reticulum (ER), the
site where peptide binding to MHC class I occurs. TAP is a transmembrane protein
responsible for the transport of antigenic peptides into the ER (Germain, 1994). A
schematic representation of class I peptide processing is given in Fig. 6.
Figure 6.
A schematic view of the MHC class I presentation pathway. A cytoplasmic protein is
degraded and peptide fragments transported to the ER via TAP or alternative routes. The peptide binds to
MHC molecules in the ER, where additional trimming of the peptide to the correct size may occur.
Trimmed peptide may also be re-exported to the cytosol. Once a peptide is bound to a MHC
molecule, the complex is exported to the cell surface for presentation to T cells. The peptide-MHC complex
might be recognised by an appropriate T cell or be internalised back into the cell and degraded.
Figure 7.
Models for prediction of MHC-binding peptides. A) An example of a binding
motif, which indicates the positions and amino acids of main anchors, preferred and
forbidden residues. B) A matrix which quantifies the contribution to MHC/peptide binding of each amino
acid at each position of a peptide. The predicted binding affinity is calculated as a sum of coefficients for
amino acids within a peptide. C) An ANN model used to learn MHC-binding patterns, comprising 180
input units, 2 hidden layer units and a single output unit. A representation of an individual amino acid is a
binary vector of length 20.
Relative position
1 2 3 4 5 6 7 8 9
Anchor (bold),
F,W N,S pol.* pol.*
preferred or
,I I,L T,Q chg.* ali.*
forbidden (italic) L,V D,E H,R ali.* K
residues M no
*pol.: polar; chg.:charged; ali.:aliphatic residues.
A) A Binding Motif of human HLA-DRB1*0401 (Rammensee
et al
., 1995)
B) A Quantitative Matrix of human HLA-DRB1*0401 (adapted from Hammer
et al
., 1994)
Position Coefficients for the 20 amino acids (one column per amino acid)
P1 * * * * 0 * * -10 * -10 -10 * * * * * * -10 0 0
P2 0 0 -13 1 8 5 8 11 11 10 11 8 -5 12 22 -3 0 21 -1 9
P3 0 0 -13 -12 8 2 2 15 0 10 14 5 3 0 7 2 0 5 0 8
P4 0 0 17 8 -8 -15 8 8 -22 -6 14 5 -21 11 -15 11 8 5 -12 -10
P5 0 0 -2 -1 3 2 -1 1 3 1 3 2 5 1 0 4 6 4 -1 -2
P6 0 0 0 -12 -13 -11 -16 -2 -23 -13 -13 17 1 -12 -22 17 19 13 -9 -11
P7 0 0 -11 -2 -8 -15 -8 -2 -12 4 7 -1 -3 -5 -12 -4 -2 5 -13 -7
P8 0 0 -11 -2 1 -5 0 -1 9 6 4 7 -2 16 7 6 5 4 6 13
P9 0 0 -25 -18 -8 -2 3 -4 -9 -13 -4 -11 -16 7 -9 12 -3 5 -3 -15
* forbidden amino acid
C) An ANN model for prediction of MHC-binding peptides (see Brusic
et al
., 1998a)

5.1.4. Models
Three types of models that incorporate biological knowledge have been used for prediction
of MHC binding peptides: binding motifs (Rammensee et al., 1995), quantitative matrices
(Parker et al., 1994; Hammer et al., 1994) and artificial neural networks (Brusic et al.,
1994a; Brusic et al., 1998a). Binding motifs (Fig. 7a) are the simplest models, which
represent peptide anchoring patterns and the amino acids commonly observed at anchor
positions. Quantitative matrices (Fig. 7b) provide coefficients that quantify contribution of
each amino acid at each position within a peptide to MHC/peptide binding. Matrices encode
higher complexity than binding motifs but ignore the effect of the overall structure of
peptide, such as influences of neighbouring amino acids. We can encode an arbitrary level
of complexity in artificial neural network (ANN) models (Fig. 7c) by varying the number
of hidden layer nodes or the number of hidden layers. ANN models can therefore encode
both the effects of the overall peptide structure and of individual amino acids to
MHC/peptide binding. If sufficient data are available, more complex models perform
better, as shown in a comparative study (Brusic et al., 1998a). On the other hand, it is not
beneficial to use models whose complexity exceeds the complexity of the process that
generated data. This will increase required amounts of data for model building and possibly
worsen the predictive performance of the model.
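The matrix-based prediction described above (score of a peptide = sum of per-position coefficients, Fig. 7b) can be sketched in a few lines. The mini-matrix below is invented for illustration, with 3 positions and a handful of residues; a real model would use 9 positions and 20 amino acids per position, eg. the HLA-DRB1*0401 matrix of Hammer et al. (1994).

```python
# A minimal sketch of quantitative-matrix scoring. Matrix values are
# invented; '*' (forbidden) entries of Fig. 7b are modelled as -infinity.
FORBIDDEN = float("-inf")

matrix = {
    1: {"F": 0, "W": 0, "Y": 0, "R": FORBIDDEN},
    2: {"A": 0, "N": 8, "K": -1},
    3: {"A": 0, "S": 5},
}

def predict_score(peptide):
    """Sum per-position coefficients; residues absent from the table score 0."""
    return sum(matrix[i + 1].get(aa, 0) for i, aa in enumerate(peptide))

print(predict_score("FNS"))  # 0 + 8 + 5 -> 13
```

A single forbidden residue drives the total score to minus infinity, reproducing the behaviour of the '*' entries: one disallowed anchor rules the peptide out regardless of the other positions.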
5.1.5. Data and analysis
The purpose of predictive models of MHC/peptide interactions is to help determine peptides
that can bind MHC molecules and therefore are potential targets for immune recognition in
vivo. Various experimental methods have been developed to measure (directly or indirectly)
peptide binding to MHC molecules. Van Elsas et al. (1996) reported the results of three
experimental binding methods in determining T-cell epitopes in a tumour-related antigen
(Melan-A/MART-1) in context of human MHC molecule HLA-A*0201. The summary of
their report is given in Fig. 8, being an instance of poor correlation of results between
various experimental binding methods. In the development of predictive models, we want
to maximally utilise available data. Combining data from multiple experimental methods
requires dealing with imprecise and inexact measurements. For MHC binding, fuzzy
measures of high-, moderate-, low- and zero-affinity binding have been commonly used.
The application of fuzzy logic (Zadeh, 1965) enables quantification of fuzzy data sets and
the extraction of rules for model building. Artificial neural networks are particularly useful
for extracting rules from fuzzy data (Kosko, 1993) and have been successfully used for
prediction of MHC binding peptides (reviewed in Brusic and Harrison, 1998). By
trimming ANN models of MHC/peptide binding we can demonstrate that binding motifs
and quantitative matrices represent different levels of complexity of the same model,
showing that the basic rules of MHC/peptide interactions, ie. background knowledge, has
been captured in these models.
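Pooling data from multiple experimental methods, as discussed above, starts by mapping each measurement onto the fuzzy high/moderate/low/zero-affinity classes. A minimal sketch follows; the IC50-style thresholds are hypothetical, and in practice each laboratory method needs its own calibration before data sets can be combined.

```python
# Hedged sketch of fuzzy binding categories. Thresholds are invented;
# they stand in for method-specific calibration.
def affinity_category(ic50_nM):
    """Map a measured half-inhibitory concentration to a fuzzy class."""
    if ic50_nM is None or ic50_nM > 50000:
        return "none"
    if ic50_nM > 5000:
        return "low"
    if ic50_nM > 500:
        return "moderate"
    return "high"

print(affinity_category(120))  # -> high
```

Once every source is reduced to the same four categories, measurements that disagree numerically (as in Fig. 8) can still be combined at the categorical level.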
5.1.6. Predictions
A decade after the basic function of MHC molecules was described (Doherty and
Zinkernagel, 1975), a small database of T-cell epitopes was compiled, followed by
propositions of predictive models of T-cell epitopes. One such model (DeLisi and
Berzofsky, 1985) was based on the assumption that a T-cell epitope forms an amphipatic
helix (a helical structure of a peptide which has one side hydrophilic ie. attracts water
molecules and the other hydrophobic ie. repels water molecules), which binds into the
groove of MHC molecules. Although the amphipatic model was incorrect, it was used for a
decade. Those predictions that were fortuitously correct were also preferentially reported in
the literature, reinforcing the presumed usefulness of the model. It was another decade
before the models based on detailed knowledge of peptide/MHC interactions emerged
(reviewed in Brusic and Harrison, 1998). Biases in data arise from a non-critical usage of
proposed binding motifs (PBM) which reinforces data around peptides that conform well
with PBM. There are many examples of peptides that do not conform to the PBM, yet bind
the corresponding MHC molecule; many of these peptides are also reported as T-cell
epitopes (see Brusic et al., 1994b;1998c).
Prediction of T-cell epitopes is possible only relative to specific MHC alleles. However,
peptide binding to the MHC molecule is a necessary, but not sufficient, condition for its 'T-
cell epitopicity'. To be a T-cell epitope, a peptide must be recognised by a matching T cell,
and thus the T-cell epitopicity of a peptide can only be determined in the context of a target
biological system (an organism or a particular cell line). The prediction of T-cell epitopes is
often confused with the prediction of MHC-binding peptides. In the determination of T-cell
epitopes, prediction of MHC-binding peptides equates to narrowing of the pool of
potential T-cell epitopes.
Figure 8.
A summary comparison of the results of three experimental methods for determination of HLA-
A*0201 binding peptides from a tumour antigen MART-1 (adapted from van Elsas et
al., 1996). The fuzzy measures of binding affinity (high, moderate, low and none) are used on the vertical
scale. Binding results for control peptides correlate well, while those for MART-1 peptides correlate
poorly. The diamonds indicate T-cell epitopes.
5.1.7. Data errors
Noise and errors in the data affect the ability to derive useful models. Brusic et al.
(1997) studied the effect of noise in data sets on the development of quantitative matrix
models, and showed that a moderate level of noise significantly affects the ability to
develop matrix models. For example, 5% of erroneous data in a data set will double the
number of data points, relative to a 'clean' data set, required to build a matrix model of a
pre-set accuracy. On the other hand, 5% of errors does not significantly affect the overall
success of prediction of ANN models, due to their ability to handle imperfect or incomplete
data (Hammerstrom, 1993).
(Fig. 8 axis labels: peptides 1-31 from the Melan-A/MART-1 antigen (1-26) and control
peptides (27-31); methods: competitive binding, half-time dissociation.)
5.2. Data-Mining Tools
Data-mining tools ie. models of MHC/peptide interactions can be determined
experimentally, can be data-driven or both. Binding motifs are determined experimentally
either by pool sequencing (Falk et al., 1991) or by amino acid substitution studies (eg.
Harrison et al., 1997). Quantitative matrices can be determined experimentally by
systematic substitution studies (eg. Hammer et al., 1994), or by using computational
search methods and data from databases (eg. Brusic et al., 1997). ANN-based models are
purely data-driven. Critical issues in building predictive MHC-binding models are:
a) selection of the search method, and
b) evaluation of performance of the model.
5.2.1. Selection of the search method
An experimental 'search' for parameters of the MHC/peptide models is expensive - it
requires synthesis and testing of multiple peptides. A further disadvantage is that
experimentally derived models also encode methodological artefacts - peculiarities and
biases specific to a particular experimental method (see sections 3.2.2, 3.2.3 and 5.1.5).
If sufficient data are available, data-driven methods for determination of the parameters of
the model (eg. coefficients of a quantitative matrix) are cheap and useful. Building data-
driven models requires, however, understanding of the nature of the data and careful pre-
processing.
Because of ambiguities resulting from the variable length of reported binders and the
uncertain location of their core regions, peptides tested experimentally for binding and used
to build a predictive model require pre-processing by alignment relative to their binding
anchors. For MHC class I peptides this is a simple problem because of the presence of
well-defined anchor positions and minimal variability in peptide length. MHC class II
binding peptides, however, have more degenerate motifs and the alignment is a non-trivial
problem. A solution for the alignment problem is described in section 5.3, in which we
describe our PERUN method.
When an acceptable peptide alignment step is completed, we can proceed with
determination of model parameters. A common initial suggestion is to use the frequencies
of amino acids for data fitting. A data set of peptides known to bind a particular MHC
molecule would typically contain sizeable subsets which contain variants of a single peptide
(eg. single amino acid substitutions). This problem can be rectified by various weighting or
scaling schemes. We suggest the use of approaches which optimise the overall predictive
performance of the model.
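One simple weighting scheme of the kind mentioned above can be sketched as follows. It is illustrative only, not the scheme of any cited paper: each peptide is down-weighted by the number of near-identical variants (Hamming distance at most 1) present in the data set, so a cluster of single-substitution analogues counts roughly as one observation.

```python
# Illustrative redundancy weighting for peptide data sets. The peptide
# sequences below are invented.
def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def redundancy_weights(peptides):
    """Weight of a peptide = 1 / (number of near-duplicates, incl. itself)."""
    weights = {}
    for p in peptides:
        neighbours = sum(1 for q in peptides
                         if len(q) == len(p) and hamming(p, q) <= 1)
        weights[p] = 1.0 / neighbours
    return weights

data = ["FNSKLAVQT", "FNSKLAVQA", "FNSKLAVQS", "WIRDPEQGH"]
print(redundancy_weights(data))
```

Here the three single-substitution variants each receive weight 1/3, while the unrelated peptide keeps weight 1; the weights can then be used during data fitting so the variant cluster does not dominate the amino acid frequencies.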
5.2.2. Evaluation of performance of predictive models
MHC binding data contain biases that affect the performance of prediction systems. Data
representation must encode all relevant features needed for prediction, which determines the
minimal representation. On the other hand, too large a representation may introduce
unnecessary complexity that can adversely affect performance of the prediction system. The
most commonly used measure of a prediction system is the error rate. The true error rate may
differ from the apparent one and depends on factors including the number, quality and
statistical distribution of available data and the estimation technique (Weiss and Kulikowski,
1991). Examples of theoretical validation of models applied to prediction of MHC-binding
peptides include:
a) single train-and-test partition (eg. Brusic et al., 1994a),
b) bootstrapping (Adams and Koziol, 1995), and
c) 10-fold cross-validation (Brusic et al., 1997;1998a).
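The 10-fold cross-validation procedure listed in c) can be sketched as follows: the data are split into 10 disjoint folds, and each fold in turn is held out for testing while the model is fit on the remaining nine. The 'model' here is a trivial majority-class predictor standing in for a matrix or ANN model; the data set is synthetic.

```python
# Sketch of k-fold cross-validation with placeholder train/predict functions.
import random

def ten_fold_cv(examples, train, predict, k=10, seed=0):
    """Mean held-out accuracy over k disjoint folds."""
    rng = random.Random(seed)
    examples = examples[:]
    rng.shuffle(examples)
    folds = [examples[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(training)
        correct = sum(predict(model, x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k

# Trivial stand-ins: each example is (features, label).
def train(data):
    labels = [y for _, y in data]
    return max(set(labels), key=labels.count)  # majority label

def predict(model, x):
    return model  # always predicts the majority label, ignoring x

data = [((i,), "binder" if i % 3 == 0 else "non-binder") for i in range(100)]
print(round(ten_fold_cv(data, train, predict), 2))  # -> 0.66
```

Because every example is tested exactly once, the averaged accuracy uses all available data while never testing a model on a peptide it was trained on.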
Systematic experimental validation of MHC-binding data-mining tools is laborious and
expensive. Such validation includes synthesis of peptides (usually overlapping peptides
that span the length of a protein antigen) and binding testing. We performed a
systematic experimental validation of predictive models (Brusic et al., 1998a) for a
human class II molecule, HLA-DRB1*0401. We compared performances of binding-motif-based,
quantitative-matrix-based and PERUN (evolutionary algorithm/ANN) methods. A
popular approach includes synthesis and subsequent experimental testing of predicted
positives (eg. Salazar-Onfray et al.).
5.3. PERUN
PERUN is a hybrid method for the prediction of peptides that bind to MHC class II
molecules (Brusic et al., 1998a). It utilises: a) available experimental data and expert
knowledge of binding motifs, b) an evolutionary algorithm to derive alignment
matrices, c) alignment (quantitative) matrices for peptide alignment, and d) an
ANN for classification. The key elements of PERUN are depicted in Figure 9. We tested
the ability of PERUN to predict peptides that bind to the HLA-DRB1*0401 human MHC
class II molecule, associated with insulin-dependent diabetes and rheumatoid arthritis, and
validated prospectively its accuracy. PERUN combines high accuracy of predictions with
the ability to integrate new data and self-improve.
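The evolutionary-algorithm idea behind matrix derivation can be sketched as follows. This is a heavily simplified caricature, not PERUN's actual algorithm: a population of candidate scoring matrices is mutated, and matrices that better separate known binders from non-binders survive. The alphabet, peptides, fitness function and EA parameters below are all invented toy stand-ins.

```python
# Toy evolutionary search for a position-weight matrix (illustrative only).
import random

rng = random.Random(1)
ALPHABET = "AFKN"
LENGTH = 3
binders = ["FNK", "FNA", "FKK"]
non_binders = ["AAA", "KAN", "NAA"]

def random_matrix():
    return [{aa: rng.uniform(-1, 1) for aa in ALPHABET} for _ in range(LENGTH)]

def score(matrix, peptide):
    return sum(matrix[i][aa] for i, aa in enumerate(peptide))

def fitness(matrix):
    # Separation of mean scores of binders and non-binders (unbounded;
    # a real fitness would normalise or use predictive accuracy).
    mb = sum(score(matrix, p) for p in binders) / len(binders)
    mn = sum(score(matrix, p) for p in non_binders) / len(non_binders)
    return mb - mn

def mutate(matrix):
    child = [dict(col) for col in matrix]
    child[rng.randrange(LENGTH)][rng.choice(ALPHABET)] += rng.gauss(0, 0.3)
    return child

# Selection + mutation loop: keep the best half, refill with mutants.
population = [random_matrix() for _ in range(20)]
for generation in range(200):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    population = parents + [mutate(rng.choice(parents)) for _ in range(10)]

best = max(population, key=fitness)
assert fitness(best) > 0  # binders now outscore non-binders on average
```

In PERUN the analogous search is coupled to the peptide alignment problem: the evolved matrix is used to locate the 9-mer binding core of each class II peptide, and the aligned peptides then train the ANN.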
Figure 9.
The PERUN algorithm for determination of MHC-binding peptides. Experimental data are
taken from a database of MHC-binding peptides (Brusic et al., 1998c). Peptides were aligned by an
evolutionary algorithm and then used to train an ANN. The trained ANN is then used to predict binders and
new experimental data are used for refinement of the model. For further details see (Brusic et al., 1998a).
The predictive performance of three methods was compared using Relative Operating
Characteristic (ROC) analysis (Swets, 1988), Table 1, and experimental binding results for
62 peptides spanning protein antigen tyrosine phosphatase (Honeyman et al., 1997).
Aroc for arbitrary binding-definition threshold
Prediction method Low-affinity Moderate-affinity High-affinity
PERUN 0.73 (0.06) 0.86 (0.06) 0.88 (0.06)
MATRIX 0.73 (0.06) 0.82 (0.07) 0.87 (0.07)
MOTIF 0.63 (0.07) 0.69 (0.09) 0.74 (0.1)
Table 1. Comparison of performance of three predictive methods. The measure of performance is the area
under the ROC curve with standard error area given in parentheses. Value of Aroc=0.5 indicates random
guessing while Aroc=1 indicates correct prediction for all test cases. Empirically, values of Aroc>0.7 are
considered as significant (Swets, 1988). The analysis was performed by comparing predictions at three
arbitrarily defined thresholds for the definition of binding peptides. The models used for comparison are
depicted in Fig. 7a (motif), Fig 7b (matrix) and Fig 9 (PERUN).
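The Aroc values in Table 1 are areas under the ROC curve. Aroc also equals the probability that a randomly chosen positive (binder) receives a higher prediction score than a randomly chosen negative, which gives a direct way to compute it without plotting the curve. The scores below are invented for illustration.

```python
# Area under the ROC curve via pairwise comparisons (ties count 0.5).
def area_under_roc(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

binders = [0.9, 0.8, 0.7, 0.4]      # prediction scores of true binders
non_binders = [0.6, 0.3, 0.2, 0.1]  # prediction scores of true non-binders

print(area_under_roc(binders, non_binders))  # 15 of 16 pairs -> 0.9375
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect separation, matching the interpretation of Aroc given in the Table 1 caption.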
5.4. Study of Human TAP Transporter
We used techniques of knowledge discovery in databases (KDD) to study the nature of
binding of peptides to the human transporter associated with antigen processing - TAP (see
section 4.1.3). An artificial neural network-based computer model of peptide-TAP binding
was built and used in the data mining step of a KDD process. An iterative refinement of the
predictive ANN was performed (Table 2), before the model was considered Ôsufficiently
accurateÕ. A database of HLA-binding peptides (Brusic et al., 1998c) was mined for
patterns of TAP binding. KDD process applied in this project is shown in Fig. 10.
We found that the affinity of HLA-binding peptides for TAP differs according to the HLA
molecule concerned: B27-supertype, A3-supertype or HLA-A24 binding peptides have
high affinity for TAP, whereas most peptides that bind A2-supertype, B7-supertype or
HLA-B8 lack affinity for TAP (Table 3). An equivalent experimental study would require
several thousand experiments using a variety of HLA molecules and is not feasible.
Model     Training   Test   Aroc               Correlation
Initial   272        80     0.82, 0.89, 0.92   0.74
Cycle 1   394        100    0.81, 0.86, 0.95   0.64
Cycle 2   494        494    0.90, 0.95, 0.96   0.83
Table 2. The sizes of the training and test sets for the ANN models, and measures of the model
performances. The respective values are for three different thresholds for definition of affinity of TAP-
binding peptides (low, moderate and high),
P<0.0001 in all cases.
Testing was performed by 10-fold cross-
validation (described in Weiss and Kulikowski, 1991).
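The 10-fold cross-validation used for testing (Weiss and Kulikowski, 1991) partitions the data into ten folds; each fold serves once as the test set while the remaining nine form the training set. A minimal sketch of the splitting scheme, not the code used in the study:

```python
def k_fold_splits(data, k=10):
    """Yield (train, test) pairs; each example appears in exactly one test set."""
    folds = [data[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        # training set = all examples outside the held-out fold
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Averaging a performance measure such as Aroc over the ten held-out folds gives an estimate of how the model generalises to unseen peptides.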
Figure 10. The algorithm of the KDD process applied to our study of TAP. The 'internal new data' are
experimental data generated during the model refinement process, while the 'external new data' were
acquired from external sources.
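The refinement loop of Fig. 10 can be summarised in code. The function names (train, evaluate, acquire_new_data) and the stopping criterion below are hypothetical placeholders introduced for illustration, not part of the original study:

```python
def kdd_refinement(initial_data, train, evaluate, acquire_new_data,
                   target_accuracy=0.9, max_cycles=5):
    """Iteratively refine a predictive model, as in Fig. 10.

    New data (internal: generated experimentally during refinement;
    external: acquired from outside sources) are folded into the
    training set until the model is deemed 'sufficiently accurate'.
    All names and the target_accuracy value are illustrative.
    """
    data = list(initial_data)
    model = train(data)                       # initial model
    for _ in range(max_cycles):
        if evaluate(model) >= target_accuracy:
            break                             # sufficiently accurate
        data.extend(acquire_new_data(model))  # internal/external new data
        model = train(data)                   # refined model
    return model
```

In the TAP study this cycle corresponds to the progression from the initial model through Cycle 1 and Cycle 2 in Table 2, with each cycle enlarging the training set.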
6. Summary Notes
6.1. Knowledge discovery in databases (KDD) is the process of identifying valid, novel,
potentially useful and understandable patterns in data.
6.2. The KDD process involves the following steps: learning the application domain,
creating a target data set, data cleaning, data reduction and projection, choosing the
function of data-mining, choosing the data-mining algorithms, data-mining,
interpretation, using the discovered knowledge, and evaluation against the KDD
purpose.
6.3. The practical goals of data-mining are prediction and description. These goals can be
achieved using a variety of data-mining methods such as classification, clustering,
summarisation, dependency modelling, probabilistic dependency networks, and change
and deviation detection.
6.4. Data-mining tools include: a) decision trees and rules, b) nonlinear regression and
classification methods, c) example-based methods, d) probabilistic graphic dependency
models, and e) relational learning models.
6.5. There is no universal data-mining method, and choosing a particular algorithm for a
particular application is an art rather than a science. In practice, a large portion of the
application effort should go into properly formulating the problem rather than into
optimising the algorithmic details of a particular data-mining method.
6.6. The four most important data-related considerations for the analysis of biological
systems are an understanding of: a) the complexity and hierarchical nature of the
processes that generate biological data, b) the fuzziness of biological data, c) biases and
potential misconceptions, and d) the effects of noise and errors.
6.7. Biological databases are geographically scattered and highly heterogeneous in
content, format, and their tools for data management and manipulation. The main
database-related issues in biology are the integration of heterogeneous data sources and
flexible access to these sources. The utilisation of standards and tools such as those
contained in the CORBA or Kleisli systems will be essential for the future development
of integrated biological applications and consequently for the design of KDD processes.
6.8. The fields where the application of KDD methodologies is growing in importance
include: annotation of masses of data, structural and functional genomics, protein
structure prediction and modelling, analysis of biological effects (function),
identification of distantly related proteins, and practical applications (e.g. drug design).
HLA       HLA         Proportion of peptides by their TAP-binding affinity (%)
allele    supertype   non-binder   low   moderate   high
A*0201    A2              54        37       8        1
          A2              33        62       5        0
A*0301    A3              29        29      29       14
A*1101    A3              38        24      36        2
A*3301    A3              24        45      21        9
A*6801    A3              21        50      24        5
B*0702    B7              75        22       2        2
B*3501    B7              67        28       2        2
B*5101    B7              81        11       8        0
B*5102    B7              81        13       6        0
B*5103    B7              80        13       7        0
B*5301    B7              77        21       2        0
B*5401    B7              85        13       2        0
          B27              0        17      54       29
B*2705    B27              4        24      48       24
A*2402?                   15        48      25       11
A2        A2              45        37      16        2
A1        A3              11        36      44        8
A3        A3              24        35      28       13
A11       A3              27        46      23        4
A31       A3              18        42      32        8
B27       B27              8        16      47       29
B51       B7              77        20       3        0
B7                        44        40      16        0
                          58        38       0        4
                          63        28       8        1
Table 3. Percentages of known HLA-specific binders classified according to their predicted TAP binding.
Supertypes A2, A3 and B7 were defined as in (Bertoni et al., 1997), and we assigned the B27 supertype.
stands for a set of 1000 randomly generated peptides.
? Number of tested peptides <30.
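The four-way classification of predicted TAP-binding affinity used in Tables 2 and 3 amounts to binning a predicted score against three thresholds. A sketch with made-up threshold values (the actual thresholds used in the study are not reproduced here):

```python
def classify_affinity(score, thresholds=(0.3, 0.6, 0.8)):
    """Bin a predicted binding score into the four Table 3 categories.

    The threshold values are illustrative placeholders.
    """
    low, moderate, high = thresholds
    if score >= high:
        return "high"
    if score >= moderate:
        return "moderate"
    if score >= low:
        return "low"
    return "non-binder"
```

Because the thresholds are arbitrary, the predictions in Table 1 were compared at all three cut-offs rather than at a single one.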
7. Authors' Affiliations
Vladimir Brusic, The Walter and Eliza Hall Institute of Medical Research, Melbourne,
Australia. Current address: Kent Ridge Digital Labs, 21 Heng Mui Keng Terrace,
Singapore 119613.
Vladimir Brusic has just taken a bioinformatics research position at Kent Ridge Digital
Labs in Singapore. He has been a bioinformatician at the Walter and Eliza Hall Institute in
Melbourne, Australia, since 1989. He received a Masters degree in Biomedical
Engineering from the University of Belgrade in 1988 and the degree of Master of Applied
Science in Information Technology from the Royal Melbourne Institute of Technology in
1997. His research interests include computer modelling of biological systems,
computational immunology and complex systems analysis. He is the creator of
MHCPEP, an immunological database. He has developed several data-mining tools in
immunology which have been successfully applied to the prediction and determination of
vaccine targets in autoimmunity, cancer and malaria research.
John Zeleznikow, School of Computer Science and Computer Engineering, La Trobe
University, Melbourne, Australia.
John Zeleznikow is a Senior Lecturer at La Trobe University, where he teaches Database
and Information Systems subjects. He received the PhD degree in Mathematics from
Monash University, Melbourne, in 1980. His research interests include building Intelligent
Information Systems, Knowledge Discovery, and Knowledge Representation and
Reasoning, among others. He has developed several KDD applications in the legal domain
and has more recently expanded his work to the biological domain. He has co-authored,
with Dan Hunter, a book on building intelligent legal systems. He was a general chair of
the Sixth International Conference on Artificial Intelligence and Law. He has recently given
tutorials at the 7th and 9th International Legal Knowledge Based Systems Conferences, the
IFIP World Computer Congress, and the 14th and 15th British Expert Systems Conferences.
8. References
Adams H.P. and Koziol J.A. (1995). Prediction of binding to MHC class I molecules.
Journal of Immunological Methods, 185:181-190.
Agarwal P. and States D.J. (1996). A Bayesian evolutionary distance for parametrically
aligned sequences. Journal of Computational Biology, 3(1):1-17.
Altschul S.F., Boguski M.S., Gish W. and Wootton J.C. (1994). Issues in searching
molecular sequence databases. Nature Genetics, 6(2):119-129.
Ashley K.D. (1991). Modelling legal argument: Reasoning with cases and hypotheticals.
MIT Press, Cambridge.
Baldi P. and Brunak S. (1998). Bioinformatics: the Machine Learning Approach. MIT
Press, Cambridge.
Bareiss E.R., Porter B.W. and Wier C.C. (1988). Protos: An exemplar-based learning
apprentice. International Journal of Man-Machine Studies, 29:549-561.
Benton D. (1996). Bioinformatics - principles and potential of a new multidisciplinary tool.
Trends in Biotechnology, 14:261-272.
Bertoni R., Sidney J., Fowler P., Chesnut R.W., Chisari F.V. and Sette A. (1997).
Human histocompatibility leukocyte-binding supermotifs predict broadly cross-reactive
cytotoxic T lymphocyte responses in patients with acute hepatitis. Journal of Clinical
Investigation, 100:503-513.
Bjorkman P.J., Saper M.A., Samraoui B., Bennett W.S. and Strominger J.L. (1987).
Structure of the human class I histocompatibility antigen, HLA-A2. Nature,
Brachman R. and Anand T. (1996). The process of knowledge discovery in databases: a
human centered approach. In Fayyad U., Piatetsky-Shapiro G., Smyth P. and
Uthurusamy R. (eds.) Advances in Knowledge Discovery and Data Mining, pp. 37-58.
AAAI Press, Menlo Park, California.
Bradburn C., Zeleznikow J. and Adams A. (1993). FLORENCE: synthesis of case-based
and model-based reasoning in a nursing care planning system. Journal of Computers in
Nursing, 11(1):20-24.
Branting L.K (1991). Building explanations from rules and structured cases. International
Journal of Man-Machine Studies, 34:797-838.
Braren R., Firner K., Balasubramanian S., Bazan F., Thiele H.G., Haag F. and Koch-
Nolte F. (1997). Use of the EST database resource to identify and clone novel
mono(ADP-ribosyl)transferase gene family members. Advances in Experimental
Medicine and Biology, 419:163-168.
Brazma A. and Jonassen I. (1998). Sequence Pattern Discovery Methods. ISMB-98
tutorial series.
Brazma A., Vilo J., Ukkonen E. and Valtonen K. (1997). Data-mining for regulatory
elements in yeast genomes. ISMB, 5:65-74.
Brenner S.E., Chotia C., Hubbard T.J.P. and Murzyn A. (1996). Understanding Protein
Structure: Using Scop for Fold Interpretation. Methods in Enzymology, 266:635-643.
Brodie M. (1993). The Promise of Distributed Computing and the Challenge of Legacy
Information Systems. In Hsiao D., Neuhold E.J. and Sacks-Davis R. (eds.). Semantics
of Interoperable Database Systems (DS-5), IFIP Transactions A-25, pp 1-32, North
Holland, Amsterdam.
Brown J.H., Jardetzky T.S., Gorga J.C., Stern L.J., Urban R.G., Strominger J.L. and
Wiley D.C. (1993). Three-dimensional structure of the human class II
histocompatibility antigen HLA-DR1. Nature, 364(6432):33-39.
Brusic V. and Harrison L.C. (1998). Computational methods in prediction of MHC-binding
peptides. In Michalewicz M. (ed), Advances in Computational Life Sciences: Humans
to Proteins, pp. 213-222, CSIRO Publishing, Melbourne.
Brusic V., Rudy G. and Harrison L.C. (1994a). Prediction of MHC binding peptides
using artificial neural networks. In Stonier R. and Yu X.H. (eds.), Complex Systems:
Mechanism of Adaptation. IOS Press, Amsterdam/Ohmsha, Tokyo, pp. 253-260.
Brusic V., Rudy G. and Harrison L.C. (1994b). MHCPEP - a database of MHC binding
peptides. Nucleic Acids Research, 22:3663-3665.
Brusic V., Rudy G. and Harrison L.C. (1998c). MHCPEP - a database of MHC-binding
peptides: update 1997. Nucleic Acids Research 26:368-371.
Brusic V., Rudy G., Honeyman M.C., Hammer J. and Harrison L.C. (1998a). Prediction
of MHC class-II binding peptides using an evolutionary algorithm and artificial neural
network. Bioinformatics, 14:121-130.
Brusic V., van Endert P., Zeleznikow J., Daniel S., Hammer J. and Petrovsky N.
(1998b). A Neural Network Model Approach to the Study of Human TAP Transporter.
Manuscript in preparation.
Brusic V., Schnbach C., Takiguchi M., Ciesielski V. and Harrison L.C. (1997).
Application of genetic search in derivation of matrix models of peptide binding to MHC
molecules. ISMB, 5:75-83.
Bschking C. and Schleiermacher C. (1998). WWW-Based Sequence Analysis. ISMB-98
tutorial series.
Cattell K., Koop B., Olafson R.S., Fellows M., Bailey I., Olafson R.W. and Upton C.
(1996). Approaches to detection of distantly related proteins by database searching.
BioTechniques, 21(6):1118-1125.
Chen R.O., Feliciano R. and Altman R.B. (1997). RIBOWEB: linking structural
computations to a knowledge base of published experimental data. ISMB, 5:84-87.
Chen I.A., Kosky A., Markowitz V.M. and Szeto E. (1995). OPM*QS: The Object-
Protocol Model Multidatabase Query System. Technical Report LBNL-38181.
Chicz R.M., Urban R.G., Gorga J.C., Vignali D.A., Lane W.S. and Strominger J.L.
(1993). Specificity and promiscuity among naturally processed peptides bound to HLA-
DR alleles. Journal of Experimental Medicine, 178:27-47.
Codd E.F. (1993). Providing OLAP (On-Line Analytical Processing) to User-Analysts: An
IT Mandate. E. F. Codd and Associates.
Coggon D., Rose G. and Barker D.J.P. (1997). Epidemiology for the Uninitiated. Fourth
edition. BMJ Publishing Group, <>
Coppieters J., Senger M., Jungfer K. and Flores T. (1997). Prototyping Internet Services
for Biology based on CORBA. European Bioinformatics Institute, Hinxton, UK.
Cresswell P. (1994). Assembly, transport, and function of MHC class II molecules.
Annual Review of Immunology, 12:259-293.
Crowley E.M., Roeder K. and Bina M. (1997). A statistical model for locating regulatory
regions in genomic DNA. Journal of Molecular Biology, 268(1):8-14.
Davidson S.B., Overton C., Tannen V. and Wong L. (1997). BioKleisli: a digital library
for biomedical researchers. Journal of Digital Libraries 1(1):36-53.
Decker K. M. and Focardi, S. (1995). Technology Overview: A Report on Data-mining.
Technical Report 95-02. Swiss Scientific Computing Centre, CSCS-ETH, CH-6928,
Manno, Switzerland.
DeLisi C. and Berzofsky J.A. (1985). T-cell antigenic sites tend to be amphipathic
structures. Proceedings of the National Academy of Sciences of the United States of
America 82(20):7048-7052.
Dodge C., Schneider R. and Sander C. (1998). The HSSP database of protein structure-
sequence alignments and family profiles. Nucleic Acids Research 26(1):313-315.
Doherty P.C. and Zinkernagel R.M. (1975). A biological role for the major
histocompatibility antigens. Lancet 1(7922):1406-1409.
Durbin R. and Thierry Mieg J. (1991-). A C. elegans Database. Documentation, code and
data available from anonymous FTP servers: <>, <>
and <> .
Eckman B.A., Aaronson J.S., Borkowski J.A., Bailey W.J., Elliston K.O., Williamson
A.R. and Blevins R.A. (1998). Bioinformatics, 14:2-13.
Etzold T., Ulyanov A. and Argos P. (1996). SRS: information retrieval system for
molecular biology data banks. Methods in Enzymology, 266:114-128.
Falk K., Rotzschke O., Stevanovic S., Jung G. and Rammensee H.G. (1991). Allele
specific motifs revealed by sequencing of self-peptides eluted from MHC molecules,
Nature, 351:290-296.
Fayyad U., Piatetsky-Shapiro G. and Smyth P. (1996). From data-mining to knowledge
discovery. AI Magazine, 17(3):37-54.
Firebaugh M.W. (1989). Artificial intelligence: a knowledge-based approach. PWS-Kent.
Germain R.N. (1994). MHC-dependent antigen processing and peptide presentation:
providing ligands for T lymphocyte activation. Cell, 76:287-299.
Hammer J., Bono E., Gallazzi F., Belunis C., Nagy Z. and Sinigaglia F. (1994). Precise
prediction of MHC class II-peptide interaction based on peptide side chain scanning.
Journal of Experimental Medicine, 180:2353-2358.
Hammerstrom D. (1993). Neural networks at work. IEEE Spectrum, 30:26-32.
Hammond K.J. (1989). Case-based planning: Viewing planning as a memory task.
Academic Press, San Diego.
Harrison L.C., Honeyman M.C., Tremblau S., Gregori S., Gallazzi F., Augstein P.,
Brusic V., Hammer J. and Adorini L. (1997). Peptide binding motif for I-Ag7, the class
II MHC molecule of NOD and Biozzi ABH mice. Journal of Experimental Medicine.
Hieter P. and Boguski M. (1997). Functional Genomics: It's All How You Read It.
Science, 278:601-602.
Honeyman M., Brusic V. and Harrison L.C. (1997). Strategies for identifying and
predicting islet autoantigen T-cell epitopes in Insulin-dependent Diabetes Mellitus.
Annals of Medicine, 29:401-404.
Hu J., Mungall C., Nicholson D. and Archibald A.L. (1998). Design and implementation
of a CORBA-based genome mapping system prototype. Bioinformatics, 14(2):112-
Kolodner J. (1993). Case based reasoning. Morgan Kaufmann, Los Altos.
Kosko B. (1993). Fuzzy Thinking. The New Science of Fuzzy Logic. Harper Collins
Publishers, Glasgow.
Kosky A., Szeto E., Chen I.A. and Markowitz V.M. (1996). OPM data management tools
for CORBA compliant environments. Technical Report LBNL-38975. Lawrence
Berkeley National Lab. <>.
Koton P. (1988). Reasoning about evidence in causal explanation. In Proceedings of the
Sixth National Conference on Artificial Intelligence (AAAI-88), pp. 256-261, Los
Altos, Morgan Kaufmann.
Krogh A., Mian I.S. and Haussler D. (1994). A hidden Markov model that finds genes in
E. coli DNA. Nucleic Acids Research, 21:4768-4778.
Lawrence C.E. (1998). Bayesian inference algorithms. ISMB-98 tutorial series.
Leng B., Buchanan B.G. and Nicholas H.B. (1994). Protein secondary structure
prediction using two-level case-based reasoning. Journal of Computational Biology,
Markowitz V.M. (1995). Heterogeneous Molecular Biology Databases. Journal of
Computational Biology, 2(4):537-538.
Migimatsu H. and Fujibuchi W. (1996). Version 2 of DBGET. In How to Use
DBGET/LinkDB. <>
Papazoglou M.P., Laufmann S. and Sellis T.K. (1992). An Organizational Framework for
Cooperating Intelligent Information Systems. International Journal of Intelligent and
Cooperative Information Systems, 1(1):169-203.
Parker K.C., Bednarek M.A. and Coligan J.E. (1994). Scheme for ranking potential HLA-
A2 binding peptides based on independent binding of individual peptide side-chains.
Journal of Immunology, 152:163-175.
Piatetsky-Shapiro G. (1991). Knowledge Discovery in Real Databases: A Report on the
IJCAI-89 Workshop. AI Magazine, 11(5):68-70.
Quinlan J.R. (1986). Induction of decision trees. Machine Learning, 1:81-106.
Rammensee H.G., Falk K. and Rotzschke O. (1993). Peptides naturally presented by
MHC class I molecules. Annual Review of Immunology, 11:213-244.
Rammensee H.G., Friede T. and Stevanovic S. (1995). MHC ligands and peptide motifs:
first listing. Immunogenetics, 41:178-228.
Rawlings C.J. and Searls D.B. (1997). Computational Gene Discovery and Human
Disease. Current Opinion in Genetics and Development, 7:416-423.
Ritter O., Kocab P., Senger M., Wolf D. and Suhai S. (1994). Prototype Implementation
of the Integrated Genomic Database. Computers and Biomedical Research, 27(2):97-
Salazar-Onfray F., Nakazawa T., Chhajlani V., Petersson M., Karre K., Masucci G.,
Celis E., Sette A., Southwood S., Appella E. and Kiessling R. (1997). Synthetic
peptides derived from the melanocyte-stimulating hormone receptor MC1R can stimulate
HLA-A2-restricted cytotoxic T lymphocytes that recognize naturally processed peptides
on human melanoma cells. Cancer Research, 57(19):4348-4355.
Salzberg S. (1995). Locating protein coding regions in human DNA using a decision tree
algorithm. Journal of Computational Biology, 2(3):473-485.
Schuler G.D., Epstein J.A., Ohkawa H. and Kans J.A. (1996). Entrez: molecular biology
database and retrieval system. Methods in Enzymology, 266:141-162.
Shoudai T., Lappe M., Miyano S., Shinohara A., Okazaki T., Arikawa S., Uchida T.,
Shimozono S., Shinohara T. and Kuhara S. (1995). BONSAI garden: parallel
knowledge discovery system for amino acid sequences. ISMB, 3:359-366.
Sieburg H.B., Baray C. and Kunzelman K.S. (1993). Testing HIV molecular biology in in
silico physiologies. ISMB, 1:354-361.
Silberschatz A. and Tuzhilin A. (1997). What makes patterns interesting in knowledge
discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8(6):970-
Stanfill C. and Waltz D. (1986). Toward memory-based reasoning. Communications of the
ACM, 29(12):1213-1228.
Stranieri A., Zeleznikow J., Gawler M. and Lewis B. (1998). A hybrid-neural approach to
the automation of legal reasoning in the discretionary domain of family law in Australia.
Artificial Intelligence and Law (in press).
Stormo G.D., Schneider T.D., Gold L. and Ehrenfeucht A. (1982). Use of 'Perceptron'
algorithm to distinguish translational initiation in E. coli. Nucleic Acids Research,
Swets J.A. (1988). Measuring the accuracy of diagnostic systems. Science, 240:1285-
Sycara K. (1990). Negotiation planning: An AI approach. European Journal of Operations
Research, 46:216-234.
Tatusov R.L., Koonin E. and Lipman D.J. (1997). A genomic perspective on Protein
Families. Science, 278:631-637.
Travers P.J. Histo. <>
Uberbacher E.C., Xu Y. and Mural R.J. (1996). Discovering and understanding genes in
human DNA sequence using GRAIL. Methods in Enzymology, 266:259-281.
van Elsas A., van der Burg S.H., van der Minne C.E., Borghi M., Mourer J.S., Melief
C.J. and Schrier P.I. (1996). Peptide-pulsed dendritic cells induce tumoricidal cytotoxic
T lymphocytes from healthy donors against stably HLA-A*0201-binding peptides from
the Melan-A/MART-1 self antigen. European Journal of Immunology, 26(8):1683-
Weiss S.M. and Kulikowski C.A. (1991). Computer systems that learn. Morgan Kaufman
Publishers. San Mateo, California.
Yu C.H. (1994). Abduction? Deduction? Induction? Is there a logic of exploratory data
analysis? The Annual Meeting of American Educational Research Association, New
Orleans, Louisiana, April 1994.
< _of_EDA.html>.
Zadeh L.A. (1965). Fuzzy Sets. Information and Control, 8:338-353.
Zeleznikow J. and Hunter D. (1994). Building Intelligent Legal Information Systems:
Knowledge Representation and Reasoning in Law. Kluwer Computer/Law Series 13.
Zeleznikow J., Vossos G. and Hunter, D. (1994). The IKBALS project: Multimodal
reasoning in legal knowledge based systems. Artificial Intelligence and Law, 2(3):169-