Submitted to: Knowledge Acquisition Workshop 1999

Top-down Design and Construction of Knowledge-Based Systems with Manual and Inductive Techniques


Floor Verdenius
ATO-DLO
PO Box 17
6700 AA Wageningen
The Netherlands
F.Verdenius@ato.dlo.nl

Maarten W. van Someren
Department of Social Science Informatics
University of Amsterdam
Roeterstraat 15
1018 WB Amsterdam
The Netherlands
Maarten@swi.psy.uva.nl


ABSTRACT

In this paper we present the outline of a method for planning the design and construction of knowledge-based systems that combines a divide-and-conquer approach, as commonly used in knowledge acquisition and software engineering, with the use of inductive techniques. Decomposition of a knowledge acquisition problem into sub-problems is guided by the expected costs and benefits of applying elicitation or induction to acquire the knowledge for part of the target knowledge. The method is illustrated with a rational reconstruction of a knowledge acquisition process that involved inductive techniques.

Keywords: Inductive Techniques, Knowledge Acquisition, Learning Goals, Decomposition



1. Introduction

Knowledge acquisition involves the formalisation of (human) knowledge about a certain expert task. The aim is to build a system that can execute this task with a performance that is comparable to the human expert. The knowledge on which a system is based can be acquired in different ways. The classical way is to elicit knowledge from a human expert and formalise this in an operational language, e.g. using an expert system shell. This is often extended to acquiring knowledge not only from a single human expert but also from several experts with complementary expertise, and from relatively unstructured information such as textbooks and manuals. A different approach, which we shall call the "inductive approach", is based on induction from observations or machine learning. Expert performance is sampled or observations in a domain are collected, and inductive methods are used to construct a system that performs the task of the human expert or that can make predictions about a domain.

Both the knowledge acquisition and the inductive approach have their strengths and weaknesses. Consequently, for many problems a pure approach is not optimal. Here we briefly review the separate approaches and then mixed approaches.

Knowledge Acquisition

Many design problems in knowledge-based system construction are characterised by the availability of a number of knowledge sources that can be used to construct a system. Examples of sources are documents, human expertise on related tasks, collections of observations or even existing knowledge in computational form.

If a domain involves a complex relation between problem data and solutions then direct knowledge elicitation is not effective: it will lead to questions to a domain expert that are too global and will therefore not result in useful knowledge. For example, if only the possible problems and solutions are known, a knowledge engineer can only ask a very general question like:

How do you find solutions like S from problem data like P?

For a human expert such general questions are hard to answer for a complex domain. If possible problems and solutions and something about intermediate reasoning steps are known, then more specific questions can be asked.

Research in Knowledge Acquisition has led to the formulation of predefined methods and general conceptual models or ontologies. An ontology is an abstract description of concepts that play a role in problem solving in a particular domain. If a method or ontology is found to fit a particular knowledge acquisition problem, then this can act as a basis for a dialogue between knowledge engineer and expert. These methods and ontologies serve two important functions. They provide a common language that can be used to phrase more specific questions to the expert, and they make it possible to decompose the knowledge acquisition problem into sub-problems, a form of divide-and-conquer.

One difficulty with this approach is the selection of an appropriate method and ontology. The range of methods and ontologies is likely to be very large. Another problem is that this approach does not take into account the economic aspect of the process: once a method is selected it may turn out to be very expensive to acquire the knowledge for each component of the method. Below we give a method for design and construction of knowledge-based systems that optimises the use of available resources.

The Inductive Approach

Machine learning technology gives the prospect of (partially) automating the construction of knowledge-based systems. There are several ways to apply machine learning techniques to knowledge acquisition. The most direct approach is to collect a set of examples of problems with solutions that are provided or approved by a human expert, and to apply an induction technique to automatically construct a knowledge-based system. This requires no intermediate knowledge. In some cases, such data are easy to obtain, but for many applications obtaining data is expensive. If the task is difficult then the underlying relation may be rather complex. In that case many data are needed to acquire adequate knowledge, which makes this option expensive.

Other approaches focus on refining or debugging knowledge that was acquired manually (e.g. Shapiro, 1982; Ginsberg, 1988; Craw and Sleeman, 1990). Either approach can be improved by using domain-specific prior knowledge, for example in the form of descriptions of rules that are to be learned.

Like the elicitation-based approach, direct application of an inductive approach also encounters problems if the knowledge acquisition problem is complex. One complication is that uncovering a complex structure will require many data. Especially if measurements are noisy or the structure in the domain is probabilistic, many data are needed to obtain reliable results. Because for many applications data are difficult and expensive to acquire, and because available explicit knowledge can be exploited, a straightforward inductive approach is often not optimal. Here again we find the need for a structured and economic approach to the acquisition problem. Decomposing the learning problem and learning components separately may reduce the number of examples that are needed (e.g. Shapiro, 1987).

In our method we integrate the elicitation-based approach (including the use of predefined methods and ontologies) with the inductive approach, and we show how optimal use can be made of available sources of knowledge using systematic decomposition of the learning task.

Combining Knowledge Acquisition and Induction

Several authors have presented approaches and techniques for combining inductive techniques and knowledge acquisition. Morik et al. (1993) introduced the principle of balanced co-operation: divide the tasks between human and computer to optimise the combination. This was elaborated in the MOBAL system (Morik et al., 1993), which allowed the user access to all data and generalisations, included meta-rules as constraints on possible generalisations, and supported automated refinement of knowledge that was explicitly entered by the user (see also Craw and Sleeman, 1990; Ginsberg, 1988; Aben and van Someren, 1990) or other forms of constraints on possible generalisations. Other approaches enable the user to visualise aspects of the data to select an appropriate analysis technique (e.g. Kohavi, Sommerfield and Dougherty, 1997). However, these approaches do not address one of the key issues in knowledge acquisition: decomposition of the acquisition problem. They assume acquisition problems that may be large in terms of the number of variables, but that can actually be addressed as a single problem without decomposition. Many realistic problems require a divide-and-conquer approach and combined use of different acquisition methods. In this paper we present an approach to such problems that incorporates principles of knowledge acquisition and of induction and that is based on divide-and-conquer and economy.

This paper is organised as follows: section 2 briefly introduces the MeDIA framework for designing inductive applications. Section 3 illustrates the application of the framework by detailing the design process of a fruit treatment planning system. In section 4 we discuss the implications and applicability of this method, and discuss some directions for further work.



2. MeDIA: A Method for Design of Inductive Applications

Knowledge acquisition and machine learning differ in another respect, which creates a problem with combining them. Acquisition based on elicitation starts with a phase in which informal representations are used, while machine learning projects often start with a large set of data. Our goal is to define a methodology that covers the entire process, from specifying the problem and identifying available sources to a running knowledge-based system. At which point should decisions about the representation and the format of data be taken? At which point should a tool for induction or elicitation be selected?




Figure 1: Overview of the MeDIA model

In line with earlier work (Verdenius and Engels, 1997) we view the development of inductive systems as proceeding in three hierarchical levels according to the Method for Designing Inductive Applications (the MeDIA model, see Figure 1); within these levels, a total of six activities are located (see Table 1):

1. Application Level
   a. Requirements Definition: deriving, from the problem owner and/or potential user, the requirements for the application to deliver
   b. Source Identification: available data and knowledge resources that are relevant for the current domain are identified
   c. Acquisition planning: a resource-directed decomposition process, to be detailed in this paper
2. Analysis Level
   a. Data analysis: deriving an explicit description of data characteristics that are required to select and tune inductive techniques
   b. Technique selection: selecting the appropriate technique for implementing each sub-task
3. Technique Level
   a. Technique implementation: setting parameters of the selected technique, and (if required) delivering a tuned model

The levels are as much as possible performed sequentially. Iteration of design steps may however still occur at two locations:

- between acquisition planning and data analysis, and
- between technique selection and technique implementation.
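As a minimal sketch, the level sequence and its two iteration points can be expressed as a simple control loop. The `perform` callback and the control logic below are our own illustrative assumptions, not part of MeDIA itself; the activity names follow the list above.

```python
# Sketch of the MeDIA activity sequence with its two permitted iteration
# loops: data analysis may send the designer back to acquisition planning,
# and technique implementation back to technique selection.

ACTIVITIES = [
    "requirements_definition",
    "source_identification",
    "acquisition_planning",
    "data_analysis",
    "technique_selection",
    "technique_implementation",
]

# The only backward jumps the model allows (assumed encoding).
ALLOWED_JUMPS = {
    "data_analysis": "acquisition_planning",
    "technique_implementation": "technique_selection",
}

def run_media(perform):
    """Execute activities in order; perform(activity) returns True on
    success and False when the step's assumptions proved inaccurate."""
    i, trace = 0, []
    while i < len(ACTIVITIES):
        act = ACTIVITIES[i]
        trace.append(act)
        if perform(act) or act not in ALLOWED_JUMPS:
            i += 1  # proceed sequentially
        else:
            i = ACTIVITIES.index(ALLOWED_JUMPS[act])  # iterate back
    return trace
```

For example, a failure during data analysis replays acquisition planning before the process continues to the analysis and technique levels.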

A knowledge base of resources is available for acquisition planning. In this knowledge base, the input and output of resources are defined. Moreover, knowledge is available on whether the knowledge for a component of the target system can be acquired from a given resource by means of inductive techniques.

When implementing a technique, the assumptions used in technique selection may prove inaccurate: the selected technique proves inadequate for performing the task, in spite of earlier indications. In that case, a new technique selection has to be made.

For performing the acquisition planning a divide-and-conquer approach is proposed, configured for optimal use of available sources of knowledge. In both software engineering (e.g. Sommerville, 1995) and knowledge engineering (e.g. Schreiber, 1993) divide-and-conquer approaches are standard. Many authors have defined languages that support top-down development (e.g. Schreiber, 1993; Terpstra, van Heijst, Wielinga and Shadbolt, 1993). However, most methods are not very specific about how to reduce a complex problem to simpler sub-problems, and if presented, decomposition choices are mainly motivated by computational efficiency or design modularity.





Table 1: MeDIA activities and their input and output.

MeDIA activity | Input | Output | Knowledge Base
Requirements Definition | - | Functional Requirements; Non-functional Requirements | -
Source Identification | - | Domain Ontology; Knowledge Sources; Data (source) Definition | -
Acquisition planning | Functional Requirements; Non-functional Requirements; Domain Ontology; Knowledge Sources; Data (source) Definition | Task Decomposition | Primitive Task KB
Data analysis and acquisition | Data (source) Definition; Domain Ontology; Task Decomposition | Data Sets (to be collected); Data Source Description | -
Technique selection | Data Sets; Data Source Description; Domain Ontology; Task Decomposition; Non-functional Requirements | Model Representation; Search Method; Estimation Criterion | Task Technique KB
Technique implementation | Model Representation; Search Method; Estimation Criterion; Domain Ontology; Non-functional Requirements | Training Algorithm; Performance Algorithm; Model | Technique tuning KB


In this paper, we describe a method for decomposing knowledge acquisition problems into sub-problems that can be solved better than the original problem. Here, the design choices for decomposing a performance task are strongly motivated by the possibility to acquire an adequate component against reasonable costs. The main differences with standard approaches for the development of knowledge systems are that:

1. a wide range of types of knowledge sources and acquisition methods (e.g. induction, elicitation, decomposition) are considered and evaluated as possible sources of knowledge for a specific component of the target system. The options are evaluated on the basis of cost and accuracy estimates

2. the decomposition process is also directed by the structure of the available knowledge and by economy

In data mining, decomposition of a problem is normally based on the requirements for data mining, like data selection, data cleaning and data conversion (see for example Engels, 1997; Fayyad et al., 1996). Here we base decomposition on a deconstruction of the required functionality into functionalities of reduced size and complexity, to minimise the acquisition costs.

Compared to knowledge acquisition methods such as CommonKADS (Schreiber et al., 1999) our approach defines several extensions. First, it considers data as a potential source for acquiring knowledge. In CommonKADS, human experts are considered the main source of knowledge, and most tools and techniques focus on extracting knowledge from human experts. Second, our approach explicitly takes the costs and benefits of acquisition into account. Finally, we explicitly consider induction as an acquisition method when decomposing tasks. Decomposition of a task into subtasks can take place if for the subtasks data-types are available in the domain ontology. Moreover, at least one primitive task needs to have the same input- or output-structure as the subtask to decompose. When this is not possible, or not preferred by the user, manual decomposition has to be performed by the user.
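The decomposability condition just stated (subtask data-types present in the domain ontology, and at least one subtask sharing the input- or output-structure of a known primitive task) can be sketched as a simple check. The set-based task signatures below are an illustrative assumption, not a representation defined in the paper.

```python
# Sketch of the decomposability test: a task may be decomposed into
# subtasks only if (a) every data type the subtasks use occurs in the
# domain ontology, and (b) at least one subtask matches the input or
# output structure of a primitive task in the knowledge base.

def can_decompose(subtasks, ontology, primitive_tasks):
    """subtasks / primitive_tasks: iterables of (inputs, outputs) pairs,
    each a frozenset of data-type names; ontology: set of data-type names."""
    types_known = all(ins <= ontology and outs <= ontology
                      for ins, outs in subtasks)
    shares_structure = any(ins == p_ins or outs == p_outs
                           for ins, outs in subtasks
                           for p_ins, p_outs in primitive_tasks)
    return types_known and shares_structure
```

If the check fails, the decomposition falls back to the user, as described above.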

Compared to other approaches for designing inductive applications (e.g. Brodley and Smyth, 1997), our model explicitly separates the design at a conceptual level from the actual implementation. At the application level, implementation issues are not assessed.

Requirement Definition and Source Identification

The first steps are to define the (functional and non-functional) system requirements, to describe the domain ontology and to identify and collect resources that can supply knowledge for the target system. With functional requirements, we refer to the input-output mapping of the problem to be solved. This is typically a (formal or verbal) definition of available and demanded data items and their semantics. Non-functional requirements refer to all aspects of the solution that are not of relevance for the mapping, but that still have an effect on the acceptance and satisfaction of the problem owner about the offered solution. Examples of non-functional requirements are the hardware and/or software platform, the response time or the preferred layout of the user interface. The domain ontology is the set of relevant concepts that can be used to express problems and solutions. In CommonKADS (Schreiber et al., 1999), the notion of domain ontology is further defined.

The knowledge resources are collected in the next step. This is done largely on the basis of the
domain ontology and the explicit information in the requirement definition.

Acquisition planning

The next step is to plan the use of the available resources for the actual knowledge acquisition. The result of this step is a plan that specifies which resources are to be used to acquire the knowledge for the components of a target system. The structure in which these components are connected to form the final system is also produced as output.

In general, a knowledge acquisition problem can be solved in three ways:

- by direct elicitation of the knowledge from a source, e.g. a human expert or a document,
- by induction from observations, or
- by further decomposition into sub-problems.

Decomposition continues until the sub-problems can be mapped onto acquirable knowledge resources: a set of resources is found, connectable in a data-flow structure, from which knowledge can be acquired to perform the task of the target system. When a task is directly mapped onto a source that is not a decomposition source, it is referred to as a primitive task. Our notions of task and primitive task are close to the notions of task and inference as used in knowledge acquisition (e.g. CommonKADS, Schreiber et al., 1999).
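An acquisition plan of this kind can be represented as a tree in which internal nodes are decompositions and leaves are primitive tasks mapped onto a resource. The class layout below is a minimal sketch under that assumption; the fruit-domain task and resource names are used only as illustration.

```python
# Sketch of an acquisition plan as a tree: leaves are primitive tasks
# bound to a resource via elicitation or induction; internal nodes are
# decompositions whose children are acquired separately and reconnected
# in a data-flow structure.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlanNode:
    task: str
    method: str              # "elicitation", "induction" or "decomposition"
    resource: str = ""       # the source a primitive task is mapped onto
    children: List["PlanNode"] = field(default_factory=list)

    def primitive_tasks(self):
        """All leaves, i.e. the tasks acquired directly from a resource."""
        if self.method != "decomposition":
            return [self]
        return [leaf for c in self.children for leaf in c.primitive_tasks()]

# Illustrative plan with two primitive tasks.
plan = PlanNode("plan treatment", "decomposition", children=[
    PlanNode("estimate batch quality", "induction", "sample products"),
    PlanNode("specify recipe", "elicitation", "domain expert"),
])
```

Recursing over the leaves yields exactly the primitive tasks for which an acquisition technique must later be chosen.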

Each of the options direct elicitation, induction from observations and decomposition involves further choices. The main criterion for these choices is acquisition economy: the balance between the expected costs of implementing the option and the expected accuracy of the resulting knowledge (O'Hara and Shadbolt, 1996). In case of direct elicitation, the accuracy depends on the quality of the sources (e.g. human experts) and of the communication process. In case of induction from observation, the accuracy of the result will depend on the availability of reliable data, the complexity of the actual relations and on knowledge about the type of relation that is to be induced. If many reliable data are available, if the underlying relation is not very complex and if the type of function is known, then induction is likely to be successful. Otherwise there is a risk of constructing knowledge that is incorrect.

The idea of decomposition into sub-problems is based directly on the top-down approach of stepwise refinement (e.g. Wirth, 1971, 1976). The main difference is that in software engineering the main principle that guides decomposition is minimising the complexity of the resulting systems, thereby supporting activities like debugging, maintaining and re-using the system. In knowledge acquisition the acquisition of the knowledge is usually the main factor that determines the costs and benefits, and therefore this guides the decomposition. Decomposition is useful if (cheap and accurate) sources of knowledge are available for sub-tasks of the overall knowledge acquisition task but not for the overall task. For example, there may be abundant data for one sub-problem and a communicative expert for another sub-problem, but not for the problem as a whole. This is a reason to split the knowledge acquisition problem into sub-problems that are then acquired separately. Another situation where a decomposition can be cheaper and give more accurate results than a single-step inductive approach is when there is no prior knowledge to bias induction on the overall problem.

Acquisition Economy

To decide if decomposition is a good idea, we compare the expected costs and benefits of acquisition with and without decomposition. The benefit of an acquisition operation (elicitation based or induction based) is the accuracy of the resulting knowledge. The costs of the acquisition process depend on the acquisition process. In case of elicitation, this involves time of the expert and of the knowledge engineer, equipment, etc. In case of an inductive approach this involves the costs of collecting and cleaning data and of applying an induction system to the result. If we decompose the acquisition problem, the costs and benefits are simply the sum of those of the sub-problems. So we get for elicitation/induction:



EG(elicitation/induction) = w1 * EA(elicitation/induction) - EC(elicitation/induction)

and for decomposition:

EG(compound operation) = w2 * min_i(EA(operation_i)) - Σ_i EC(operation_i)

with:

EG = expected gain
EC = expected costs
EA = expected accuracy



Here the weight parameters w1 and w2 indicate the importance of accuracy relative to the real costs of acquisition; if the accuracy is translated as annual benefits, w1 and w2 are related to return on investment. The expected accuracy of a compound acquisition is derived from the minimal accuracy of its components, which is a pessimistic estimate. As we argued above, in some cases elicitation is almost impossible because the expert cannot answer very global questions. This means that the costs are high and the accuracy of the knowledge is 0. In machine learning applications the costs of actually running a system are usually rather small compared to other costs, such as designing the target system, collecting data and tuning the induction tool, so this could be left out.
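The two gain estimates above can be written out directly. Here `w` stands for the weight w1 or w2, and an operation is represented as an (expected accuracy, expected cost) pair; this is a sketch of the formulas, not the authors' implementation.

```python
# The expected-gain formulas written out: a single elicitation or
# induction step, and a compound (decomposed) operation whose accuracy
# is pessimistically the minimum of its components while costs add up.

def eg_single(w, ea, ec):
    """Expected gain of a single elicitation or induction step."""
    return w * ea - ec

def eg_compound(w, operations):
    """Expected gain of a decomposition; operations is a list of
    (expected_accuracy, expected_cost) pairs, one per sub-problem."""
    return w * min(ea for ea, _ in operations) - sum(ec for _, ec in operations)
```

With w = 3 (the ROI weight used in the example of section 3), a decomposition whose components have minimal accuracy 0.8 and total cost 2 gives an expected gain of 3 * 0.8 - 2 = 0.4; the individual component figures used in any such call are illustrative.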

The Decomposition Process

A decomposition is constructed by

- inserting a source description that is connected to one or more types of data in the current goal
- adding or deleting a connection in the data-flow structure
- inserting a method (a sub-procedure) for a component in the data-flow structure

The method for decomposing a knowledge acquisition problem is based on the idea that the reasons for decomposing a knowledge acquisition problem that we gave above are applied in the order given above. The method is a form of best-first search that uses expected costs and benefits to evaluate candidate decompositions.

In case of further decomposition, the method is applied recursively to the sub-problems. The algorithm is depicted as algorithm 1. If costs and accuracies cannot be estimated, the alternative is to perform a pilot study to assess the costs and expected accuracy. In the context of elicitation this amounts to performing elicitation on part of the task and evaluating the result. In the context of induction it amounts to comparative studies by cross validation. The main goal of such studies is to select the best techniques in terms of the above expressed balance between costs and accuracy.
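The best-first search over candidate decompositions can be sketched with a priority queue ordered by expected gain. The `expand` operator (which would apply the three construction steps listed above) and the `gain` estimator are assumed callbacks, and the iteration bound is an illustrative safeguard; none of this is the paper's algorithm 1 verbatim.

```python
# Sketch of best-first search over acquisition plans: candidate plans are
# kept in a priority queue ordered by expected gain; the best-scoring plan
# seen so far is returned when the frontier is exhausted or the step
# budget runs out.
import heapq

def best_first_plan(initial_plan, expand, gain, max_steps=100):
    # heapq is a min-heap, so push negated gains to pop the best plan first;
    # the tiebreak counter keeps plans themselves out of comparisons.
    frontier = [(-gain(initial_plan), 0, initial_plan)]
    best, tiebreak = initial_plan, 0
    for _ in range(max_steps):
        if not frontier:
            break
        neg_g, _, plan = heapq.heappop(frontier)
        if -neg_g > gain(best):
            best = plan
        for candidate in expand(plan):
            tiebreak += 1
            heapq.heappush(frontier, (-gain(candidate), tiebreak, candidate))
    return best
```

Recursion over sub-problems would happen inside `expand`, which refines a plan by inserting sources, editing connections or inserting methods.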

Data Analysis



In some cases, the resource can simply be included in the target system but in most cases the
knowledge must be "extracted'' from the resource, using an acquisition technique.

The
acquisition plan specifies the resource to be used but not the acquisition technique. Data analysis
applies to situations in which acquisition means induction from data. The purpose of this step is
to measure properties of the data that are relevant f
or selecting a technique. Selecting and
applying a technique will be the final step. Currently there is no comprehensive and practically
applicable method for this. As observed by Verdenius and van Someren (1997) many application
projects that use inductiv
e techniques do reason about selection of a technique. Often, designers
only consider one single induction technique. If necessary, the problem is transformed to make it
suitable for the chosen technique.

Several studies report experiments on the relation between properties of the data and the performance of learning systems. The ESPRIT project Machine Learning Toolbox (e.g. Kodratoff et al., 1994) has gathered heuristics for technique selection for classification. The heuristics focus on several aspects of the learning problem. Based on descriptions of various aspects of the learning process, the user is provided with a number of alternative techniques that can be applied to the learning task at hand. Relevant aspects include, besides the aspects of the data that we mentioned above, the nature of the learning task, uncertainty, the availability of background knowledge, and user interaction. The heuristics are implemented in an automated tool for user support (Craw et al., 1994). The STATLOG project (Michie et al., 1994) provides an experimental comparison of more than twenty different classification techniques on some thirty different data-sets. The analysis of the results indicates strong and weak points of the different techniques. Moreover, additional analysis of the STATLOG results (Brazdil et al., 1994) generalizes over the results in an attempt to formulate comprehensive heuristics.

In general, analysis consists of selecting a form for the hypothesis and transforming the data into a suitable format so that an appropriate learning method can be applied. Langley (1996) gives an extensive description of various forms of hypotheses and corresponding learning methods, but less is known about which properties of a dataset indicate which hypothesis and which learning method are optimal. Currently this problem is handled by experimentally trying out methods and evaluating them by cross validation. We expect that better understanding of properties of the dataset that discriminate between different classes of hypotheses will enable more rational selection of the form of the hypothesis.
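The try-and-evaluate practice described here can be sketched as a small k-fold cross-validation harness. The two toy techniques below (a majority-class rule and a 1-nearest-neighbour rule) are illustrative stand-ins for real induction tools, not techniques from the paper.

```python
# Sketch of technique selection by k-fold cross validation: each candidate
# technique is fitted on k-1 folds and scored on the held-out fold, and
# the technique with the best mean accuracy is selected.
import random
from collections import Counter

def cross_validate(data, fit, k=5, seed=0):
    """Mean accuracy of `fit` over k folds; data is a list of (x, y),
    and fit(train) must return a predict(x) function."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [d for j, f in enumerate(folds) if j != i for d in f]
        predict = fit(train)
        scores.append(sum(predict(x) == y for x, y in folds[i]) / len(folds[i]))
    return sum(scores) / k

def fit_majority(train):
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority

def fit_1nn(train):
    return lambda x: min(train, key=lambda d: abs(d[0] - x))[1]

def select_technique(data, techniques, k=5):
    """techniques: dict mapping a name to a fit function."""
    return max(techniques, key=lambda name: cross_validate(data, techniques[name], k))
```

In the same way, real candidates produced during technique selection could be compared on the data sets gathered for a primitive task.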

3. Example: The Product Treatment Support System



We illustrate the method with the example introduced above, on planning systematic treatments for fruit ripening (Verdenius, 1996). This problem involved both knowledge acquisition and machine learning. The project was not run with this method in mind, but our description can be viewed as a post hoc design rationale. The initial acquisition goal is:

construct a knowledge system that takes as input information on a batch of fruits that arrives from abroad and that produces a recipe for storing the fruits

Figure 2: Description of the input and output that defines the learning problem





Requirement Definition and Source Identification

Figure 2 shows the overall learning problem. The outcome of the task, that is, the solution to this planning problem, is a treatment recipe. A recipe is a prescription of the values c_i,j for a set of relevant treatment conditions c_i that applies to a specific time interval. The time interval is subdivided in fixed-duration time-slices j. Storage conditions include attributes like temperature, relative humidity and ethylene concentration; the relevance of conditions is determined by the product type.


Figure 3:

Part of the ontology

The first step is to identify available sources of knowledge for this task. Figure 3 illustrates part of the domain ontology as developed for this application. This should be read as a schema that can be instantiated with specific knowledge. The following data about a batch of fruit are available (grouped):

- batch data, such as origin, product cultivar etc.
- commercial data, mainly the required due date of the product treatment
- product data, being a number of values for attributes such as colour, shape, firmness, weight etc., describing per individual product in a batch various quality aspects at the start of the recipe. It is assumed here that a fixed final quality is delivered for all recipes.

Table 2 lists some of the sources that are available in the fruit storage planning domain. These sources cover the application of machine learning, knowledge elicitation from experts and extraction from documents. The sources include information that is not part of the original problem. For example, the source Sample Products refers to Batch (of Fruit), Administrative data and Sampling Instructions, and it delivers Quality data. The latter two are not mentioned in the original problem statement.

In this stage, only the resources are identified, but no effort is made to extract the actual knowledge. At this level the sources are not bound to any of the acquisition means. Available resources may not be used, and actual acquisition of the knowledge is postponed until a complete plan is available. The actual knowledge is to be obtained by applying an acquisition technique to the resource: a human expert, a document, a set of data or an existing system. At this point, no choice for a technique is made either, because this will depend on details of the resource that are not relevant at this stage of the design process.

Table 2: Some sources of knowledge for the example

Source ID | Determined by | Conclusions | Cost | Acc | Constraints and <comments>
Estimate Quality 1 | [p_i,x]_n | E(p_batch,x) | 1.0 | 0.95 | n in [60, 200]
Estimate Quality 2 | [p_i,x]_n | E(p_batch,x) | 0.2 | 0.92 | n in [5, 10], p_i,x in [p_batch]
Select product | [p_i,x]_n, m_selection | [p_i,x]_m | 0.4 | 0.98 | n in [60, 200], m in [5, 15], l = 1...m: p_l,j in [p_batch,j]
Specify Recipe | E(p_batch,l), due date, origin, cultivar, m_specification | E(rc_batch,i) | 0.6 | 0.8 |
Design Recipe | rc_batch,i, due date, planstrategy, m_design | c_i,j | 0.4 | 0.9 | i = 1...recipe duration
Postulate Recipe 2 | * | c_i,j | 1.0 | 0.1 | <induce from all available data>
Postulate Recipe 3 | E(p_batch,x), origin, cultivar, history | c_i,j | 0.9 | 0.15 | <induce from selected and pre-processed data>
Postulate Requirements | [p_i,x]_n, history, origin, cultivar, due date | E(rc_batch,i) | 0.8 | 0.6 |
Postulate Recipe 1 | due date | c_i,j | 0.0 | 0.2 | <standard recipes>
Adapt recipe | c_i,j, m_adapt, E(p_batch,i) | c_i,j | 0.2 | 0.4 | <apply heuristic in adapt model>

Acquisition is guided by economic principles, and therefore an estimate must be made of the costs and the expected value (or accuracy) of the knowledge that can be acquired from a resource. Accuracy ought to be estimated independently of the technique that will be used.

Some sources may have no costs if the knowledge already exists. Moreover, note that for a (sub)problem and (sub)solution combination there may be more than one way to acquire the knowledge. For example, it may be possible to directly acquire knowledge that relates External Conditions and Planning Destination to a Detailed Recipe.

The costs of using these resources and the expected accuracies were estimated using rules of thumb. For example, Specify Recipe involves finding a detailed recipe specification from product data, batch data and the required due date of the batch. The cost is estimated from the availability of resources and the complexity of the task. The size of the space defined by the properties in Specify Recipe gives an indication of the number of data that must be acquired to obtain a certain accuracy in case of an inductive approach. This in turn gives an estimate of the costs. The relation is likely to be complex, and this suggests that many cases are needed. Costs and accuracy of an elicitation approach are estimated from the time that it takes to acquire the expertise for a task. If this is unknown then a rough estimate is made based on the complexity, as for the inductive approach. The accuracy is estimated from a pilot experiment.

Acquisition Planning

Figure 4 shows the final decomposition and Table 3 gives an overview of the sources and techniques that were actually used to acquire knowledge for the various components. Below we reconstruct the process that led to this decomposition and choice of acquisition methods.


Figure 4: Decomposition of the knowledge acquisition problem

The estimated costs of acquiring the complete system by elicitation are very high (because there is no expert), and the same is true for induction. Without further analysis, there are about 15-40 input variables and between 12 and 52 output variables (depending on the duration of the storage). The relation is therefore likely to be very complex and it would take many data to find an initial model, if it is possible at all. We estimate the costs and accuracies of single-step acquisition (elicitation or induction). The estimates are in Table 4. The last column gives the expected gain using a value 3 for the weight values w_i, as a reasonable value for the ROI (Return On Investment).

We now consider decomposition. The available sources and the causal and temporal structures define a number of possible decompositions of the initial knowledge acquisition problem. There are many possibilities; here we describe some of them with estimates of the expected accuracy and acquisition costs.

Decomposition 1

From the initial problem of Figure 3, the first decomposition step is to abstract from the quality data on individual products to the quality of the batch. This requires a number of measurements. Taking the average of a number of measurements requires a large product sample (see Verdenius, 1996). The resulting decomposition is depicted in Figure 5. The expected gain of this decomposition is: 3 * 0.35 - 1.6 = -0.55.

Figure 5: Decomposition 1



Decomposition 2

The next step in the decomposition aims at overcoming the weakest point in decomposition 1: Postulate Recipe 3 has a poor cost/accuracy ratio. It can be replaced by a two-step approach, in which recipe specification is followed by recipe design. The resulting decomposition is shown in Figure 6. The expected gain of this option is: 3 * 0.8 - 2 = 0.4. This is already a non-negative outcome, but still worse than the original problem formulation (over the ROI).

Figure 6: Decomposition 2

Final Decomposition

The final decomposition is again arrived at by first identifying the weakest point in the best-so-far, and then identifying a task combination with a better pay-off. Here, it appears that estimating the product quality can be optimised by first drawing a small sample from the total data set, and using these data to estimate the quality. Due to the sample reduction the benefit increases. The resulting decomposition was shown in Figure 2. The expected gain of this is: 3 * 0.8 - 1.6 = 0.8.
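The comparison of decompositions reduces to one formula, gain = w * accuracy - cost with w = 3. The sketch below recomputes the three expected gains reported above; only the formula and the accuracy/cost figures come from the text, the code itself is our illustration.

```python
# Expected gain of an acquisition option: gain = w * accuracy - cost,
# with w = 3 as the return-on-investment weight used in the paper.

W = 3  # ROI weight

def expected_gain(accuracy, cost, w=W):
    return w * accuracy - cost

# (accuracy, cost) estimates for the three decompositions discussed:
decompositions = {
    "decomposition 1":     (0.35, 1.6),
    "decomposition 2":     (0.80, 2.0),
    "final decomposition": (0.80, 1.6),
}

for name, (acc, cost) in decompositions.items():
    print(f"{name}: gain = {expected_gain(acc, cost):+.2f}")
```

The search over decompositions keeps the option with the highest expected gain; here the final decomposition wins with 0.8.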

Acquiring Components

We can now concentrate on the actual acquisition of the knowledge for the components. The first component, Select Products, implements a sampling procedure. For each product, a number of easy-to-assess data items are available. Based on these items, the product is classified as being either near_batch_mean or far_from_batch_mean. This is a classification task. Historic data on the relation between product descriptors and the batch mean is available. For humans, on the other hand, considering this relation is fairly uncommon. Consequently, elicitation of knowledge from human experts is not an option (low accuracy vs. high costs). Data analysis may reveal that the underlying type of function is relatively simple, although not fully orthogonal on the data axes. Interpretability may be a (non-functional) requirement, as the resulting knowledge has to be applied by human experts in order to select fruits. In the actual planner, this component has been implemented by means of a decision rule learner. The rules are extracted and handed over to a human expert to perform the actual selection on location.
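As a rough illustration of this component, the sketch below learns a single-attribute threshold rule (1R-style) on made-up data. The actual planner used a decision rule learner whose details are not given here; the data, descriptor and learner below are our own illustrative assumptions, chosen only to show why an interpretable rule suits this task.

```python
# Minimal sketch of Select Products: learn one interpretable threshold
# rule that classifies a product as near_batch_mean or far_from_batch_mean
# from an easy-to-assess descriptor. Data and learner are illustrative.

def learn_threshold_rule(examples):
    """examples: list of (value, label) pairs.
    Returns (threshold, label_below, label_above) minimising errors."""
    best = None
    for t in sorted({v for v, _ in examples}):
        for below, above in [("near_batch_mean", "far_from_batch_mean"),
                             ("far_from_batch_mean", "near_batch_mean")]:
            errors = sum(1 for v, lab in examples
                         if (below if v <= t else above) != lab)
            if best is None or errors < best[0]:
                best = (errors, t, below, above)
    _, t, below, above = best
    return t, below, above

def classify(rule, value):
    t, below, above = rule
    return below if value <= t else above

# Illustrative historic data: (descriptor value, class)
data = [(0.1, "near_batch_mean"), (0.2, "near_batch_mean"),
        (0.3, "near_batch_mean"), (0.7, "far_from_batch_mean"),
        (0.8, "far_from_batch_mean"), (0.9, "far_from_batch_mean")]

rule = learn_threshold_rule(data)
```

A rule of this form ("if descriptor <= 0.3 then near_batch_mean") can be handed to a human expert to perform the selection on location, which is exactly the interpretability requirement mentioned above.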

The next component is the actual assessment of the batch quality. This is simply averaging the measurements. The main difference between the two available Estimate Quality sources lies in the number of (expensive) measurements that is required in the case of unselected versus selected estimation. The former requires between 60 and 200 expensive measurements; in the latter case, only 5-10 are needed. This hardly affects the accuracy, but dramatically reduces the costs.
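The trade-off can be sketched in a few lines; the quality values and the cost per measurement below are illustrative assumptions, only the 60-200 versus 5-10 measurement counts come from the text.

```python
# Sketch of the Estimate Quality trade-off: batch quality is the average
# of (expensive) individual measurements; selecting products first cuts
# the number of measurements needed at similar accuracy.
# Quality values and per-measurement cost are illustrative assumptions.

def estimate_batch_quality(measurements):
    return sum(measurements) / len(measurements)

def measurement_cost(n_measurements, cost_per_measurement=10.0):
    return n_measurements * cost_per_measurement

full_sample = [4.1, 4.3, 3.9, 4.0, 4.2, 4.1, 4.4, 3.8] * 10  # ~80 products
selected    = [4.1, 4.0, 4.2, 4.1, 3.9]                      # 5 products

print(estimate_batch_quality(full_sample), measurement_cost(len(full_sample)))
print(estimate_batch_quality(selected),    measurement_cost(len(selected)))
```

On this toy data the two estimates are close (4.10 vs. 4.06), while the measurement cost drops by more than an order of magnitude, which is the effect described above.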

For the acquisition of the recipe specification, the two options of elicitation and induction must be evaluated. Human experts are not used to specifying recipes at batch level, i.e. expertise is not available. Historic data is available for induction of the required knowledge. On the input side, 21 attributes are taken as input. The size of the output space is limited (in the actual fruit planning system, only 1 parameter was output; a maximum of 4 output values can be imagined). Based on a comparison between linear and non-linear models, a preference was developed for non-linear models. These have been implemented in the form of a neural network.
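The shape of such a component can be sketched as a one-hidden-layer network mapping 21 batch attributes to a single recipe parameter. The paper only states that a neural network with 21 inputs and 1 output was used; the architecture, synthetic data and learning rate below are illustrative assumptions, not the actual implementation.

```python
import numpy as np

# Minimal sketch of Specify Recipe as a non-linear model: a one-hidden-
# layer network, 21 inputs -> 1 output, trained by gradient descent on
# synthetic stand-in data. All details are illustrative assumptions.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 21))                    # stand-in historic data
y = np.tanh(0.3 * (X @ rng.normal(size=(21, 1))))  # synthetic target

W1 = rng.normal(scale=0.1, size=(21, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=(8, 1))
b2 = np.zeros(1)
lr = 0.05

def forward(X):
    h = np.tanh(X @ W1 + b1)          # hidden layer
    return h, h @ W2 + b2             # linear output

losses = []
for _ in range(300):
    h, pred = forward(X)
    err = pred - y
    losses.append(float(np.mean(err ** 2)))
    # backpropagation for the mean-squared-error loss
    g_pred = 2 * err / len(X)
    g_W2 = h.T @ g_pred
    g_b2 = g_pred.sum(axis=0)
    g_h = (g_pred @ W2.T) * (1 - h ** 2)   # tanh derivative
    W2 -= lr * g_W2
    b2 -= lr * g_b2
    W1 -= lr * X.T @ g_h
    b1 -= lr * g_h.sum(axis=0)
```

With a single output parameter, as in the actual fruit planning system, the training loss decreases steadily, consistent with the choice of a non-linear model over a linear one.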



4. Discussion

We presented a rational reconstruction of decisions to use machine learning in a knowledge acquisition context. Applications of machine learning to knowledge acquisition involve more than selecting and applying an appropriate induction tool. In general, knowledge or data are not, or only partially, available, and decisions must be taken on how to acquire them. Knowledge acquisition problems are often better solved using a divide-and-conquer approach that reduces the overall problem to sub-problems that can be solved by machine learning or direct elicitation. This process of divide-and-conquer is guided by estimates of the costs of the acquisition process and of the expected accuracy of the result.

In this section, we discuss the relation between the approach advocated here and some of the approaches that are of use for knowledge acquisition or induction. Finally, we discuss options for further work.

Comparison with other methods

Knowledge acquisition methods

Many existing knowledge acquisition methods rely heavily on the idea of decomposition (e.g. Terpstra, 1993; Schreiber, 1993a; Marcus, 1988). However, these methods focus on modelling languages and rarely make explicit the underlying principles that are needed for a rational application of the methods. These methods also do not cover the use of inductive techniques. Here we reconstruct the rationale behind these methods and use this to extend them towards the use of machine learning methods. We presented criteria and a method for decomposing knowledge acquisition problems into simpler sub-problems and illustrated this with a reconstruction of a real-world application. This method can be applied to inductive methods, knowledge elicitation and other manual acquisition methods alike.

In modern approaches to knowledge acquisition, especially in CommonKADS, the starting point for divide-and-conquer approaches is identified from libraries of standard models. For example, suppose that the acquisition problem is to construct a system that can design storage recipes for fruits. The knowledge engineer may decide to adopt a model from a library of methods (Breuker and Van de Velde, 1994). First, the problem is specified as before:

Input: Fruits Characteristics, Current Quality, Required Quality, Recipe Duration

Output: Storage Recipe, i.e. condition set-points for a series of time-slices

The KADS library offers the following models:

Name                    Input                                             Output
Design                  Needs and desires                                 Design solution
Configuration           Components, required structure,                   Configuration
                        constraints, requirements
Planning                Initial state, goal state, world description,     Plan
                        plan description, plan model
Assignment, scheduling  Components, resources                             Assignment
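The library lookup can be made concrete by encoding each model as a pair of role sets and scoring a hand-made mapping of the task onto them. The role vocabularies below are paraphrased from the table; the scoring function and the chosen mapping are our own illustration. As the code's comment notes, the mapping step itself is where the real difficulty lies, which is exactly the point made in the text below.

```python
# Sketch of matching a task specification against a library of task
# models, as in the CommonKADS library lookup. Role vocabularies are
# paraphrased from the table; the scoring is an illustrative assumption.

library = {
    "design":        {"inputs": {"needs and desires"},
                      "outputs": {"design solution"}},
    "configuration": {"inputs": {"components", "required structure",
                                 "constraints", "requirements"},
                      "outputs": {"configuration"}},
    "planning":      {"inputs": {"initial state", "goal state",
                                 "world description", "plan description",
                                 "plan model"},
                      "outputs": {"plan"}},
    "assignment":    {"inputs": {"components", "resources"},
                      "outputs": {"assignment"}},
}

def match_score(model, task_inputs, task_outputs):
    """Fraction of the task's roles covered by the model's roles."""
    roles = task_inputs | task_outputs
    covered = (model["inputs"] | model["outputs"]) & roles
    return len(covered) / len(roles)

# The hard part is mapping the task's own terms (Fruits Characteristics,
# Storage Recipe, ...) onto library terminology; several mappings are
# defensible. One candidate mapping:
task_inputs = {"initial state", "goal state"}
task_outputs = {"plan"}
scores = {name: match_score(m, task_inputs, task_outputs)
          for name, m in library.items()}
```

Once a mapping is fixed, scoring is trivial; the ambiguity discussed in the text lives entirely in choosing the mapping.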

It is not obvious which of these is appropriate here. Recipe Duration can be viewed as needs and desires, constraints, requirements and plan descriptions. Fruits Characteristics and Product Quality do not have an immediate counterpart in the terminology above. The Storage Recipe corresponds most closely to an assignment, although it can also be viewed as a plan, a design solution or a configuration. Although assignment and scheduling sounds like a good choice, the models for this type of task concern allocation of resources to tasks in a schedule. This does not correspond to our task. Planning is a better term. The inputs of the most general model for planning (Valente, 1995) are: initial state, goal state, world description, plan description and plan model. A plan is an ordered set of actions that starts in the initial state and ends with a state that satisfies the requirements of a goal state. The world knowledge describes general information about the world in which the actions will take place.

In our example, Fruits Characteristics and Product Quality can be viewed as "initial state". However, the storage recipe does not involve discrete states and therefore a planning process is problematic. Even when the process is somehow discretised, there are very many possibilities and the goal provides little guidance for the evaluation of intermediate states. Another problem is that, if we compare this to the available resources in Table 2, we see that the resulting model is not coherent. The Fruits Characteristics and Current Quality are not the description of the initial state parameter of the planning operators. The approach outlined in Breuker and Van de Velde does not tell us what to do now. An obvious step is to apply the whole approach recursively to the task of finding the input of the planning operators from Fruits Characteristics and Current Quality. We shall not pursue this here, but it should be noted that the planning model cannot actually be applied because of the continuous character of the operators and the process, which is not mentioned in the description of the model as a prerequisite. Moreover, the analysis process is about the same as that of our approach. This is because at this stage the data-flow structure of the available knowledge is of much more importance than the structure of the data and the knowledge. Our approach postpones the choice between discrete and continuous models until later, and only then selects a modelling technique.

Inductive Methods

Compared with inductive engineering methods, our methodology has a broader scope than most: MeDIA includes the identification of resources, takes economic factors into account, and covers the structuring of the acquisition problem. Machine learning technology plays a specific role in the overall method. A straightforward inductive approach to this problem would probably have been more expensive and less successful. The reason lies in the complexity of the relation between the "raw" data about a batch of fruits and its destination and the recipe, and in the costs of collecting data.



Further Work

The main "hole" in the methodology is the selection of a model for the hypothesis and the related data transformation. We intend to review the literature on this question and summarise the state of the art. After this, we intend to do more empirical evaluations of the methodology.



5. Conclusions

The MeDIA approach is based on a separation of planning and implementation of the knowledge acquisition process, and on a "divide and conquer" approach to the planning problem. This approach is possible if enough information about sources of knowledge is available. This information can often be obtained by heuristics and cheap measurements on the data. In knowledge acquisition, these are part of the "experience" of knowledge engineers. In machine learning and in statistical data analysis, rules of thumb and experience are used to estimate the expected accuracy of the result of applying an induction system. For example, for many statistical techniques, rules of thumb relate the number of variables, the complexity of the function to be induced and the number of data to an estimate of accuracy. The main alternative, if there is no prior knowledge, is currently a "reactive" approach: the expected accuracy of applying an operator is determined empirically by trying it out. For inductive techniques, this is done by cross-validation, resulting in an estimate of the accuracy. In knowledge elicitation, it is done by simply asking an expert to provide the knowledge. If this fails, it is concluded that decomposition is necessary. See Brodley (1995) for a method following this approach. Graner and Sleeman (1993) follow a similar approach in the context of knowledge acquisition. Their model does not include search through possible decompositions or the use of estimated costs and accuracies.
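The "reactive" accuracy estimate can be sketched as a plain k-fold cross-validation. The data and the trivial majority-class learner below are illustrative assumptions; the point is only the protocol of trying an operator out and measuring held-out accuracy.

```python
# Sketch of the reactive approach: estimate the expected accuracy of an
# induction operator empirically by k-fold cross-validation before
# deciding whether decomposition is needed. Data/learner are illustrative.

def cross_val_accuracy(data, train, predict, k=3):
    """data: list of (x, label) pairs. Returns mean held-out accuracy."""
    folds = [data[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        test = folds[i]
        train_set = [ex for j, f in enumerate(folds) if j != i for ex in f]
        model = train(train_set)
        correct = sum(1 for x, lab in test if predict(model, x) == lab)
        accs.append(correct / len(test))
    return sum(accs) / k

# Trivial learner: always predict the training set's majority class.
def train_majority(train_set):
    labels = [lab for _, lab in train_set]
    return max(set(labels), key=labels.count)

def predict_majority(model, x):
    return model

data = [(i, "a") for i in range(8)] + [(i, "b") for i in range(4)]
acc = cross_val_accuracy(data, train_majority, predict_majority, k=3)
```

If the estimated accuracy is too low relative to the cost, the operator is rejected and decomposition is considered instead, as described above.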

The method outlined here can be extended to include the expected gain of having the resulting system. This would give a more comprehensive model, including both the costs of acquisition and the costs of having and using the acquired knowledge. See van Someren et al. (1997) for a model of induction methods that includes costs of measurements and costs of errors, in the context of learning decision trees. These two models can be integrated into a single model; see for example DesJardins (1995) for a similar model for robot exploration.

The MeDIA method involves decomposition before formalisation and data analysis (except when data analysis detects the need for different types of hypotheses and thus leads to decomposition). Some heuristics for the estimation of expected accuracy are stated in terms of statistical properties of the data (see the STATLOG results). This suggests that data collection and data analysis should be integrated more tightly with decomposition. However, we expect that this is in general not correct: accuracy can be estimated relatively well without using properties of the data.





References

Aben, M. and van Someren, M.W. (1990). Heuristic Refinement of Logic Programs. In: L.C. Aiello (ed.), Proceedings ECAI-90, London: Pitman, pp. 7-12.

Brazdil, P., Gama, J. and Henery, B. (1994). Characterising the Applicability of Classification Algorithms Using Meta-Level Learning. In: F. Bergadano and L. de Raedt (eds.), Proceedings of ECML-94, Springer Verlag, Berlin, pp. 84-102.

Breuker, J. and van de Velde, W. (1994). CommonKADS Library for Expertise Modelling. IOS Press, Amsterdam.

Brodley, C. (1995). Recursive bias selection for classifier construction. Machine Learning, 20, pp. 63-94.

Brodley, C.E. and Smyth, P. (1997). Applying Classification Algorithms in Practice. Statistics and Computing, 7, pp. 45-56.

Craw, S. and Sleeman, D. (1990). Automating the refinement of knowledge-based systems. In: L.C. Aiello (ed.), Proceedings ECAI-90, London: Pitman, pp. 167-172.

DesJardins, M. (1995). Goal-directed learning: a decision-theoretic model for deciding what to learn next. In: D. Leake and A. Ram (eds.), Goal-Driven Learning, MIT Press.

Engels, R. (1996). Planning Tasks for Knowledge Discovery in Databases; Performing Task-Oriented User Guidance. In: Proceedings of the 2nd Int. Conf. on KDD.

Engels, R., Lindner, G. and Studer, R. (1997). A guided tour through the data mining jungle. In: Proceedings of the 3rd International Conference on Knowledge Discovery in Databases (KDD-97).

Fayyad, U.M., Piatetsky-Shapiro, G. and Smyth, P. (1996). From Data Mining to Knowledge Discovery: An Overview. In: U.M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, pp. 1-37.

Ginsberg, A. (1988). Refinement of Expert System Knowledge Bases: A Metalinguistic Framework for Heuristic Analysis. Pitman.

Graner, N. (1993). The Muskrat system. In: Proceedings of the Second Workshop on Multistrategy Learning, George Mason University.

Kodratoff, Y., et al. Will Machine Learning Solve My Problem? Applied Artificial Intelligence.

Kohavi, R., Sommerfield, D. and Dougherty, J. (1997). Data Mining using MLC++, a Machine Learning Library in C++. International Journal on Artificial Intelligence Tools, vol. 6.

Langley, P. and Simon, H.A. (1994). Applications of Machine Learning and Rule Induction. Communications of the ACM.

Langley, P. (1997). Elements of Machine Learning. Morgan Kaufmann.

Marcus, S. (ed.) (1988). Automating Knowledge Acquisition for Expert Systems. Boston: Kluwer.

McDermott, J. (1988). Preliminary Steps Toward a Taxonomy of Problem Solving Methods. In: S. Marcus (ed.), Automating Knowledge Acquisition for Expert Systems, Kluwer Academic Publishers, Dordrecht (NL).

Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (eds.) (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood, London.

Mitchell, T.M. (1997). Machine Learning. McGraw-Hill, New York.

Morik, K., Wrobel, S., Kietz, J.-U. and Emde, W. (1993). Knowledge Acquisition and Machine Learning. London: Academic Press.

O'Hara, K. and Shadbolt, N. (1996). The thin end of the wedge: Efficiency and the generalised directive model methodology. In: N. Shadbolt, K. O'Hara and G. Schreiber (eds.), Advances in Knowledge Acquisition, Springer Verlag, pp. 33-47.

Polderdijk, J., Verdenius, F., Janssen, L., van Leusen, R., den Uijl, A. and de Naeyer, M. (1996). Quality measurement during the post-harvest distribution chain of tropical products. In: Proceedings of the Congress Global Commercialization of Tropical Fruits, volume 2, pp. 185-195.

Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (CA).

Rudstrom, A. (1995). Applications of Machine Learning. Report 95-018, University of Stockholm.

Schreiber, A.T., Wielinga, B.J. and Breuker, J.A. (eds.) (1993). KADS: A Principled Approach to Knowledge-Based System Development. London: Academic Press.

Shapiro, E.Y. (1982). Algorithmic Program Debugging. ACM Distinguished Dissertations series. Cambridge, Massachusetts: MIT Press.

Shapiro, A. (1987). Structured Induction in Expert Systems. Addison Wesley.

van Someren, M.W., Torres, C. and Verdenius, F. (1997). A Systematic Description of Greedy Optimization Algorithms for Cost Sensitive Generalisation. In: X. Liu and P. Cohen (eds.), Proceedings of IDA-97, Springer Verlag, Berlin (Ge), pp. 247-258.

Sommerville, I. (1995). Software Engineering. Addison-Wesley, UK.

Steels, L. (1990). Components of Expertise. AI Magazine, 11:2, pp. 29-49.

Terpstra, P., van Heijst, G., Wielinga, B. and Shadbolt, N. (1993). Knowledge acquisition support through generalised directive models. In: J.-M. David, J.-P. Krivine and R. Simmons (eds.), Second Generation Expert Systems, Berlin Heidelberg: Springer-Verlag, pp. 428-455.

Top, J.L. (1993). Conceptual Modelling of Physical Systems. PhD thesis, Enschede (NL).

Valente, A. (1995). Planning. In: J. Breuker and W. van de Velde (eds.), CommonKADS Library for Expertise Modelling, IOS Press, Amsterdam.

Verdenius, F. (1996). Managing Product Inherent Variance During Treatment. Computers and Electronics in Agriculture, 15, pp. 245-265.

Verdenius, F. (1997). Developing an Embedded Neural Network Application: The Making of the PTSS. In: B. Kappen and S. Gielen (eds.), Neural Networks: Best Practice in Europe, World Scientific, Singapore, pp. 193-197.

Verdenius, F. and van Someren, M.W. (1997). Applications of Inductive Techniques: a Survey in the Netherlands. AI Communications, 10, pp. 3-20.

Verdenius, F., Timmermans, A.J.M. and Schouten, R.E. (1997). Process Models for Neural Network Application in Agriculture. AI Applications in Natural Resources, Agriculture and Environmental Sciences, 11(3).

Verdenius, F. and Engels, R. (1997). A Process Model for Developing Inductive Applications. Proceedings of Benelearn-97, Tilburg University (NL), pp. 119-12.

Weiss, S.M. and Kulikowski, C.A. (1991). Computer Systems that Learn. Morgan Kaufmann, Palo Alto.

Weiss, S.M. and Indurkhya, N. (1998). Predictive Data Mining. Morgan Kaufmann, San Francisco (CA).

Wirth, N. (1971). Program Development by Stepwise Refinement. Comm. ACM, 14(4), pp. 221-227.

Wirth, N. (1976). Systematic Programming: An Introduction. Englewood Cliffs, NJ: Prentice Hall.