Submitted to: Knowledge Acquisition Workshop 1999

Top-down Design and Construction of Knowledge-Based Systems with Manual and Inductive Techniques



Floor Verdenius
ATO-DLO
PO Box 17
6700 AA Wageningen
The Netherlands
F.Verdenius@ato.dlo.nl

Maarten W. van Someren
Department of Social Science Informatics
University of Amsterdam
Roeterstraat 15
1018 WB Amsterdam
The Netherlands
Maarten@swi.psy.uva.nl

ABSTRACT

In this paper we present the outline of a method for planning the design and construction of knowledge-based systems that combines a divide-and-conquer approach, as commonly used in knowledge acquisition and software engineering, with the use of inductive techniques. Decomposition of a knowledge acquisition problem into sub-problems is guided by the expected costs and benefits of applying elicitation or induction to acquire the knowledge for part of the target knowledge. The method is illustrated with a rational reconstruction of a knowledge acquisition process that involved inductive techniques.

Keywords: Inductive Techniques, Knowledge Acquisition, Learning goals, Decomposition



1. Introduction

Knowledge acquisition involves the formalisation of (human) knowledge about a certain expert task. The aim is to build a system that can execute this task with a performance that is comparable to the human expert. The knowledge on which a system is based can be acquired in different ways. The classical way is to elicit knowledge from a human expert and formalise it in an operational language, e.g. using an expert system shell. This is often extended to acquiring knowledge not only from a single human expert but also from several experts with complementary expertise and from relatively unstructured information such as textbooks and manuals. A different approach, which we shall call the "inductive approach", is based on induction from observations, or machine learning. Expert performance is sampled or observations in a domain are collected, and inductive methods are used to construct a system that performs the task of the human expert or that can make predictions about a domain.

Both the knowledge acquisition and the inductive approach have their strengths and weaknesses. Consequently, for many problems a pure approach is not optimal. Here we briefly review the separate approaches and then mixed approaches.

Knowledge Acquisition

Many design problems in knowledge-based system construction are characterised by the availability of a number of knowledge sources that can be used to construct a system. Examples of sources are documents, human expertise on related tasks, collections of observations or even existing knowledge in computational form.

If a domain involves a complex relation between problem data and solutions then direct knowledge elicitation is not effective: it will lead to questions to a domain expert that are too global and will therefore not result in useful knowledge. For example, if only the possible problems and solutions are known, a knowledge engineer can only ask a very general question like:

How do you find solutions like S from problem data like P?

For a human expert such general questions are hard to answer for a complex domain. If possible problems and solutions and something about intermediate reasoning steps are known, then more specific questions can be asked.

Research in Knowledge Acquisition has led to the formulation of predefined methods and general conceptual models or ontologies. An ontology is an abstract description of concepts that play a role in problem solving in a particular domain. If a method or ontology is found to fit a particular knowledge acquisition problem, then it can act as a basis for a dialogue between knowledge engineer and expert. These methods and ontologies serve two important functions. They provide a common language that can be used to phrase more specific questions to the expert, and they make it possible to decompose the knowledge acquisition problem into sub-problems, a form of divide-and-conquer.

One difficulty with this approach is the selection of an appropriate method and ontology. The range of methods and ontologies is likely to be very large. Another problem is that this approach does not take into account the economic aspect of the process: once a method is selected it may turn out to be very expensive to acquire the knowledge for each component of the method. Below we give a method for design and construction of knowledge-based systems that optimises the use of available resources.

The Inductive Approach

Machine learning technology offers the prospect of (partially) automating the construction of knowledge-based systems. There are several ways to apply machine learning techniques to knowledge acquisition. The most direct approach is to collect a set of examples of problems with solutions that are provided or approved by a human expert and to apply an induction technique to automatically construct a knowledge-based system. This requires no intermediate knowledge. In some cases such data are easy to obtain, but for many applications obtaining data is expensive. If the task is difficult then the underlying relation may be rather complex. In that case many data are needed to acquire adequate knowledge, which makes this option expensive.

Other approaches focus on refining or debugging knowledge that was acquired manually (e.g. Shapiro, 1982; Ginsberg, 1988; Craw and Sleeman, 1990). Either approach can be improved by using domain-specific prior knowledge, for example in the form of descriptions of rules that are to be learned.

Like the elicitation-based approach, direct application of an inductive approach also encounters problems if the knowledge acquisition problem is complex. One complication is that uncovering a complex structure will require many data. Especially if measurements are noisy or the structure in the domain is probabilistic, many data are needed to obtain reliable results. Because for many applications data are difficult and expensive to acquire, and because available explicit knowledge can be exploited, a straightforward inductive approach is often not optimal. Here again we find the need for a structured and economic approach to the acquisition problem. Decomposing the learning problem and learning components separately may reduce the number of examples that are needed (e.g. Shapiro, 1987).

In our method we integrate the elicitation-based approach (including the use of predefined methods and ontologies) with the inductive approach, and we show how optimal use can be made of available sources of knowledge using systematic decomposition of the learning task.

Combining Knowledge Acquisition and Induction

Several authors have presented approaches and techniques for combining inductive techniques and knowledge acquisition. Morik et al. (1993) introduced the principle of balanced co-operation: divide the tasks between human and computer to optimise the combination. This was elaborated in the MOBAL system (Morik et al., 1993), which allowed the user access to all data and generalisations, included meta-rules and other forms of constraints on possible generalisations, and supported automated refinement of knowledge that was explicitly entered by the user (see also Craw and Sleeman, 1990; Ginsberg, 1988; Aben and van Someren, 1990). Other approaches enable the user to visualise aspects of the data to select an appropriate analysis technique (e.g. Kohavi, Sommerfield and Dougherty, 1997). However, these approaches do not address one of the key issues in knowledge acquisition: decomposition of the acquisition problem. They assume acquisition problems that may be large in terms of the number of variables but that can actually be addressed as a single problem without decomposition. Many realistic problems require a divide-and-conquer approach and combined use of different acquisition methods. In this paper we present an approach to such problems that incorporates principles of knowledge acquisition and of induction and that is based on divide-and-conquer and economy.

This paper is organised as follows: Section 2 briefly introduces the MeDIA framework for designing inductive applications. Section 3 illustrates the application of the framework by detailing the design process of a fruit treatment planning system. In Section 4 we discuss the implications and applicability of this method, and some directions for further work.



2. MeDIA: A Method for Design of Inductive Applications

Knowledge acquisition and machine learning differ in another respect, which creates a problem with combining them. Acquisition based on elicitation starts with a phase in which informal representations are used, whereas machine learning projects often start with a large set of data. Our goal is to define a methodology that covers the entire process, from specifying the problem and identifying available sources to a running knowledge-based system. At which point should decisions about the representation and the format of data be taken? At which point should a tool for induction or elicitation be selected?

In line with earlier work (Verdenius and Engels, 1997) we view the development of inductive systems as proceeding in three hierarchical levels according to the Method for Designing Inductive Applications (the MeDIA model, see Figure 1); within these levels, a total of six activities is located (see Table 1):

1. Application Level

   a. Requirements Definition: deriving, from the problem owner and/or potential user, the requirements for the application to deliver

   b. Source Identification: available data and knowledge resources that are relevant for the current domain are identified

   c. Acquisition Planning: a resource-directed decomposition process, to be detailed in this paper

2. Analysis Level

   a. Data Analysis: deriving an explicit description of data characteristics that are required to select and tune inductive techniques

   b. Technique Selection: selecting the appropriate technique for implementing each sub-task

3. Technique Level

   a. Technique Implementation: setting parameters of the selected technique and, if required, delivering a tuned model

The levels are performed sequentially as much as possible. Iteration of design steps may, however, still occur at two locations:

- between acquisition planning and data analysis, and
- between technique selection and technique implementation.

A knowledge base of resources is available for acquisition planning. In this knowledge base, the input and output of resources are defined. Moreover, knowledge is available on whether the knowledge for a component of the target system obtained using a resource can be acquired by means of inductive techniques.

When implementing a technique, the assumptions used in technique selection may prove inaccurate: the selected technique proves inadequate for performing the task, in spite of earlier indications. In that case, a new technique selection has to be made.

For performing the acquisition planning, a divide-and-conquer approach is proposed, configured for optimal use of available sources of knowledge. In both software engineering (e.g. Sommerville, 1995) and knowledge engineering (e.g. Schreiber, 1993) divide-and-conquer approaches are standard. Many authors have defined languages that support top-down development (e.g. Schreiber, 1993; Terpstra, van Heijst, Wielinga and Shadbolt, 1993). However, most methods are not very specific about how to reduce a complex problem into simpler sub-problems, and where decompositions are presented, the choices are mainly motivated by computational efficiency or design modularity.





Table 1: MeDIA activities and their input and output

Requirements Definition
  Input: -
  Output: Functional Requirements; Non-functional Requirements
  Knowledge Base: -

Source Identification
  Input: -
  Output: Domain Ontology; Knowledge Sources; Data (source) Definition
  Knowledge Base: -

Acquisition Planning
  Input: Functional Requirements; Non-functional Requirements; Domain Ontology; Knowledge Sources; Data (source) Definition
  Output: Task Decomposition
  Knowledge Base: Primitive Task KB

Data Analysis and Acquisition
  Input: Data (source) Definition; Domain Ontology; Task Decomposition
  Output: Data Sets (to be collected); Data Source Description
  Knowledge Base: -

Technique Selection
  Input: Data Sets; Data Source Description; Domain Ontology; Task Decomposition; Non-functional Requirements
  Output: Model Representation; Search Method; Estimation Criterion
  Knowledge Base: Task Technique KB

Technique Implementation
  Input: Model Representation; Search Method; Estimation Criterion; Domain Ontology; Non-functional Requirements
  Output: Training Algorithm; Performance Algorithm; Model
  Knowledge Base: Technique Tuning KB


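Table 1 can be read as a dataflow in which each activity consumes only artefacts produced by earlier activities (or supplied externally). The following sketch transcribes the table and checks that property; the consistency check itself is our own illustration, not part of MeDIA:

```python
# Table 1 read as a dataflow: each MeDIA activity consumes only artefacts
# produced by earlier activities. The tuples transcribe the table (activity,
# inputs, outputs); the consistency check itself is our own illustration.

activities = [
    ("Requirements Definition", [],
     ["Functional Requirements", "Non-functional Requirements"]),
    ("Source Identification", [],
     ["Domain Ontology", "Knowledge Sources", "Data (source) Definition"]),
    ("Acquisition Planning",
     ["Functional Requirements", "Non-functional Requirements",
      "Domain Ontology", "Knowledge Sources", "Data (source) Definition"],
     ["Task Decomposition"]),
    ("Data Analysis and Acquisition",
     ["Data (source) Definition", "Domain Ontology", "Task Decomposition"],
     ["Data Sets", "Data Source Description"]),
    ("Technique Selection",
     ["Data Sets", "Data Source Description", "Domain Ontology",
      "Task Decomposition", "Non-functional Requirements"],
     ["Model Representation", "Search Method", "Estimation Criterion"]),
    ("Technique Implementation",
     ["Model Representation", "Search Method", "Estimation Criterion",
      "Domain Ontology", "Non-functional Requirements"],
     ["Training Algorithm", "Performance Algorithm", "Model"]),
]

produced: set[str] = set()
for name, inputs, outputs in activities:
    missing = [i for i in inputs if i not in produced]
    assert not missing, f"{name} needs {missing} before they are produced"
    produced.update(outputs)
print("dataflow is consistent;", len(produced), "artefact types produced")
```

The sequential ordering of the levels follows directly from this property: no activity can start before its inputs exist.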

In this paper, we describe a method for decomposing knowledge acquisition problems into sub-problems that can be solved better than the original problem. Here, the design choices for decomposing a performance task are strongly motivated by the possibility of acquiring an adequate component at reasonable costs. The main differences with standard approaches for the development of knowledge systems are that:

1. a wide range of types of knowledge sources and acquisition methods (e.g. induction, elicitation, decomposition) is considered and evaluated as possible sources of knowledge for a specific component of the target system; the options are evaluated on the basis of cost and accuracy estimates;

2. the decomposition process itself is also directed by the structure of the available knowledge and by economy.

In data mining, decomposition of a problem is normally based on the requirements for data mining, like data selection, data cleaning and data conversion (see for example Engels, 1997; Fayyad et al., 1996). Here we base decomposition on a deconstruction of the required functionality into functionalities of reduced size and complexity, to minimise the acquisition costs.

Compared to knowledge acquisition methods such as CommonKADS (Schreiber et al., 1999), our approach defines several extensions. First, it considers data as a potential source for acquiring knowledge. In CommonKADS, human experts are considered the main source of knowledge, and most tools and techniques focus on extracting knowledge from human experts. Second, our approach explicitly takes the costs and benefits of acquisition into account. Finally, we explicitly consider induction as an acquisition method when decomposing tasks. Decomposition of a task into subtasks can take place if data types for the subtasks are available in the domain ontology. Moreover, at least one primitive task needs to have the same input or output structure as the subtask to decompose. When this is not possible, or not preferred by the user, manual decomposition has to be performed by the user.

Compared to other approaches for designing inductive applications (e.g. Brodley and Smyth, 1997), our model explicitly separates the design at a conceptual level from the actual implementation. At the application level, implementation issues are not assessed.

Requirement Definition and Source Identification

The first steps are to define the (functional and non-functional) system requirements, to describe the domain ontology, and to identify and collect resources that can supply knowledge for the target system. With functional requirements, we refer to the input-output mapping of the problem to be solved. This is typically a (formal or verbal) definition of available and demanded data items and their semantics. Non-functional requirements refer to all aspects of the solution that are not relevant for the mapping, but that still have an effect on the acceptance and satisfaction of the problem owner with the offered solution. Examples of non-functional requirements are the hardware and/or software platform, the response time or the preferred layout of the user interface. The domain ontology is the set of relevant concepts that can be used to express problems and solutions. In CommonKADS (Schreiber et al., 1999), the notion of domain ontology is further defined.

The knowledge resources are collected in the next step. This is done largely on the basis of the domain ontology and the explicit information in the requirement definition.

Acquisition planning

The next step is to plan the use of the available resources for the actual knowledge acquisition. The result of this step is a plan that specifies which resources are to be used to acquire the knowledge for the components of a target system. The structure in which these components are connected to form the final system is also produced as output.

In general, a knowledge acquisition problem can be solved in three ways:

- by direct elicitation of the knowledge from a source, e.g. a human expert or a document,
- by induction from observations, or
- by further decomposition into sub-problems.

Decomposition continues until the sub-problems can be mapped onto acquirable knowledge resources: a set of resources is found, connectable in a data-flow structure, from which knowledge can be acquired to perform the task of the target system. When a task is directly mapped onto a source that is not a decomposition source, it is referred to as a primitive task. Our notions of task and primitive task are close to the notions of task and inference as used in knowledge acquisition (e.g. CommonKADS, Schreiber et al., 1999).

Each of the options, direct elicitation, induction from observations and decomposition, involves further choices. The main criterion for these choices is acquisition economy: the balance between the expected costs of implementing the option and the expected accuracy of the resulting knowledge (O'Hara and Shadbolt, 1996). In the case of direct elicitation and other manual acquisition methods, the accuracy depends on the quality of the sources (e.g. human experts) and of the communication process. In the case of induction from observations, the accuracy of the result will depend on the availability of reliable data, the complexity of the actual relations and on knowledge about the type of relation that is to be induced. If many reliable data are available, if the underlying relation is not very complex and if the type of function is known, then induction is likely to be successful. Otherwise there is a risk of constructing knowledge that is incorrect.

The idea of decomposition into sub-problems is based directly on the top-down approach of stepwise refinement (e.g. Wirth, 1971, 1976). The main difference is that in software engineering the main principle that guides decomposition is minimising the complexity of the resulting systems, thereby supporting activities like debugging, maintaining and re-using the system. In knowledge acquisition, the acquisition of the knowledge is usually the main factor that determines the costs and benefits, and therefore this guides the decomposition. Decomposition is useful if (cheap and accurate) sources of knowledge are available for sub-tasks of the overall knowledge acquisition task but not for the overall task. For example, there may be abundant data for one sub-problem and a communicative expert for another sub-problem, but not for the problem as a whole. This is a reason to split the knowledge acquisition problem into sub-problems that are then acquired separately. Another situation where a decomposition can be cheaper and give more accurate results than a single-step inductive approach is when there is no prior knowledge to bias induction on the overall problem.

Acquisition Economy

To decide if decomposition is a good idea, we compare the expected costs and benefits of acquisition with and without decomposition. The benefit of an acquisition operation (elicitation-based or induction-based) is the accuracy of the resulting knowledge. The costs depend on the acquisition process. In the case of elicitation, they involve time of the expert and of the knowledge engineer, equipment, etc. In the case of an inductive approach, they involve the costs of collecting and cleaning data and of applying an induction system to the result. If we decompose the acquisition problem, the costs and benefits are simply the sum of those of the sub-problems. So we get for elicitation/induction:

    EG(elicitation/induction) = w1 * EA(elicitation/induction) - EC(elicitation/induction)

and for decomposition:

    EG(compound operation) = w2 * min_i EA(operation_i) - sum_i EC(operation_i)

with:

    EG = expected gain
    EC = expected costs
    EA = expected accuracy



Here the weight parameters w1 and w2 indicate the importance of accuracy relative to the real costs of acquisition; if the accuracy is translated into annual benefits, w1 and w2 are related to return on investment. The expected accuracy of a compound acquisition is derived from the minimal accuracy of its components, which is a pessimistic estimate. As we argued above, in some cases elicitation is almost impossible because the expert cannot answer very global questions. This means that the costs are high and the accuracy of the knowledge is 0. In machine learning applications the costs of actually running a system are usually rather small compared to other costs, such as designing the target system, collecting data and tuning the induction tool, so these can be left out.
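The comparison can be sketched in code. Only the two formulas above come from the method; the Option class, the example accuracies, costs and the weight value are illustrative assumptions:

```python
# Illustrative sketch of the acquisition-economy comparison above.
# Only the formulas EG = w * EA - EC (single step) and
# EG = w * min(EA_i) - sum(EC_i) (decomposition) come from the text;
# the names and numbers below are hypothetical.

from dataclasses import dataclass

@dataclass
class Option:
    """One way to acquire the knowledge for a (sub-)problem."""
    name: str
    expected_accuracy: float  # EA, in [0, 1]
    expected_cost: float      # EC, in the same units as w * EA

def single_gain(op: Option, w: float) -> float:
    # EG = w * EA - EC for a one-step acquisition (elicitation or induction).
    return w * op.expected_accuracy - op.expected_cost

def compound_gain(ops: list[Option], w: float) -> float:
    # Decomposition: accuracy is the pessimistic minimum over components,
    # costs are summed over all sub-problems.
    return (w * min(op.expected_accuracy for op in ops)
            - sum(op.expected_cost for op in ops))

# Hypothetical numbers: induce the whole task in one step, or split it into
# an elicited sub-task and an induced sub-task.
whole = Option("induce whole task", expected_accuracy=0.5, expected_cost=2.0)
parts = [Option("elicit sub-task A", 0.9, 0.6),
         Option("induce sub-task B", 0.8, 0.8)]

w = 3.0  # weight of accuracy relative to cost (cf. the ROI discussion above)
print(single_gain(whole, w))    # EG of the one-step option
print(compound_gain(parts, w))  # EG of the decomposition: higher, so decompose
```

With these numbers the decomposition wins even though its costs are higher, because the weakest component is still far more accurate than the single-step option.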

The Decomposition Process

A decomposition is constructed by:

- inserting a source description that is connected to one or more types of data in the current goal,
- adding or deleting a connection in the data-flow structure, or
- inserting a method (a sub-procedure) for a component in the data-flow structure.

The method for decomposing a knowledge acquisition problem is based on the idea that the reasons for decomposition given above are applied in the order given above. The method is a form of best-first search that uses expected costs and benefits to evaluate candidate decompositions.

In the case of further decomposition, the method is applied recursively to the sub-problems. The algorithm is depicted as Algorithm 1. If costs and accuracies cannot be estimated, the alternative is to perform a pilot study to assess them. In the context of elicitation this amounts to performing elicitation on part of the task and evaluating the result. In the context of induction it amounts to comparative studies by cross-validation. The main goal of such studies is to select the best techniques in terms of the balance between costs and accuracy expressed above.

Data Analysis



In some cases, the resource can simply be included in the target system, but in most cases the knowledge must be "extracted" from the resource using an acquisition technique. The acquisition plan specifies the resource to be used but not the acquisition technique. Data analysis applies to situations in which acquisition means induction from data. The purpose of this step is to measure properties of the data that are relevant for selecting a technique. Selecting and applying a technique will be the final step. Currently there is no comprehensive and practically applicable method for this. As observed by Verdenius and van Someren (1997), many application projects that use inductive techniques hardly reason about the selection of a technique. Often, designers only consider one single induction technique. If necessary, the problem is transformed to make it suitable for the chosen technique.

Several studies report experiments on the relation between properties of the data and the performance of learning systems. The ESPRIT project Machine Learning Toolbox (e.g. Kodratoff et al., 1994) gathered heuristics for technique selection for classification. The heuristics focus on several aspects of the learning problem. Based on descriptions of various aspects of the learning process, the user is provided with a number of alternative techniques that can be applied to the learning task at hand. Relevant aspects include, besides the aspects of the data that we mentioned above, the nature of the learning task, uncertainty, the availability of background knowledge, and user interaction. The heuristics are implemented in an automated tool for user support (Craw et al., 1994). The STATLOG project (Michie et al., 1994) provides an experimental comparison of more than twenty different classification techniques on some thirty different data sets. The analysis of the results indicates strong and weak points of the different techniques. Moreover, additional analysis of the STATLOG results (Brazdil et al., 1994) generalises over the results in an attempt to formulate comprehensive heuristics.

In general, analysis consists of selecting a form for the hypothesis and transforming the data into a suitable format so that an appropriate learning method can be applied. Langley (1996) gives an extensive description of various forms of hypotheses and corresponding learning methods, but less is known about which properties of a dataset indicate which hypothesis and which learning method are optimal. Currently this problem is handled by experimentally trying out methods and evaluating them by cross-validation. We expect that a better understanding of the properties of a dataset that discriminate between different classes of hypotheses will enable a more rational selection of the form of the hypothesis.
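The try-and-evaluate loop mentioned above can be sketched as follows. The two toy induction techniques (majority class and 1-nearest neighbour) and the dataset are our own illustrations, not taken from the projects cited:

```python
# A minimal sketch of technique selection by cross-validation, as described
# above: run each candidate induction technique on the same folds and keep
# the most accurate one. The toy "techniques" and data are illustrative only.

def majority(train):
    # Induce a constant classifier predicting the most frequent class.
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top

def one_nn(train):
    # Induce a 1-nearest-neighbour classifier on a single numeric feature.
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def cross_val_accuracy(technique, data, k=4):
    """k-fold cross-validation accuracy of an induction technique."""
    folds = [data[i::k] for i in range(k)]
    correct = total = 0
    for i in range(k):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        model = technique(train)
        for x, y in folds[i]:
            correct += (model(x) == y)
            total += 1
    return correct / total

# Toy dataset: x < 5 -> class 0, x >= 5 -> class 1.
data = [(x, int(x >= 5)) for x in range(10)]

scores = {name: cross_val_accuracy(t, data)
          for name, t in [("majority", majority), ("1-NN", one_nn)]}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

In MeDIA terms, the cross-validation score would feed the expected-accuracy estimate EA for each candidate technique, to be balanced against its cost.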


3. Example: The Product Treatment Support System



We illustrate the method with the example introduced above, on planning systematic treatments for fruit ripening (Verdenius, 1996). This problem involved both knowledge acquisition and machine learning. The project was not run with this method in mind, but our description can be viewed as a post hoc design rationale. The initial acquisition goal is:

construct a knowledge system that takes as input information on a batch of fruits that arrives from abroad and that produces a recipe for storing the fruits

Figure 3: Description of the input and output that defines the learning problem





Requirement Definition and Source Identification

Figure 3 shows the overall learning problem. The outcome of the task, that is, the solution to this planning problem, is a treatment recipe. A recipe is a prescription of the values c_i,j for a set of relevant treatment conditions c_i that applies to a specific time interval. The time interval is subdivided into fixed-duration time slices j. Storage conditions include attributes like temperature, relative humidity and ethylene concentration; the relevance of conditions is determined by the product type.


Figure 4: Part of the ontology

The first step is to identify available sources of knowledge for this task. Figure 4 illustrates part of the domain ontology as developed for this application. This should be read as a schema that can be instantiated with specific knowledge. The following data about a batch of fruit are available (grouped):

- batch data, such as origin, product cultivar, etc.
- commercial data, mainly the required due date of the product treatment
- product data, being a number of values for attributes such as colour, shape, firmness, weight, etc., describing per individual product in a batch various quality aspects at the start of the recipe. It is assumed here that a fixed final quality is delivered for all recipes.

Table 2 lists some of the sources that are available in the fruit storage planning domain. These sources cover the application of machine learning, knowledge elicitation from experts and extraction from documents. The sources include information that is not part of the original problem. For example, the source Sample Products refers to Batch (of Fruit), Administrative Data and Sampling Instructions, and it delivers Quality Data. The latter two are not mentioned in the original problem statement.

In this stage, only the resources are identified; no effort is made to extract the actual knowledge. At this level the sources are not bound to any of the acquisition means. Available resources may not be used, and actual acquisition of the knowledge is postponed until a complete plan is available. The actual knowledge is to be obtained by applying an acquisition technique to the resource: a human expert, a document, a set of data or an existing system. At this point, no choice for a technique is made either, because this will depend on details of the resource that are not relevant for this stage of the design process.

Table 2: Some sources of knowledge for the example

Estimate Quality 1
  Determined by: [p_i,x]_n
  Conclusions: E(p_batch,x)
  Cost: 1.0; Acc: 0.95
  Constraints: n in [60, 200]

Estimate Quality 2
  Determined by: [p_i,x]_n
  Conclusions: E(p_batch,x)
  Cost: 0.2; Acc: 0.92
  Constraints: n in [5, 10]

Select Product
  Determined by: [p_i,x]_n, m_selection
  Conclusions: [p_i,x]_m
  Cost: 0.4; Acc: 0.98
  Constraints: n in [60, 200], m in [5, 15]

Specify Recipe
  Determined by: E(p_batch,l), due date, origin, cultivar, m_specification
  Conclusions: E(rc_batch,i)
  Cost: 0.6; Acc: 0.8

Design Recipe
  Determined by: rc_batch,i, due date, plan strategy, m_design
  Conclusions: c_i,j
  Cost: 0.4; Acc: 0.9
  Constraints: i = 1 ... recipe duration

Postulate Recipe 2
  Determined by: all available data
  Conclusions: c_i,j
  Cost: 1.0; Acc: 0.1
  Constraints: <induce from all available data>

Postulate Recipe 3
  Determined by: E(p_batch,x), origin, cultivar, history
  Conclusions: c_i,j
  Cost: 0.9; Acc: 0.15
  Constraints: <induce from selected and pre-processed data>

Postulate Requirements
  Determined by: [p_i,x]_n, history, origin, cultivar, due date
  Conclusions: E(rc_batch,i)
  Cost: 0.8; Acc: 0.6

Postulate Recipe 1
  Determined by: due date
  Conclusions: c_i,j
  Cost: 0.0; Acc: 0.2
  Constraints: <standard recipes>

Adapt Recipe
  Determined by: c_i,j, m_adapt, E(p_batch,i)
  Conclusions: c_i,j
  Cost: 0.2; Acc: 0.4
  Constraints: <apply heuristic in adapt model>

Acquisition is guided by economic principles, and therefore an estimate must be made of the costs and the expected value (or accuracy) of the knowledge that can be acquired from a resource. Accuracy ought to be estimated independently of the technique that will be used.

Some sources may have no costs if the knowledge already exists. Moreover, note that for a (sub)problem and (sub)solution combination there may be more than one way to acquire the knowledge. For example, it may be possible to directly acquire knowledge that relates External Conditions and Planning Destination to a Detailed Recipe.

The costs of using these resources and the expected accuracies were estimated using rules of thumb. For example, Specify Recipe involves finding a detailed recipe specification from product data, batch data and the required due date of the batch. The cost is estimated from the availability of resources and the complexity of the task. The size of the space defined by the properties in Specify Recipe gives an indication of the number of data that must be acquired to obtain a certain accuracy in the case of an inductive approach. This in turn gives an estimate of the costs. The relation is likely to be complex, and this suggests that many cases are needed. Costs and accuracy of an elicitation approach are estimated from the time that it takes to acquire the expertise for a task. If this is unknown then a rough estimate is made based on the complexity, as for the inductive approach. The accuracy is estimated from a pilot experiment.

Acquisition Planning

Figure 5 shows the final decomposition, and Table 3 gives an overview of the sources and techniques that were actually used to acquire knowledge for the various components. Below we reconstruct the process that led to this decomposition and choice of acquisition methods.

Figure 5: Decomposition of the knowledge acquisition problem

The estimated costs of acquiring the complete system by elicitation are very high (because there is no expert) and the same is true for induction. Without further analysis, there are about 15-40 input variables and between 12 and 52 output variables (depending on the duration of the storage). The relation is therefore likely to be very complex and it would take many data to find an initial model, if it is possible at all. We estimate costs and accuracies of single-step acquisition (elicitation or induction). The estimates are in Table 4. The last column gives the expected gain using a value 3 for the weight values w_i, as a reasonable value for the ROI (Return On Investment).

We now consider decomposition. The available sources and the causal and temporal structures define a number of possible decompositions of the initial knowledge acquisition problem. There are many possibilities; here we describe some of them, with estimates of the expected accuracy and acquisition costs.

Decomposition 1

From the initial problem of Figure 3, the first decomposition step is to abstract from the quality data on individual products to the quality of the batch. This requires a number of measurements. Taking the average of a number of measurements requires a large product sample (see Verdenius, 1996). The resulting decomposition is depicted in Figure 6. The expected gain of this decomposition is: 3 * 0.35 - 1.6 = -0.55.


Figure 6: Decomposition 1



Decomposition 2

The next step in the decomposition aims at overcoming the weakest point in decomposition 1. Postulate Recipe 3 has a poor cost/accuracy ratio. It can be replaced by a two-step approach, where recipe specification is followed by recipe design. The resulting decomposition is shown in Figure 7. The expected gain of this option is: 3 * 0.8 - 2 = 0.4. This is already a non-negative outcome, but still worse than the original problem formulation (relative to the ROI).


Figure 7: Decomposition 2

Final Decomposition

The final decomposition is again found by first identifying the weakest point in the best-so-far decomposition, and then identifying a task combination with a better pay-off. Here, it appears that estimating the product quality can be optimised by first drawing a small sample from the total data set, and using these data to estimate the quality. Due to the sample reduction the benefit increases. The resulting decomposition was shown in Figure 2. The expected gain of this is: 3 * 0.8 - 1.6 = 0.8.
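The three gain calculations above all instantiate the same formula, expected gain = w * accuracy - cost, with w = 3. A minimal sketch, using the (accuracy, cost) pairs quoted in the text:

```python
W = 3  # ROI weight w_i, as in the text

def expected_gain(accuracy, cost, w=W):
    """Expected gain of a decomposition: weighted accuracy minus cost."""
    return w * accuracy - cost

# (accuracy, cost) estimates quoted for the three decompositions above.
decompositions = {
    "decomposition 1": (0.35, 1.6),
    "decomposition 2": (0.80, 2.0),
    "final":           (0.80, 1.6),
}

for name, (acc, cost) in decompositions.items():
    print(f"{name}: {expected_gain(acc, cost):+.2f}")
```

The final decomposition scores highest, matching the choice made in the text.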

Acquiring Components

We can now concentrate on the actual acquisition of the knowledge for the components. The first component, Select Products, implements a sampling procedure. For each product, a number of easy-to-assess data items are available. Based on these items, the product is classified as being either near_batch_mean or far_from_batch_mean. This is a classification task. Historic data on the relation between product descriptors and the batch mean is available. On the other hand, for humans, looking at this relation is fairly uncommon. Consequently, elicitation of knowledge from human experts is not an option (low accuracy vs. high costs). Data analysis may show that the underlying type of function is relatively simple, although not fully orthogonal to the data axes. Interpretability may be a (non-functional) requirement, as the resulting knowledge has to be applied by human experts in order to select fruits. In the actual planner, the component has been implemented by means of a decision rule learner. The rules are extracted and handed over to a human expert to perform the actual selection on location.
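To make the classification task concrete, here is a minimal one-rule learner in the spirit of the decision rule learner mentioned above. The single descriptor, the data and the learner itself are invented for illustration; the actual system learned rules from historic batch data with several descriptors.

```python
def learn_threshold_rule(examples):
    """examples: list of (descriptor_value, label) pairs.
    Return the (threshold, label_below, label_above) rule with fewest errors."""
    best = None
    values = sorted(v for v, _ in examples)
    # Candidate thresholds: midpoints between consecutive descriptor values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    for t in candidates:
        for below, above in [("near_batch_mean", "far_from_batch_mean"),
                             ("far_from_batch_mean", "near_batch_mean")]:
            errors = sum(1 for v, y in examples
                         if (below if v <= t else above) != y)
            if best is None or errors < best[0]:
                best = (errors, t, below, above)
    return best[1:]

# Tiny invented data set: small descriptor values lie near the batch mean.
data = [(0.1, "near_batch_mean"), (0.2, "near_batch_mean"),
        (0.3, "near_batch_mean"), (0.8, "far_from_batch_mean"),
        (0.9, "far_from_batch_mean")]
t, below, above = learn_threshold_rule(data)
print(f"if descriptor <= {round(t, 3)} then {below} else {above}")
```

The learned rule is directly interpretable, which is what allows it to be handed over to a human expert for selection on location.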

The next component is the actual assessment of the batch quality. This is simply averaging the measurements. The main difference between the two available Estimate Quality sources lies in the number of (expensive) measurements that is required in the case of unselected and selected estimation. The former requires between 60 and 200 expensive measurements to be taken. In the latter case, only 5-10 are needed. This does not dramatically affect the accuracy, but it dramatically reduces the costs.
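A toy illustration of this cost argument, with invented numbers: the batch, the measurement noise and the sample size are assumptions, and random sampling stands in here for the rule-based product selection described above.

```python
import random

random.seed(0)
# Invented batch of 120 products with a true quality around 10.0.
batch = [random.gauss(10.0, 1.0) for _ in range(120)]

def estimate_quality(measurements):
    """Batch quality is simply the average of the measurements."""
    return sum(measurements) / len(measurements)

full_estimate = estimate_quality(batch)   # 120 expensive measurements
sample = random.sample(batch, 8)          # only 8 measurements
sample_estimate = estimate_quality(sample)

print(f"full:   {len(batch):3d} measurements -> {full_estimate:.2f}")
print(f"sample: {len(sample):3d} measurements -> {sample_estimate:.2f}")
```

Both estimates land near the true batch quality, while the number of expensive measurements drops by more than an order of magnitude.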

For the acquisition of the recipe specification, the two options of elicitation or induction must be evaluated. Human experts are not used to specifying recipes at batch level, i.e. expertise is not available. Historic data is available for induction of the required knowledge. On the input side, 21 attributes are taken as input. The size of the output space is limited (in the actual fruit planning system, only 1 parameter was output; a maximum of 4 output values can be imagined). Based on a comparison between linear and non-linear models, a preference was developed for non-linear models. These have been implemented in the form of a neural network.
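The model comparison mentioned above can be sketched as follows: fit a linear and a non-linear (here quadratic) model to invented recipe data by least squares and compare the fit. The data-generating curve is an assumption; in the actual system the non-linear model was a neural network trained on historic batch data.

```python
# Invented data: a recipe parameter depending non-linearly on one attribute.
xs = [i / 10 for i in range(21)]                 # attribute values 0.0 .. 2.0
ys = [1.0 + 0.5 * x - 0.8 * x * x for x in xs]   # assumed non-linear relation

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via normal equations and elimination."""
    n = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):                          # forward elimination
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coeffs = [0.0] * n
    for i in reversed(range(n)):                  # back substitution
        coeffs[i] = (b[i] - sum(A[i][j] * coeffs[j]
                                for j in range(i + 1, n))) / A[i][i]
    return coeffs

def mse(coeffs, xs, ys):
    preds = [sum(c * x ** i for i, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

linear = fit_polynomial(xs, ys, 1)
quadratic = fit_polynomial(xs, ys, 2)
print("linear MSE:   ", mse(linear, xs, ys))
print("quadratic MSE:", mse(quadratic, xs, ys))
```

The non-linear model fits the invented data essentially perfectly while the linear model leaves a substantial residual, which is the kind of comparison that motivated the preference for non-linear models.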



4. Discussion

We presented a rational reconstruction of decisions to use machine learning in a knowledge acquisition context. Applications of machine learning to knowledge acquisition involve more than selecting and applying an appropriate induction tool. In general, knowledge or data are not, or only partially, available, and decisions must be taken on how to acquire them. Knowledge acquisition problems are often better solved using a divide-and-conquer approach that reduces the overall problem to sub-problems that can be solved by machine learning or direct elicitation. This process of divide-and-conquer is guided by estimates of the costs of the acquisition process and of the expected accuracy of the result.

In this section, we discuss the relation between the approach advocated here and some of the approaches that are of use for knowledge acquisition or induction. Finally, we discuss options for further work.

Comparison with other methods

Knowledge acquisition methods

Many existing knowledge acquisition methods rely heavily on the idea of decomposition (e.g. Terpstra, 1993; Schreiber, 1993a; Marcus, 1988). However, these methods focus on modelling languages and rarely make explicit the underlying principles that are needed for a rational application of the methods. These methods also do not cover the use of inductive techniques. Here we reconstruct the rationale behind these methods and use this to extend them towards the use of machine learning methods. We presented criteria and a method for decomposing knowledge acquisition problems into simpler sub-problems and illustrated this with a reconstruction of a real-world application. This method can be applied to inductive methods, knowledge elicitation and other manual acquisition methods.

In modern approaches to knowledge acquisition, especially in CommonKADS, the starting point for divide-and-conquer approaches is identified from libraries of standard models. For example, suppose that the acquisition problem is to construct a system that can design storage recipes for fruits. The knowledge engineer may decide to adopt a model from a library of methods (Breuker and van de Velde, 1994). First, the problem is specified as before:

Input: Fruits Characteristics, Current Quality, Required Quality, Recipe Duration

Output: Storage Recipe, i.e. condition set-points for a series of time-slices

The KADS library offers the following models:

  Name                    Input                                   Output
  Design                  Needs and desires                       Design solution
  Configuration           Components, required structure,         Configuration
                          constraints, requirements
  Planning                Initial state, goal state, world        Plan
                          description, plan description,
                          plan model
  Assignment, scheduling  Components, resources                   Assignment

It is not obvious which of these is appropriate here. Recipe Duration can be viewed as needs and desires, constraints, requirements or plan descriptions. Fruits Characteristics and Product Quality do not have an immediate counterpart in the terminology above. The Storage Recipe corresponds most closely to an assignment, although it can also be viewed as a plan, a design solution or a configuration. Although assignment and scheduling sounds like a good choice, the models for this type of task concern the allocation of resources to tasks in a schedule. This does not correspond to our task. Planning is a better term. The inputs of the most general model for planning (Valente, 1995) are: initial state, goal state, world description, plan description and plan model. A plan is an ordered set of actions that starts in the initial state and ends with a state that satisfies the requirements of a goal state. The world knowledge describes general information about the world in which the actions will take place.

In our example, Fruits Characteristics and Product Quality can be viewed as the "initial state". However, the storage recipe does not involve discrete states and therefore a planning process is problematic. Even when the process is somehow discretised, there are very many possibilities and the goal provides little guidance for the evaluation of intermediate states. Another problem is that, if we compare this to the available resources in Table 2, we see that the resulting model is not coherent. The Fruits Characteristics and Current Quality are not the description of the initial state parameter of the planning operators. The approach outlined in Breuker and van de Velde does not tell us what to do now. An obvious step is to apply the whole approach recursively to the task of finding the input of the planning operators from Fruits Characteristics and Current Quality. We shall not pursue this here, but we note that the planning model cannot actually be applied because of the continuous character of the operators and the process, which is not mentioned in the description of the model as a prerequisite. Moreover, the analysis process is about the same as that of our approach. This is because the data-flow structure of the available knowledge is of much more importance at this stage than the structure of the data and the knowledge. Our approach postpones the choice between discrete models and continuous models until later and only then selects a modelling technique.

Inductive Methods

Compared with inductive engineering methods, our methodology has a broader scope than most methodologies. MeDIA includes the identification of resources, and takes into account economic factors and the structuring of the acquisition problem. Machine learning technology plays a specific role in the overall method. A straightforward inductive approach to this problem would probably have been more expensive and less successful. The reason lies in the complexity of the relation between the "raw" data about a batch of fruits, its destination and the recipe, and in the costs of collecting data.



Further Work

The main "hole" in the methodology is the selection of a model for the hypothesis and the related data transformation. We intend to review the literature on this question and summarise the state of the art. After this, we intend to carry out more empirical evaluations of the methodology.



5. Conclusions

The MeDIA approach is based on a separation of planning and implementation of the knowledge acquisition process and on a "divide and conquer" approach to the planning problem. This approach is possible if enough information about sources of knowledge is available. This information can often be obtained by heuristics and cheap measurements on the data. In knowledge acquisition, these are part of the "experience" of knowledge engineers. In machine learning and in statistical data analysis, rules of thumb and experience are used to estimate the expected accuracy of the result of applying an induction system. For example, for many statistical techniques, rules of thumb relate the number of variables, the complexity of the function to be induced and the number of data to an estimate of accuracy. The main alternative, if there is no prior knowledge, is currently a "reactive" approach: the expected accuracy of applying an operator is determined empirically by trying it out. For inductive techniques, this is done by cross-validation, resulting in an estimate of the accuracy; in knowledge elicitation, by simply asking an expert to provide the knowledge. If this fails, it is concluded that decomposition is necessary. See Brodley (1995) for a method following this approach. Graner and Sleeman (1993) follow a similar approach in the context of knowledge acquisition. Their model does not include search through possible decompositions or the use of estimated costs and accuracies.
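The "reactive" accuracy estimate for inductive techniques can be sketched with a toy leave-one-out cross-validation. The classifier (a 1-nearest-neighbour rule) and the data are assumptions for illustration; the point is only that accuracy is estimated empirically by trying the inductive operator out.

```python
def one_nn(train, x):
    """Predict the label of the nearest training example."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def loo_accuracy(data):
    """Leave-one-out cross-validation: try the operator out empirically."""
    hits = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]   # hold one example out
        hits += one_nn(train, x) == y
    return hits / len(data)

# Invented one-dimensional data set with two well-separated classes.
data = [(0.1, "a"), (0.2, "a"), (0.3, "a"),
        (0.8, "b"), (0.9, "b"), (1.0, "b")]
print(loo_accuracy(data))  # well-separated classes -> 1.0
```

A low estimate from such a trial would, in the reactive approach, trigger the conclusion that decomposition is necessary.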

The method outlined here can be extended to include the expected gain of having the resulting system. This would give a more comprehensive model, including both the costs of acquisition and the costs of having and using the acquired knowledge. See van Someren et al. (1997) for a model of induction methods that includes the costs of measurements and the costs of errors, in the context of learning decision trees. These two models can be integrated into a single model; see for example DesJardins (1995) for a similar model for robot exploration.

The MeDIA method involves decomposition before formalisation and data analysis (except when data analysis detects the need for different types of hypotheses and thus leads to decomposition). Some heuristics for the estimation of expected accuracy are stated in terms of statistical properties of the data (see the STATLOG results). This suggests that data collection and data analysis should be integrated more tightly with decomposition. However, we expect that this is in general not correct: accuracy can be estimated relatively well without using properties of the data.





References



Aben, M. and van Someren, M.W. (1990), Heuristic Refinement of Logic Programs, in: L.C. Aiello (ed.), Proceedings ECAI-90, London: Pitman, pp. 7-12.

Brazdil, P., Gama, J. and Henery, B. (1994), Characterising the Applicability of Classification Algorithms Using Meta-Level Learning, in: F. Bergadano and L. de Raedt (eds.), Proceedings of ECML-94, Springer Verlag, Berlin, pp. 84-102.

Breuker, J. and van de Velde, W. (1994), CommonKADS Library for Expertise Modelling, IOS Press, Amsterdam.

Brodley, C. (1995), Recursive Bias Selection for Classifier Construction, Machine Learning, 20, pp. 63-94.

Brodley, C.E. and Smyth, P. (1997), Applying Classification Algorithms in Practice, Statistics and Computing, 7, pp. 45-56.

Craw, S. and Sleeman, D. (1990), Automating the Refinement of Knowledge-Based Systems, in: L.C. Aiello (ed.), Proceedings ECAI-90, London: Pitman, pp. 167-172.

DesJardins, M. (1995), Goal-Directed Learning: A Decision-Theoretic Model for Deciding What to Learn Next, in: D. Leake and A. Ram (eds.), Goal-Driven Learning, MIT Press.

Engels, R. (1996), Planning Tasks for Knowledge Discovery in Databases; Performing Task-Oriented User Guidance, in: Proceedings of the 2nd Int. Conf. on KDD.

Engels, R., Lindner, G. and Studer, R. (1997), A Guided Tour through the Data Mining Jungle, in: Proceedings of the 3rd International Conference on Knowledge Discovery in Databases (KDD-97).

Fayyad, U.M., Piatetsky-Shapiro, G. and Smyth, P. (1996), From Data Mining to Knowledge Discovery: An Overview, in: U.M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, pp. 1-37.

Ginsberg, A. (1988), Refinement of Expert System Knowledge Bases: A Metalinguistic Framework for Heuristic Analysis, Pitman.

Graner, N. (1993), The Muskrat System, in: Proceedings of the Second Workshop on Multistrategy Learning, George Mason University.

Kodratoff, Y., et al., Will Machine Learning Solve My Problem?, Applied Artificial Intelligence.

Kohavi, R., Sommerfield, D. and Dougherty, J. (1997), Data Mining using MLC++, a Machine Learning Library in C++, International Journal on Artificial Intelligence Tools, vol. 6.

Langley, P. and Simon, H.A. (1994), Applications of Machine Learning and Rule Induction, Communications of the ACM.

Langley, P. (1997), Elements of Machine Learning, Morgan Kaufmann.

Marcus, S. (ed.) (1988), Automatic Knowledge Acquisition for Expert Systems, Boston: Kluwer.

McDermott, J. (1988), Preliminary Steps Toward a Taxonomy of Problem Solving Methods, in: S. Marcus (ed.), Automating Knowledge Acquisition for Expert Systems, Kluwer Academic Publishers, Dordrecht (NL).

Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (eds.) (1994), Machine Learning, Neural and Statistical Classification, Ellis Horwood, London.

Mitchell, T.M. (1997), Machine Learning, McGraw-Hill, New York.

Morik, K., Wrobel, S., Kietz, J.-U. and Emde, W. (1993), Knowledge Acquisition and Machine Learning, London: Academic Press.

O'Hara, K. and Shadbolt, N. (1996), The Thin End of the Wedge: Efficiency and the Generalised Directive Model Methodology, in: N. Shadbolt, K. O'Hara and G. Schreiber (eds.), Advances in Knowledge Acquisition, Springer Verlag, pp. 33-47.

Polderdijk, J., Verdenius, F., Janssen, L., van Leusen, R., den Uijl, A. and de Naeyer, M. (1996), Quality Measurement during the Post-Harvest Distribution Chain of Tropical Products, in: Proceedings of the Congress Global Commercialization of Tropical Fruits, volume 2, pp. 185-195.

Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo (CA).

Rudstrom, A. (1995), Applications of Machine Learning, Report 95-018, University of Stockholm.

Schreiber, A.T., Wielinga, B.J. and Breuker, J.A. (eds.) (1993), KADS: A Principled Approach to Knowledge-Based System Development, London: Academic Press.

Shapiro, E.Y. (1982), Algorithmic Program Debugging, ACM Distinguished Dissertations series, Cambridge, Massachusetts: MIT Press.

Shapiro, A. (1987), Structured Induction in Expert Systems, Addison Wesley.

van Someren, M.W., Torres, C. and Verdenius, F. (1997), A Systematic Description of Greedy Optimization Algorithms for Cost Sensitive Generalisation, in: X. Liu and P. Cohen (eds.), Proceedings of IDA-97, Springer Verlag, Berlin (Ge), pp. 247-258.

Sommerville, I. (1995), Software Engineering, Addison-Wesley, UK.

Steels, L. (1990), Components of Expertise, AI Magazine, 11:2, pp. 29-49.

Terpstra, P., van Heijst, G., Wielinga, B. and Shadbolt, N. (1993), Knowledge Acquisition Support through Generalised Directive Models, in: J.-M. David, J.-P. Krivine and R. Simmons (eds.), Second Generation Expert Systems, Berlin Heidelberg, Germany: Springer-Verlag, pp. 428-455.

Top, J.L. (1993), Conceptual Modelling of Physical Systems, PhD thesis, Enschede (NL).

Valente, A. (1995), Planning, in: J. Breuker and W. van de Velde (eds.), CommonKADS Library for Expertise Modelling, IOS Press, Amsterdam.

Verdenius, F. (1996), Managing Product Inherent Variance During Treatment, Computers and Electronics in Agriculture, 15, pp. 245-265.

Verdenius, F. (1997), Developing an Embedded Neural Network Application: The Making of the PTSS, in: B. Kappen and S. Gielen (eds.), Neural Networks: Best Practice in Europe, World Scientific, Singapore, pp. 193-197.

Verdenius, F. and van Someren, M.W. (1997), Applications of Inductive Techniques: A Survey in the Netherlands, AI Communications, 10, pp. 3-20.

Verdenius, F., Timmermans, A.J.M. and Schouten, R.E. (1997), Process Models for Neural Network Application in Agriculture, AI Applications in Natural Resources, Agriculture and Environmental Sciences, 11 (3).

Verdenius, F. and Engels, R. (1997), A Process Model for Developing Inductive Applications, Proceedings of Benelearn-97, Tilburg University (NL), pp. 119-12.

Weiss, S.M. and Kulikowski, C.A. (1991), Computer Systems that Learn, Morgan Kaufmann, Palo Alto.

Weiss, S.M. and Indurkhya, N. (1998), Predictive Data Mining, Morgan Kaufmann, San Francisco (CA).

Wirth, N. (1971), Program Development by Stepwise Refinement, Comm. ACM, 14 (4), pp. 221-227.

Wirth, N. (1976), Systematic Programming: An Introduction, Englewood Cliffs, NJ: Prentice Hall.