Hybrid Systems: Two Examples of the Combination of Rule-Based Systems and Neural Nets
March 2001
Pieter Buzing
Neural networks and rule-based systems both have clear advantages as well as disadvantages. Combining
the two could lead to a powerful system that profits from the positive characteristics of each.
Fu (1989) and Towell & Shavlik (1994) each proposed such a hybrid system. This paper compares them
and concludes that in both systems the inductive learning ability of a neural network can contribute
to a knowledge base, and that the semantic foundedness of a knowledge base can be a good starting
point for a neural network.
1 Introduction
In the AI tradition rule-based systems and neural networks are two very different fields of research.
Both approaches have their own merits and their own flaws. These flaws are for the most part
complementary [1, 3]: a neural net holds no semantics, while a knowledge base does contain explicit
knowledge. A knowledge-based system has difficulty dealing with continuous variables; a neural net
can learn continuous probability distributions from examples. Neural networks ignore problem-specific
theory (e.g. features may be context dependent), while knowledge-based systems are designed to deal
with domain-specific reasoning. Knowledge-based systems have difficulty acquiring new knowledge,
while neural nets learn inductively by their nature. So combining the two could be fruitful: the network
could be structured in such a way that knowledge is 'visible' in the architecture, and the learning
mechanism of the neural net could show whether the rules and the data set are in accordance with each other.
In this paper I will discuss two hybrid systems that use a knowledge base to initialize a neural
network. The network is then trained with examples. I shall compare the two systems
based on the following questions: How do the systems react to (initial) erroneous rules? Are the
semantics maintained (extractable) after neural training? The first question indicates whether the neural
net approach adds something to a knowledge-based system. The second addresses the added value of the
rule base structure to a neural network.
Section two gives some technical background on the two classical approaches and the new field of
hybrid systems. In the following three sections the two systems (one by Fu [2] and one by Towell &
Shavlik: KBANN [1]) are compared with regard to the rules-to-network phase (typical of this kind
of hybrid system), the training phase and the fault tolerance (i.e. handling incorrect rules and noisy
data). Section six discusses the context in which these different systems were
developed. Section seven presents the conclusions that can be drawn, focusing on the two questions
mentioned above.
2 Background of the techniques
In this section I will explain the basic principles of both classical rule-based systems and neural
networks. The strengths and weaknesses of each system will be discussed. Section 2.3 gives
arguments for hybridization.
2.1 Rule-based systems
Rule-based systems consist of a rule base and a fact base [3]. The rule base contains general knowledge
(in implicational form) about a certain subject area, while the fact base expresses specific knowledge of
a particular case. The rules are used in the inference process to derive new facts from given ones. There
are two basic reasoning methods: forward chaining and backward chaining. The first method starts
with the known facts and applies rules in order to eventually reach the goal conclusion. The latter
method starts with the goal. It recursively selects rules that would deduce a (sub)goal until the set of
goals is completely resolved by given facts. Of course a bi-directional approach is also possible.
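To make the forward-chaining description concrete, here is a minimal sketch. The function name `forward_chain`, the representation of a rule as a (premises, conclusion) pair and the three example rules are assumptions made for illustration; they are not taken from either of the systems discussed later.

```python
def forward_chain(rules, facts):
    """Apply rules repeatedly until no new fact can be derived.

    rules: list of (premises, conclusion) pairs, where premises is a set of facts.
    facts: iterable with the initially known facts.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires when all its premises are known and it adds a new fact.
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical rule base: A and B give P, C and D give Q, P and Q give Z.
rules = [({"A", "B"}, "P"), ({"C", "D"}, "Q"), ({"P", "Q"}, "Z")]
derived = forward_chain(rules, {"A", "B", "C", "D"})
```

Backward chaining would start from the goal (here Z) and recurse over the rules that conclude it; the sketch covers only the forward direction.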
One way of dealing with uncertainty is the use of certainty factors (CFs). Each rule is given a CF
between -1 and 1. When a rule fires with fully certain premises, its conclusion is assigned the CF of the
rule. When the premise is uncertain (CF<1), the CF of the conclusion is the rule's CF multiplied by
either the minimum of the premise CFs (in case of a conjunction) or the maximum of the premise CFs
(disjunction). E.g.:
Rule1 if A and B then C (CF=0.8)
if CF(A)=0.8 and CF(B)=0.5 then CF(C)=0.8*min(0.8, 0.5)=0.4
Rule2 if A or B then C (CF=0.9)
if CF(A)=0.7 and CF(B)=0.2 then CF(C)=0.9*max(0.7, 0.2)=0.63
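The two worked examples can be reproduced with a few lines of code. This is a sketch of the calculus as described here (rule CF times the minimum or maximum of the premise CFs), not of the full MYCIN calculus; the function name `conclusion_cf` is an assumption, and the sketch only considers non-negative premise CFs.

```python
def conclusion_cf(rule_cf, premise_cfs, connective="and"):
    """CF of a conclusion: rule CF times min (AND) or max (OR) of the premise CFs."""
    if connective == "and":
        strength = min(premise_cfs)
    elif connective == "or":
        strength = max(premise_cfs)
    else:
        raise ValueError("connective must be 'and' or 'or'")
    return rule_cf * strength


cf_rule1 = conclusion_cf(0.8, [0.8, 0.5], "and")  # Rule1: about 0.4
cf_rule2 = conclusion_cf(0.9, [0.7, 0.2], "or")   # Rule2: about 0.63
```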
The strength of a rule-based system is its high abstraction level. Knowledge can be declared in a very
comprehensible manner, making it possible to easily verify the knowledge (rule) base with (human)
domain experts. The system also gives explanations for the given answers in the form of inference
traces.
Typical weaknesses are dealing with incomplete, incorrect and uncertain knowledge, continuous
variables and non-monotonic logic [3]. A complete domain theory may require thousands of (possibly
recursive) rules, which could lead to a very slow system [1]. Also, the system does not "learn" anything
by itself.
2.2 Neural networks
In an artificial neural network a number of neurons are connected with each other (see figure 1). We
distinguish the input layer, the hidden layer(s) and the output layer. Each connection has a certain
weight. Each node propagates a value calculated by an activation function that takes as its input the
net sum of the weighted activations of all connections leading to that node (see figure 2). A bias can
be added by connecting a bias node (that always has 1 as activation value) to the node. The weight of
this connection is called the bias. In the training phase each input pattern is propagated through the
network, after which the error (the sum of the squared differences between the desired output and the
actual output) is calculated. The weights are then adjusted using a back-propagation algorithm. The
adjustment to each weight value is calculated using the derivative of the error function: we want to
minimize the error, so we choose the direction that gives the steepest descent. The learning rate is a
measure of the step size. Large learning rates give good results in the beginning, but often fail to find
the optimum. Too small a learning rate can cause the algorithm to run very slowly or get caught in a
local optimum [6].
Figure 1
A simple neural network architecture.
\[ net = \sum_{i=1}^{n} w_i x_i, \qquad f(net) = (1 + \exp(-net))^{-1} \]
(1) \[ \Delta w_i = \eta \, \delta \, x_i \]
(2) \[ f'(net) = \exp(-net) \, (1 + \exp(-net))^{-2} = f(net) \, (1 - f(net)) \]
(3) \[ \delta = (d - o) \, f'(net) = (d - o) \, o \, (1 - o) \]
2a: node f and its connections. 2b: the sigmoid activation function. 2c: \Delta w_i is the weight
adjustment, \eta is the learning rate, d is the desired activation and o is the actual output.
Figure 2 A sigmoid activation function and some formulas for the weight adjustment.
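The weight adjustment of figure 2 can be tried out on a single unit. The sketch below assumes one output unit with activation f(net) = 1/(1+exp(-net)), a bias node as the last input, and the logical OR as a toy training set; the names `train_step` and `total_error`, the learning rate of 0.5 and the 2000 epochs are all assumptions chosen for illustration.

```python
import math


def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))


def train_step(weights, inputs, desired, eta):
    """One weight update for a single unit; returns (new_weights, output)."""
    net = sum(w * x for w, x in zip(weights, inputs))
    o = sigmoid(net)
    delta = (desired - o) * o * (1.0 - o)  # delta = (d - o) * f'(net)
    new_weights = [w + eta * delta * x for w, x in zip(weights, inputs)]
    return new_weights, o


def total_error(weights, patterns):
    """Sum of squared differences between desired and actual output."""
    return sum(
        (d - sigmoid(sum(w * x for w, x in zip(weights, xs)))) ** 2
        for xs, d in patterns
    )


# Toy data: logical OR; the third input is the bias node (always 1).
patterns = [([0, 0, 1], 0), ([0, 1, 1], 1), ([1, 0, 1], 1), ([1, 1, 1], 1)]
weights = [0.0, 0.0, 0.0]
error_before = total_error(weights, patterns)
for _epoch in range(2000):
    for inputs, desired in patterns:
        weights, _output = train_step(weights, inputs, desired, eta=0.5)
error_after = total_error(weights, patterns)
```

In a multi-layer network the same delta is propagated backwards through the hidden layers; the single unit is only the simplest case.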
A neural net can learn from mere examples. In many domains it is far easier to collect a
representative data set than to construct a (complete) knowledge base. A neural net does not suffer
too much from noisy data and is not biased like human experts might be.
A big problem in neural networks is the choice of architecture: the only way to decide on a certain
architecture is by trial and error. But the main weakness of a neural net is the lack of understandability,
i.e. it is impossible to extract any knowledge from a trained neural network. Also, in complex domains
large data sets are required, which (if available) lead to lengthy training times.
2.3 Hybrid systems
In the last decade researchers started to realize that rule-based systems and neural networks are just two
ends of a whole intelligence spectrum [3]. A combination of these two approaches certainly sounds
appealing, as many of the weaknesses mentioned above can be compensated for. The hybrid approach
considered in this article takes the domain knowledge as a starting point for the network architecture.
This is done by the mapping shown in figure 3.
Knowledge Base              Neural Network
Final Conclusions           Output Units
Attributes                  Input Units
Intermediate Conclusions    Hidden Units
Dependencies                Weighted Connections
Figure 3 The correspondence between a rule base and a neural net
3 From Rules to Network
Both systems start out with a (classical) knowledge base, consisting of a (not necessarily complete) set
of rules. This domain theory has to be translated into a neural network. Each system has its own
algorithm for this.
3.1 Fu: Constructing a conceptual network
The knowledge base is transformed into a conceptualization network in the following way (see also
figure 4).
1 The rules are rewritten in their conjunctive form.
2 Data attributes are mapped into input units.
3 Concepts (intermediate hypotheses) are mapped into hidden nodes.
4 Final hypotheses are mapped into output units.
5 Conjunction nodes that form a bridge between the condition nodes and the consequence node
are added.
6 The rule's CF is mapped into the weight of the corresponding connection between a
conjunction unit and a consequence unit. Connections between the non-conjunctive layers and
the conjunctive layer are set to one.
R1: A ∧ B → P (CF=0.8)
R2: C ∧ D → Q (CF=0.5)
R3: P ∧ Q → Z (CF=1.0)
Figure 4 The six steps in the Fu rules-to-network algorithm
The resulting network is in fact a neural network. The optimal weight vector we are looking for
actually represents the set of certainty factors of the rule set. In order to maintain the conjunctive
interpretation of the connections, Fu added extra layers with nodes that propagate the minimum value of
the incoming activations, in conformity with the certainty factor calculus mentioned in 2.1. The
weights of the connections between the conditions and the conjunction nodes are always set to one,
never to be changed. This is done to simplify the problem caused by the conjunction, which
mathematically (and semantically) is hard to deal with.
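The construction can be illustrated with a small forward pass. The sketch below is not Fu's implementation: the rule set (A and B give P with CF 0.8, C and D give Q with CF 0.5, P and Q give Z with CF 1.0), the dictionary layout and the name `propagate` are assumptions, and the consequence unit simply multiplies the conjunction value by the rule's CF as in the calculus of section 2.1.

```python
# Hypothetical rule base: consequence -> (conditions, certainty factor).
RULES = {
    "P": (("A", "B"), 0.8),
    "Q": (("C", "D"), 0.5),
    "Z": (("P", "Q"), 1.0),
}


def propagate(inputs, rules=RULES, order=("P", "Q", "Z")):
    """Forward pass through conjunction units (min) and weighted consequence units."""
    act = dict(inputs)
    for concept in order:
        conditions, cf = rules[concept]
        # Conjunction unit: incoming weights are fixed at one, output is the minimum.
        conjunction = min(act[c] for c in conditions)
        # Consequence unit: the trainable weight is the rule's certainty factor.
        act[concept] = cf * conjunction
    return act


activations = propagate({"A": 0.8, "B": 0.5, "C": 1.0, "D": 0.9})
```

Training would adjust only the certainty factors stored in `RULES`; the weights into the conjunction units stay at one.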
Fu tested his approach on the MYCIN knowledge base [4]. The MYCIN program could diagnose
infectious diseases and recommend therapy for the bacterial infection. Nowadays the system is
considered overestimated and primitive, but in the eighties it was a breakthrough in AI. One of its
renowned qualities is the fact that it can handle uncertainty in a comprehensive manner: certainty
factors.
3.2 KBANN: Rules-to-Network algorithm
The KBANN system uses a seven-step algorithm to construct a network.
1 Rewriting: the rules are transformed into Horn clauses. Rules with the same consequence are
rewritten in the following form: {C ∧ D → B, E ∧ F ∧ G → B} is transformed to
{C ∧ D → B', E ∧ F ∧ G → B'', B' ∨ B'' → B}.
2 Mapping: the rules are then organized into a neural net. The weight values are chosen in such a
way that the activation emulates an AND function. Towell & Shavlik empirically found w=4 to be
a good initial weight value (w=-4 for negative antecedents). The bias of the unit is set to
(P-1/2)*w, where P is the number of positive antecedents.
3 Numbering: each node in the network is assigned a number according to its 'level' (defined as the
longest path to an input unit).
4 Adding hidden units: new units are placed in the network to facilitate learning of new features that
were not yet expressed in the initial knowledge rules. This step is optional, because the given rules
often have enough expressive power to make the learning of new rules possible.
5 Adding input units: the domain expert can identify certain features that (by chance or ignorance)
were not caught in a rule.
6 Adding links: in this step links with weight zero are added to the network. Each node at level n-1 is
connected to all nodes with level n.
7 Perturbing: in order to avoid problems caused by symmetry (all the connections leading to one
node are initialized with the same weight value) a small random value is added to each weight in
the network.
Figure 5 Left is the situation after step 1. Right is the final network, where the dark nodes are the ones
that were added. Not all nodes between two adjacent layers are connected in the figure, but they should
be. Also the weight values are not shown.
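The mapping step can be checked numerically. The sketch below builds one AND unit with the initial weight w=4 and bias (P-1/2)*w from the algorithm above. It assumes that the bias acts as a threshold that is subtracted from the weighted sum before a sigmoid activation is applied; the helper names `and_unit` and `activation` are hypothetical and the random perturbation of the last step is left out.

```python
import math
from itertools import product

W = 4.0  # initial weight value reported by Towell & Shavlik


def and_unit(num_positive, num_negative=0, w=W):
    """Weights and bias of a unit that emulates AND over its antecedents."""
    weights = [w] * num_positive + [-w] * num_negative
    bias = (num_positive - 0.5) * w
    return weights, bias


def activation(weights, bias, inputs):
    # Assumption: the bias is a threshold subtracted from the net input.
    net = sum(wi * xi for wi, xi in zip(weights, inputs)) - bias
    return 1.0 / (1.0 + math.exp(-net))


# Unit for a rule with three positive antecedents, e.g. E and F and G.
weights, bias = and_unit(3)
table = {bits: activation(weights, bias, bits) for bits in product((0, 1), repeat=3)}
```

Under these assumptions the unit's activation exceeds 0.5 only when all three antecedents are true, which is the AND behavior the weights are meant to approximate.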
As shown in step two, KBANN chooses weight values that give approximately the same behavior as
the original AND function. But how do we interpret a weight value in a conjunction? The meaning of
the rules is now lost.
The implementation tested by Towell & Shavlik considers a DNA domain, where uncertainty is not
really an issue. The only problem is that rules can turn out to be incorrect and that some rules may not
be known yet. Each nucleotide of a DNA string (size: 57 nucleotides) is connected to an input unit. The
goal is to decide whether the input is a so-called promoter or not. A promoter is a short DNA sequence
that precedes the beginning of a gene.
3.3 Differences
One major difference between the two systems is that Fu proposes to model uncertainty and that
KBANN does not. KBANN rules can be given a certainty value, but in Towell & Shavlik's view this is
not necessary because CFs have not proven decisive: during the training process (discussed in the next
section) the network will find the right weight even if the initial certainty is wrong. Even in the
MYCIN system you could change a number of CFs and still get the correct conclusions. The strength
of an artificial neural network (as opposed to humans) lies in its ability to find the correct weight
values. This is achieved by the back-propagation algorithm, which is discussed in the next section.
Another difference is that the Fu system has 'conjunction layers' to implement the conjunction function.
Fu prefers to hold on to the semantics, but this means that he has to give up the smooth sigmoid
activation function in those layers: the AND function (a minimum) is not differentiable everywhere.
The implications of this will be made clear in the next section on training. KBANN's substitute AND
function has no meaning that we can interpret in an (un)certainty context.
4 Learning Algorithms
Both systems use back-propagation, but Fu has the problem that the evaluation function is not
differentiable everywhere (because of the conjunctions), so his system has to use a second learning
algorithm.
4.1 Fu: Back-propagation and Hill-climbing
The constructed network consists of conjunction and non-conjunction layers. The latter have a
differentiable function, so a conventional back-propagation algorithm can be used for these units. In the
conjunction layers, however, this is not possible because AND is not a differentiable function. The
solution given by Fu is a hill-climbing mechanism to find the best path along which to propagate the
error. It follows a one-step look-ahead strategy: the error in node Z (see figure 4, step 6) could be
caused by unit P or Q. How much would the performance of the system improve if we blame unit P?
This is tested by adjusting the weight of the connection that leads to P. After comparing the two
blaming possibilities we choose to adjust the weight of the unit that would yield the greatest gain in
performance.
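The look-ahead idea can be sketched as follows. This is a simplified illustration and not Fu's algorithm: it assumes a tiny network in which P and Q are each fed by one conjunction and Z takes the minimum of P and Q, it tries a fixed step in both directions instead of a gradient-based update, and the names `output_z`, `squared_error` and `blame_step` as well as the single training example are hypothetical.

```python
def output_z(weights, x):
    """Output of Z for a network where P and Q feed a conjunction leading to Z."""
    p = weights["P"] * min(x["A"], x["B"])
    q = weights["Q"] * min(x["C"], x["D"])
    return weights["Z"] * min(p, q)


def squared_error(weights, examples):
    return sum((desired - output_z(weights, x)) ** 2 for x, desired in examples)


def blame_step(weights, examples, step=0.05):
    """Try adjusting the weight into P or into Q and keep the better alternative."""
    best_weights, best_error = weights, squared_error(weights, examples)
    for unit in ("P", "Q"):
        for change in (step, -step):
            trial = dict(weights)
            trial[unit] = weights[unit] + change
            error = squared_error(trial, examples)
            if error < best_error:
                best_weights, best_error = trial, error
    return best_weights, best_error


weights = {"P": 0.8, "Q": 0.5, "Z": 1.0}
examples = [({"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}, 0.7)]  # made-up training case
old_error = squared_error(weights, examples)
new_weights, new_error = blame_step(weights, examples)
```

In this example changing the weight into P has no effect on the output, because the minimum is determined by Q, so the look-ahead blames Q and raises its weight.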
4.2 KBANN: Back-propagation
In the KBANN network all functions are differentiable, so back-propagation is applied to all layers.
Back-propagation was explained in section 2.2.
Towell and Shavlik compared their KBANN with a standard neural network. Both networks used the
same functions, the same back-propagation algorithm and the same training and test sets. The
researchers varied the size of the training set. Results clearly showed that when the training set was
small KBANN performed better than the standard neural net. When the training set increased, the
difference in performance between the two networks decreased to zero. This indicates that the
initialization of KBANN with domain knowledge speeds up the learning process.
4.3 Differences
One can argue that the hill-climbing heuristic is not very sophisticated, but the way in which KBANN
transforms the conjunctions into conventional neural network nodes is not very graceful either.
KBANN adds a lot of connections, which have nothing to do with the domain theory. Fu keeps his
initial connections, but after a training session it can be decided to add some connections and train the
network again, expecting better results. The advantage of the Fu algorithm is that it keeps in touch
with the original rules, while KBANN trails off into pure mathematical functions, ignoring the meaning
of the knowledge. This makes it very hard to extract knowledge from the network.
5 Fault Tolerance
One of the problems in rule-based systems is the inability to repair an incorrect knowledge base. Neural
nets on the other hand can still perform pretty well when confronted with corrupt data sets. How do
both hybrid systems cope with an erroneous domain theory?
5.1 Fu: Removing the corrupt rule
The Fu system is able to identify and remove incorrect rules. A rule is considered incorrect if the
change in its weight during training is sufficiently large. The difficulty is that a threshold for the weight
shift has to be chosen: if it is too low, correct rules with a minor weight adjustment could be falsely
accused, but if the threshold is too high incorrect rules might not be identified as such. When a rule has
been removed, back-propagation is resumed until the network is stable and the next weakest rule is
removed.
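The threshold test can be written down in a few lines. The sketch below only shows the selection of suspect rules from the weight shift; the function name `suspect_rules`, the rule names and all CF values are made up for illustration, and the retraining loop after each removal is omitted.

```python
def suspect_rules(initial_cfs, trained_cfs, threshold):
    """Return the rules whose weight moved by more than threshold, largest shift first."""
    shifts = {name: abs(trained_cfs[name] - initial_cfs[name]) for name in initial_cfs}
    flagged = [name for name, shift in shifts.items() if shift > threshold]
    return sorted(flagged, key=lambda name: shifts[name], reverse=True)


initial = {"R1": 0.8, "R2": 0.5, "R3": 1.0}    # CFs given by the expert (made up)
trained = {"R1": 0.75, "R2": -0.1, "R3": 0.6}  # CFs after training (made up)
flagged = suspect_rules(initial, trained, threshold=0.3)
```

With a threshold of 0.3 both R2 and R3 are flagged; raising it to 0.5 leaves only R2, which shows how the choice of threshold decides which rules are accused.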
Fu conducted ten experiments. In each of them six out of fifty rules were replaced by
incorrect ones. The system identified these malicious rules as incorrect in all experiments.
Another interesting issue is the deduction of new rules: the system could propose rules (i.e. correlations
between attributes) that the human experts did not think of. Fu describes two procedures to create new
rules, but he did not implement them. The first one is to add new nodes to the network. But this would
require a large number of new nodes and connections, thus demanding a lot of computing power and
losing the so carefully preserved semantics. Secondly, he argues that if the desired output is always
higher than the actual output, then the rule has to be generalized (i.e. a premise is deleted), and that if
the desired output is lower than the actual output, then the rule should be specialized (i.e. a premise is
added).
5.2 KBANN: Handling incorrect rules
This system does not add or delete rules in the way that Fu described. The addition of many
connections causes the number of rules (remember that all the nodes between two layers are connected)
to increase dramatically. Every connection with a weight value other than zero could indicate a rule.
Even when you find a big change in a certain weight value it is hard to construct or delete a rule,
because a node can no longer be considered a conjunction.
But the system can handle incorrect initial rules very convincingly. Tests made clear that 10% of the
rules could be removed or added and the KBANN system would still outperform a standard neural
network. Also a small 'adjustment' in the rules would not lead to a drastic loss in performance: 30% of
incorrect rules (adding or removing a condition) would still leave KBANN superior to a standard
neural network.
5.3 Differences
The different nature of both systems becomes very clear now. The Fu system maintains the meaning of
the domain theory in its network, thereby making it possible to identify incorrect rules, (optionally)
leaving it up to the expert to judge what caused the wrong behavior: the rule or the data set. The
KBANN system solves the incorrect domain theory problem in an entirely different way. It says nothing
about the correctness of a rule, it just alters the weights in such a way that the output improves (like a
real neural net should). The test results show no great difference in the way both systems cope with
incorrect domain theory.
6 Discussion
The first thing that has to be said is that Fu and Towell & Shavlik conducted their research in different
times. Li-Min Fu published his findings in 1989, when explicit knowledge rules and certainty factors
were still considered the basis for artificial intelligence. In my opinion he overestimates the ability of
humans to correctly formulate domain theory. Towell & Shavlik (1994) have given up on the 'sacred
rules idea'; they use the domain knowledge to initialize the neural net, after which the computing
power should lead to good system behavior. Fu is more interested in the interpretation of the network.
Though both projects were tested in the biological field, there are some apparent differences between
the domains. First of all, the DNA knowledge base used by Towell & Shavlik does not contain
uncertain rules in the way that the MYCIN KB (infectious diseases) does.
CFs have not proven decisive. In the MYCIN rule base one can alter the CF of a rule and still acquire the
same diagnosis. They are not mathematically founded either. There are other logical techniques
available that can express uncertainty much better, like probability theory and fuzzy logic.
7 Conclusion
How do the systems react to (initial) erroneous rules?
Both systems can handle incorrect rules. The difference is that Fu points out exactly which rules cause
a discrepancy with the provided data set. This adds an extra dimension to the system, because the
expert can verify his knowledge rules. The KBANN system just lets the weight of a certain connection
drop to zero when there is a low correlation between two concepts. The network performs well, but it
is hard to tell which initial rules were wrong. Whether this is a problem depends on the goal of the
implementation: if you want to verify your rule set with actual data the Fu system would be more
appropriate. If your goal is to make a network that simply does the job, the KBANN black box is
suitable.
Are the semantics maintained (extractable) after neural training?
The Fu system is in fact a conceptual network. The adjusted knowledge rules can easily be extracted:
the nodes that lead to a conjunction node are the conditions of a rule, the node that the conjunction unit
is pointing to is the consequent and the weight of this last connection is the certainty factor of the rule.
The KBANN system does not have that property. Too many connections are added to the
network, which would mean that extracted rules would have many (all) attributes as a premise. This
has no practical value, though Towell & Shavlik are working on that [7].
8 References
[1] Geoffrey G. Towell, Jude W. Shavlik 1994. Knowledge-Based Artificial Neural Networks.
Artificial Intelligence, Vol. 70, 1994.
[2] Li-Min Fu 1989. Integration of Neural Heuristics into Knowledge-based Inference.
Connection Science, Vol. 1, No. 3, 1989.
[3] W. Ertel, C. Goller, M. Schramm 1995. Integrating Rule Based Reasoning and Neural Networks.
[4] B.G. Buchanan, E.H. Shortliffe 1984. Rule-Based Expert Systems. Reading, MA: Addison-Wesley.
[5] K. Mehrotra, C.K. Mohan, S. Ranka 1997. Elements of Artificial Neural Networks. MIT Press.
[6] W.S. Sarle 1994. Neural Network Implementation in SAS Software, Proceedings of the
Nineteenth Annual SAS Users Group International Conference.
[7] G.G. Towell 1991. Symbolic Knowledge and Neural Networks: Insertion, Refinement, and
Extraction. PhD thesis, CS Department, University of Wisconsin, Madison, WI.