Basics of Knowledge Engineering

MICROSOFT BAYESIAN NETWORKS
Basics of Knowledge Engineering
By John Locke
Kindred Communications Troubleshooter Team
Microsoft Support Technology

December 1999




Table of Contents
Introduction
  What is a Bayesian troubleshooter?
  Why use Bayesian troubleshooters?
  What is a knowledge engineer?
  Using this tutorial
Authoring components
  Types of information
  Evidence
  Symptoms
  Causes
  Resolutions
  Problems
  Content template
Modeling concepts
  Model vs. session
  The building blocks
  Nodes
  Problem nodes
  Informational nodes
  Fixable/observable nodes
  Fixable/unobservable nodes
  Unfixable nodes
  Other
  Relationships
  States
  Pseudo nodes
  Costs and probabilities
  Calculating costs
  Time
  Difficulty
  Expense
  Risk
  Importance
  Calculating cost
  Types of Costs
  Cost to Observe
  Cost to fix
  Cost of Service
  Probabilities
  Prior Probability
  Link Probability
Basic modeling
  Matching Information to Nodes
  Entering Costs
  Prior Probabilities
  Link Probabilities
  Summary
  Standard assessments
  Building an asymmetric assessment
Intermediate modeling
  Grouping causes
  Cause as Other node
  Case 1
  Case 2
  Case 3
  Case 4
  Group, or not?
  Cause as Informational Node
Advanced modeling
  Other uses of Other nodes
  Fine-tuning shades of probabilities
  Unwanted informational nodes
  Other nodes as escape hatch
  Other node as symptom
  Multiple Intermediate nodes
  Multiple State Tests
  Using Unfixable nodes
  Using the “Impossible” state
Conclusion
Glossary
Appendix


B A S I C S O F K N O W L E D G E E N G I N E E R I N G
Chapter 1
Introduction
Who is this Bayes guy anyway? Why Bayesian troubleshooters? Are you sure?
In the middle of the 1700s, a minister in England named Thomas Bayes
developed a central theorem of probability science. Known as Bayes’ Rule, this
equation is meant to predict outcomes, based upon a series of known prior
probabilities.
In the middle of the 1990s, Microsoft hired a team of the brightest and best Bayesian
mathematicians to develop a way to apply Bayes’ Rule to computer software. Among
the results are the troubleshooting wizards in Windows 98 and the troubleshooting
tools on the Web site http://support.microsoft.com, which were created using the
Microsoft Bayesian Network (MSBN) tool.
I now send you an essay which I have found among the papers of our
deceased friend Mr Bayes, and which, in my opinion, has great merit... In
an introduction which he has writ to this Essay, he says, that his design
at first in thinking on the subject of it was, to find out a method by
which we might judge concerning the probability that an event has to
happen, in given circumstances, upon supposition that we know nothing
concerning it but that, under the same circumstances, it has happened a
certain number of times, and failed a certain other number of times. –
Richard Price, introducing “Essay towards solving a problem in the doctrine of
chances” to the Royal Society of London in 1764.
While Bayes’ Rule has been around a lot longer than computers, applying it to
computers is a radical new concept, and creating MSBN-based troubleshooters is no
easy task. It requires logic and patience, mixed with a dose of head banging and
Tylenol. The simplest troubleshooter takes several times as long to create as a
comparable HTML-based troubleshooter, and learning to create anything more
complex can take months of training.
The goal of this tutorial is to provide beginning knowledge engineers with an
understanding of how to create Bayesian troubleshooters, and to serve as a reference
for all knowledge engineers. I have spent two years learning and using the Microsoft
Bayesian Network (MSBN) tool as a member of the Kindred Communications
troubleshooter (KCTS) team creating troubleshooters for Microsoft products, and this
tutorial is based on what I’ve learned.
What is a Bayesian troubleshooter?
A Bayesian troubleshooter is a type of expert system based upon a belief network
constructed with a series of causes, symptoms, and resolutions, and designed to
troubleshoot problems. A belief network is a set of nodes or variables representing
system components and their states. Along with the nodes, there are two other
important items: links between the nodes indicating influence, and “prior
probabilities,” indicating the strengths of these influences.
A belief network represents a diagnostic problem in a way that allows formal
mathematical methods based on Bayes’ Rule to be used to compute the “posterior
probabilities” (that is, probabilities applicable to the current case) from the “prior
probabilities” (that is, probabilities generated by the author and knowledge engineer).
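As a minimal sketch of that computation, assume a single problem with a handful of mutually exclusive causes. The cause names and all of the numbers below are made up for illustration; they do not come from any Microsoft model, and a real belief network handles many interacting nodes rather than one flat list.

```python
def posteriors(priors, likelihoods):
    """Apply Bayes' Rule: P(cause | problem) for mutually exclusive causes.

    priors[c]      = P(c), assessed before any evidence is seen
    likelihoods[c] = P(problem observed | c)
    """
    # Joint probability of each cause together with the observed problem.
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    # P(problem) is the sum over all causes (law of total probability).
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Hypothetical numbers for a "my printer doesn't print" problem.
priors = {"faulty cable": 0.02, "printer offline": 0.08, "no fault": 0.90}
likelihoods = {"faulty cable": 0.9, "printer offline": 1.0, "no fault": 0.0}

result = posteriors(priors, likelihoods)
for cause, p in result.items():
    print(f"{cause}: {p:.3f}")
```

Once the problem is observed, "no fault" drops to zero and the remaining probability is shared between the two real causes in proportion to prior times likelihood.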
In troubleshooting models, additional information specified by the authors gives the
costs of testing and repairing elements of the system. These costs are used to find the
optimal (cheapest and most likely) path to problem resolution.
Microsoft uses Bayesian troubleshooters to diagnose problems with a variety of
software products. The troubleshooters appear in several Microsoft products,
including Windows 98 and Windows 2000, as well as on the Microsoft support site
(http://support.microsoft.com).
Why use Bayesian troubleshooters?
Among their advantages, Bayesian troubleshooters allow for uncertainty. Rather than
using a logical if-then, tree-based structure, Bayesian troubleshooters use probabilities
to determine in what order to present recommendations. When authoring a
troubleshooter, you don’t have to worry about end users having to do something they
don’t want to do—they can skip a question and return to it later. You, as the author,
don’t have to deal with which steps they have skipped, or figure out what to
recommend next. Also, you can easily extend existing troubleshooter models, and if
you have a consistent way of assigning costs and probabilities, you won’t have to go
back and update links.
Among their drawbacks, Bayesian networks require you to make assumptions about
the frequency of events, using numbers that you don’t have. There are two sets of
probabilities attached to a cause-and-effect pairing: one represents the probability of
the cause happening, out of the entire universe, and the other the probability that it
causes a given problem. The numbers you get for the latter you can gather based upon
support calls. The numbers for the former are based on a WAG.¹ Future generations
of troubleshooter technology may update probabilities of troubleshooter models
automatically, based upon historical data of what evidence users have given.
What is a knowledge engineer?
I borrow the term “knowledge engineer” from the academic use of the term for
designers of expert systems. An effective knowledge engineer is part writer, part editor,
part mathematician, and part tester.
Knowledge engineers analyze and chunk troubleshooting information, and turn it into
a cohesive troubleshooter model. For Microsoft troubleshooters, most of the original
information either comes from writers in the User Ed/User Assistance groups, or
from support engineers in Product Support. The knowledge engineers on the KCTS
team then organize the information into an effective structure, build the models, and
send them back to the original writers to verify accuracy. When the knowledge
engineers are finished, they hand the troubleshooter off to editors, production, and
localization.
For this tutorial (and other troubleshooter documentation), I refer to the initial
content providers as the writers, those who organize the information into discrete
symptoms and resolutions as authors, and those who construct the belief networks
using the chunked information as knowledge engineers. Right now, knowledge
engineers are doing most of the authoring. With an effective template, writers should
ideally perform this function.
Using this tutorial
This tutorial covers authoring and knowledge engineering. There are many
enhancements and additional features for troubleshooters, including sniffing,
expandable text, shortcuts to executables, and popups. Other documents describe
these features. The troubleshooter intranet site has an extensive style section that
describes what text should appear on each type of page, and gives guidelines for
wording. Other documents describe how to build, deploy, and test troubleshooter
projects. This document focuses strictly on organizing information and modeling.
At the time of writing, we are using the version of the tool called the Microsoft
Troubleshooter Editor (MSTS). Its successor, Mercury, will add several project
management features and change the file format and outputs, but will not affect how
knowledge engineers organize and model troubleshooters. The original version was
called the Microsoft Bayesian Network (MSBN) tool, and this document uses that
name generically for all versions of the tool.


¹ Wild-A** Guess.
Whether or not knowledge engineers author troubleshooters, they need to be
thoroughly familiar with how the information should be organized. Bayesian
troubleshooters are very effective for a specific task: diagnosing a complex problem
with many possible causes, with only a vague symptom as a starting point. However,
if you, as the user, do not have a problem with a specific cause, Bayesian
troubleshooters tend to be less effective than other help systems that provide
extensive indexing and searching capabilities. Knowledge engineers must be able to
identify effective and ineffective troubleshooting material up front, so we will start
with information organization and authoring concepts.
Before modeling, the knowledge engineer must understand all of the basic
components of Bayesian networks. In the next section, I will explain the concepts you
need to understand before going further.
Then we will explore techniques of modeling troubleshooters. We will progress from
easy modeling to more advanced techniques. You should not attempt to build any
troubleshooters until you are familiar with all of the basic concepts and have begun
reading the Basic Modeling section.
The last chapter will provide you with alternative concepts to use to predict
troubleshooter behavior at run time.
I provide a glossary of terms at the end.

Chapter 2
Authoring components
Whose fault is it, and how to think like a troubleshooter
Compared with organizing a Word document, or even a Web site, organizing
information for a troubleshooter is less like writing a computer program and more
like teaching a child. The troubleshooter always wants to know “why.” “Why shouldn’t
I ask the user to reformat the hard drive first?” “Why should I ask the user to see if
their printer is online first?”
To date, most documentation for troubleshooting computer problems has used a
linear, step-by-step process. Most beginning support engineers are instructed to follow
such a step-by-step troubleshooting order. But the most effective troubleshooters,
human or otherwise, will take a few moments to gather evidence first, diagnose the
problem, and only then present the solution.
To author an effective troubleshooter, writers need to get away from any concept of
“first try this, then try this.” The better approach is to train the troubleshooter with
“This is really easy. This always seems to be the problem.” In other words, teach the
troubleshooter the reasons to check for certain things over other things, and then let
the troubleshooter decide what to present when.
Types of information
The first thing to do is break all the bits of information down into appropriate types.
Support engineers are accustomed to dealing with three: Symptoms, Causes, and
Resolutions. To this list, for the sake of organizing and dividing information, we will
add a fourth type: Problems. The other thing we need to know for each symptom and
resolution is the set of possible behaviors, or Evidence.
Bayesian troubleshooters work by gathering evidence. An important concept in
troubleshooting is that the primary goal is to diagnose the problem first. Providing a
resolution to the problem is the final outcome, but the diagnosis comes first. We, as
humans, gather evidence to diagnose problems. Bayesian troubleshooters do the same.
To build an effective troubleshooter model, we must give it the evidence we’re looking
for.
Evidence
Evidence is observable phenomena. Sometimes, you gather evidence by doing
something else. Other times, you need to actively seek evidence. Evidence ranges from
very broad to very specific. Evidence may or may not indicate a problem or cause.
Examples:
• I’m getting a 404 error when I try to go to my home page.
• I’m getting a “Fatal Exception Error.”
• I’m using Windows NT 4 Workstation.
• I have Adobe Photoshop installed.
• The registry key “HKLM/Software/Microsoft/Office/9.0/Registration/ProductID” exists.
• I can’t print.
• I can print from a command prompt, but not from my application.
When authoring a troubleshooter, try to define as much evidence as possible, ranging
from specific to general.
Symptoms
Symptoms are groups of parallel pieces of evidence. A symptom often contains two
pieces of evidence: a test succeeds, or it fails. It can contain more: I’m using
Windows NT; I’m using Windows 98; I’m using Windows 3.1; I’m using Linux. This is
a set of mutually exclusive pieces of evidence; only one can be the case at a time.
We will explore how to arrange evidence into symptoms later.
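One way to picture a symptom is as a question with a fixed set of mutually exclusive answers. The sketch below is only an illustration of that idea; the `Symptom` class and its fields are hypothetical names, not part of the MSBN tool.

```python
from dataclasses import dataclass


@dataclass
class Symptom:
    """A question whose answers are mutually exclusive pieces of evidence."""
    question: str
    evidence: tuple          # the parallel, mutually exclusive answers
    observed: str = None     # at most one piece of evidence holds at a time

    def observe(self, answer):
        if answer not in self.evidence:
            raise ValueError(f"{answer!r} is not evidence for this symptom")
        # Recording a new answer replaces the old one: only one can be true.
        self.observed = answer


os_symptom = Symptom(
    question="Which operating system are you using?",
    evidence=("Windows NT", "Windows 98", "Windows 3.1", "Linux"),
)
os_symptom.observe("Windows 98")
print(os_symptom.observed)
```

Storing a single observed value, rather than a flag per answer, makes it impossible to record two contradictory pieces of evidence for the same symptom.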
Causes
Causes are the underlying reasons the user is having a problem. The user does not
need to know the cause to fix it. The author should be able to define the cause, so
that the knowledge engineer can model it effectively.
Examples:
• The printer cable is faulty.
• The user doesn’t understand the function.
• The program isn’t installed correctly.
• A system file is corrupted.
• The document is corrupted.
• There isn’t enough RAM.
• The system drive is full.
• A file is locked.
• The server is offline.
• The domain is not registered correctly.
• The page is missing, or misspelled.
Causes are usually specific. However, advanced troubleshooters often group causes
together. These larger groupings of causes can represent a more general cause. For
example, a general cause of not being able to print is that the printer is not properly
connected to the computer. More specific causes include a faulty printer cable, an
offline printer, an incompatible printer, an incorrectly set switching device, or a
printer that is turned off.
Resolutions
Resolutions are instructions for fixing a problem. They are usually associated with a
single cause, but not always. Removing and reinstalling Word, for instance, can fix
many causes, such as a corrupted system file, an incorrect registry entry, and a
corrupted Normal.dot template. Likewise, more than one resolution may fix a single
cause—in the preceding example, deleting or renaming the Normal.dot file may also
fix the cause of a corrupted Normal.dot file, because Word will create a new one.
Examples:
• Replace the faulty printer cable with a working one.
• Buy a new computer.
• Take a class about Access.
• Buy more memory.
• Rename the Normal.dot template.
• Restart.
• Make the file writeable.
• Log on with an Administrator account.
Resolutions are specific, but vary in cost and effectiveness.
Problems
Problems are symptoms that are significant enough to make a user go to a
troubleshooter. While Bayesian troubleshooters work when a problem has a single
cause, their strength is in diagnosing more complex problems.
Examples:
• My printer doesn’t print.
• I can’t connect to my ISP.
• My document doesn’t convert correctly.
The goal of each resolution is to solve the user’s problem.
Content template
Obviously, organizing troubleshooting information into the above components is only
part of the picture. The other part is describing how each component relates to the
others.
The KCTS team has experimented with several different templates for writers to use.
Most have been of limited effectiveness. Multiple causes apply to a single problem.
Causes may apply to multiple problems. Multiple pieces of evidence identify causes.
Multiple causes may have common pieces of evidence. Multiple resolutions can
address a single cause, or a single resolution may fix multiple causes. Expressing this in
a Word document is challenging at best.
The KCTS team currently sends an Excel spreadsheet to the writers, which they fill
out with symptoms, causes, and resolutions. This approach, while more effective than
a Word document, still does not adequately describe the relationships of the
information components, and knowledge engineers authoring troubleshooters need to
spend a substantial amount of time corresponding with the writers to build an accurate
model.
I think the best solution is to use a relational database to build an outline. I have
partially developed an Access 2000 database to meet such a need. For now, however,
you must develop and manipulate the outline by hand.
Figure 1: Hierarchy of problems, causes, evidence, and resolutions
Start by listing each problem at the top of the hierarchy. Then, under each problem,
list all of the possible causes. Under each cause, list all of the evidence that identifies a
cause, and all resolutions. If a cause appears under more than one problem, the
evidence and resolutions need to be the same for all appearances of the cause.
Each cause should have an explanation, clarifying what its most common problems
are, and what is the best evidence to observe. Evidence needs to include instructions
for observing that evidence, and a list of other pieces of evidence it combines with to
create a symptom. Resolutions generally include a procedure to follow, or instructions
about where to go next.
Figure 1 shows a hierarchical breakdown of an example set of problems, causes,
evidence, and resolutions. Table 1 is an example of this information entered in an
Excel worksheet organized in this manner.
Problem         | Cause                   | Evidence                        | Resolution
Bike won't stop | Brake Pads worn out     | The brake pads are old          |
Bike won't stop | Brake Pads worn out     | Rubber worn past line           |
Bike won't stop | Brake Pads worn out     |                                 | Replace brake pads
Bike won't stop | Brake Pads worn out     |                                 | True wheel
Bike won't stop | Brakes need adjusting   | Brake levers go to handlebar    |
Bike won't stop | Brakes need adjusting   | Brake pads don't line up to rim |
Bike won't stop | Brakes need adjusting   |                                 | Adjust brakes
Bike won't stop | Brakes not toed in      | Brake pads don't line up to rim |
Bike won't stop | Brakes not toed in      |                                 | Adjust brakes
Bike won't stop | Wheels need truing      | Wheel wobbles when spun         |
Bike won't stop | Wheels need truing      |                                 | True wheel
Brakes squeal   | Brakes need adjusting   | Brake levers go to handlebar    |
Brakes squeal   | Brakes need adjusting   | Brake pads don't line up to rim |
Brakes squeal   | Brakes need adjusting   |                                 | Adjust brakes
Brakes squeal   | Brakes not toed in      | Brake pads don't line up to rim |
Brakes squeal   | Brakes not toed in      |                                 | Adjust brakes
Chain falls off | Bottom Bracket Loose    | Cranks are loose                |
Chain falls off | Bottom Bracket Loose    |                                 | Adjust Bottom Bracket
Chain falls off | Bottom Bracket Loose    |                                 | Overhaul Bottom Bracket
Chain falls off | Chain twisted           | Chain is kinked                 |
Chain falls off | Chain twisted           |                                 | Replace Chain
Chain falls off | Derailleur not adjusted |                                 | Adjust derailleur
Chain falls off | Hub loose               | Wheel wobbles when spun         |
Chain falls off | Hub loose               |                                 | Overhaul hub
Chain falls off | Hub loose               |                                 | Tune hub
Wheels wobble   | Axle bent               | Axle doesn't fit in dropouts    |
Wheels wobble   | Axle bent               |                                 | Replace axle
Wheels wobble   | Hub loose               | Wheel wobbles when spun         |
Wheels wobble   | Hub loose               |                                 | Overhaul hub
Wheels wobble   | Hub loose               |                                 | Tune hub
Wheels wobble   | Wheels need truing      | Wheel wobbles when spun         |
Wheels wobble   | Wheels need truing      |                                 | True wheel

Table 1: Problem hierarchy with sample data.
Each component should have a text description of how frequently it occurs, or how
reliably it influences or fixes the next level up in the hierarchy.
The troubleshooter author should organize the information into an outline in this
manner. The next step is to reverse the hierarchy, to reorganize the information into
the opposite order. This is where the role of the knowledge engineer begins. But
before we can reorganize the information, we must learn the basic concepts of
Bayesian belief networks.
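Until a database exists, even a short script can help manage the outline by hand. The sketch below takes a few rows in the shape of Table 1, builds the problem/cause hierarchy, checks the rule that a cause must carry the same evidence and resolutions wherever it appears, and then builds an evidence-to-cause index. The function names are hypothetical, the rows are only a subset of Table 1, and treating "reversing the hierarchy" as an evidence-to-cause index is one possible reading rather than the tutorial's stated procedure.

```python
from collections import defaultdict

# A few rows in the shape of Table 1: (problem, cause, evidence, resolution).
# Each row carries either a piece of evidence or a resolution.
ROWS = [
    ("Bike won't stop", "Brakes need adjusting", "Brake levers go to handlebar", None),
    ("Bike won't stop", "Brakes need adjusting", None, "Adjust brakes"),
    ("Bike won't stop", "Wheels need truing", "Wheel wobbles when spun", None),
    ("Bike won't stop", "Wheels need truing", None, "True wheel"),
    ("Brakes squeal", "Brakes need adjusting", "Brake levers go to handlebar", None),
    ("Brakes squeal", "Brakes need adjusting", None, "Adjust brakes"),
    ("Wheels wobble", "Hub loose", "Wheel wobbles when spun", None),
    ("Wheels wobble", "Hub loose", None, "Overhaul hub"),
    ("Wheels wobble", "Wheels need truing", "Wheel wobbles when spun", None),
    ("Wheels wobble", "Wheels need truing", None, "True wheel"),
]


def build_outline(rows):
    """Problem -> cause -> {'evidence': set, 'resolutions': set}."""
    outline = defaultdict(
        lambda: defaultdict(lambda: {"evidence": set(), "resolutions": set()})
    )
    for problem, cause, evidence, resolution in rows:
        entry = outline[problem][cause]
        if evidence:
            entry["evidence"].add(evidence)
        if resolution:
            entry["resolutions"].add(resolution)
    return outline


def inconsistent_causes(outline):
    """Causes whose evidence or resolutions differ between problems."""
    seen, bad = {}, set()
    for causes in outline.values():
        for cause, entry in causes.items():
            # The first appearance of a cause becomes the reference copy.
            if seen.setdefault(cause, entry) != entry:
                bad.add(cause)
    return bad


def causes_by_evidence(outline):
    """Reverse the hierarchy: evidence -> set of causes it points to."""
    index = defaultdict(set)
    for causes in outline.values():
        for cause, entry in causes.items():
            for evidence in entry["evidence"]:
                index[evidence].add(cause)
    return index


outline = build_outline(ROWS)
print(inconsistent_causes(outline))
print(sorted(causes_by_evidence(outline)["Wheel wobbles when spun"]))
```

In this subset, the evidence "Wheel wobbles when spun" points to two different causes, which is exactly the kind of shared evidence that is hard to express in a flat document.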
Chapter 3
Modeling concepts
Socio-economic units; chances are, it’s expensive
In this chapter, we will explore several important concepts you must understand
before attempting to build a troubleshooter model. The first is the distinction between
a model and a session. Then we explore the building blocks: nodes, states, and
relationships. Finally, we discuss costs and probabilities.
Model vs. session
When creating a troubleshooter model, you reproduce events from cause to effect.
You draw relationships from the initial underlying causes to the evidence they cause.
You group causes together into related sets of causes. The final effect, downstream of
all causes, is the problem.
When using a troubleshooter to solve your problem, you see the opposite sequence.
When troubleshooting, first you identify the problem, then you observe symptoms,
and finally you try resolutions. The individual sequence you observe while
troubleshooting is called a session. The act of identifying a problem, which starts a
troubleshooting session, is called instantiation.
A troubleshooter model encompasses all combinations of problems, causes and
evidence. A session is the particular sequence for one particular instantiation. A model
is two-dimensional, while a session is linear.
When you model a troubleshooter, you are reconstructing the knowledge of an expert,
mapping information into an expert system. In the process of creating a
troubleshooter model, you must constantly test it by running sessions and comparing
the recommendations to those of a human expert.
If you’ve done your job well, the troubleshooter can be as effective as the experts you
gathered your information from. However, just as two different human experts may
have different approaches to troubleshooting a problem, a Bayesian troubleshooter
may use a different strategy than the original expert.
Figure 2: A model works from cause to problem, while a session progresses from problem to resolution
If the sequence a troubleshooter recommends in a session differs from what your
writer wants, you either need to check the assumptions you’ve made for cost and
probability, or explain to the writer why the troubleshooter has chosen that sequence.
In either case, the text on each individual page should stand on its own, without
relying on other pages for context.
Since testing models using sessions is so integral to the process of modeling, most of
this tutorial explains each concept both in terms of an individual session and the
complete model.
The building blocks
The basic components of troubleshooter models are nodes, states, relationships, and
pseudo nodes. These concepts interrelate. Some of the descriptions will not make
sense until you finish reading this entire chapter.
In this discussion, a page is an individual Web page seen by the end user in a session,
after the troubleshooter has gone through the production process. A program called
Bnts.dll, written by the Microsoft Decision Theory group, executes the
troubleshooter algorithm in an operation called inferencing. The troubleshooter
engine is the program that generates each Web page.
Nodes
A node is an oval in the MSBN tool that represents a system component, an
observable symptom, or a problem. It may also represent a troubleshooter page or an
invisible grouping of information. Troubleshooters use six different types of nodes:
problem, informational, fixable/observable, fixable/unobservable, unfixable, and other.
Fixable/observable, fixable/unobservable, and unfixable are collectively called cause
nodes (not necessarily the same as a cause used in authoring, above).
Problem nodes
A problem node is a node that represents a possible primary symptom experienced by
a user. Instantiating a problem node begins a session. Problem nodes contain
information from problems provided by the authors.
All problems appear on a single page in a finished troubleshooter model, each with a
corresponding radio button. A problem node is bottommost, meaning that it always
has parents and never has children.
Informational nodes
An informational node contains a single symptom. A symptom is a collection of
related, exclusive pieces of evidence. Finished troubleshooters display an informational
node as a single page.
Some symptoms indicate causes or groups of causes. Use informational nodes as a test
in these cases. A test will generally either pass or fail. The state of a test may also
change over time, or after a cause has been repaired. The answers should address
whether the test passed or failed.
Other symptoms may have a variety of results. These symptoms are usually
configuration issues, and do not change without substantial effort on the part of the
user. The answers should list the possible conditions of the symptom, with no
implication of yes or no.
Informational nodes may have parents, children, or both. You can use an
informational node as a test to determine the case of a single piece of evidence, or as a
configurational switch, to specify among several pieces of evidence. An informational
node may also represent a cause with a specific symptom and multiple resolutions.
Fixable/observable nodes
A fixable/observable node represents a cause that has both a symptom and a
resolution. The symptom should specifically identify this cause, and the resolution
should be the most effective one for the cause.
This is the most common node type we use. Any time you can identify one
definitive symptom for a cause, and one definitive resolution, use a fixable/observable
node.
Older versions of troubleshooters display each fixable/observable node as a single
page. As of this writing, Microsoft is developing new versions of the troubleshooter
executables that will display a fixable/observable node as two pages: a symptom page
and a resolution page. If the user observes the symptom as abnormal, the
troubleshooter will display the resolution page. If the user observes the resolution as
abnormal, the troubleshooter will end at the Bye pseudo node.
All cause nodes are topmost, meaning they always have children and never have
parents.
Fixable/unobservable nodes
A fixable/unobservable node contains only a resolution. Generally you use this type of
node if the only way to detect a cause is to resolve it, or if there are multiple
resolutions for a single cause.
In the latter case, list both resolutions on a single node if they are roughly equivalent,
or if they apply to different configurations that you have not already detected. List the
resolutions on different nodes if one of the resolutions does not always work.
The troubleshooter displays each fixable/unobservable node as a single page. If the
user observes the node as abnormal, the troubleshooter will display the Bye pseudo
node.
Unfixable nodes
Unfixable nodes are extremely rare. There is almost always a way to fix a cause, even
though it may be expensive. Even a defective motherboard can be fixed by buying a
new one or buying a new computer. An uneducated user can be educated. There is
almost always a fix.
Since cause nodes are tied to resolutions rather than to causes, when you put an
unfixable node into a network, it will not appear to the end user. There is no text
attached to an unfixable node.
There are specific conditions where using an unfixable node can solve a modeling
problem. These will be discussed later.
Other
Other nodes are used for grouping causes together, and for assessing configurations.
They have no text, and their states are never observed. Using other nodes strategically
can substantially simplify troubleshooter models, and in some cases is the only
practical way to construct a particular situation.
Relationships
Now that we have looked at the types of nodes used in Bayesian troubleshooters, let’s
take a look at relationships.
Causes cause symptoms, and eventually problems. In Bayesian networks, we are
modeling the overall situation, rather than building hard-coded links from problems to
causes. Problems are the last thing in the list, not the starting point.
Parent nodes are nodes that have children. They are the root causes and
configurations. Child nodes are nodes that have parents. Problems and tests are child
nodes. When you use the MSBN tool, the arrows point from parents to children.
These arrows are called arcs, or edges.
Parent nodes exercise influence over child nodes. Other child nodes of the same
parents can also influence recommendations.
Possible relationships of node types

Topmost parent                | Parent and child          | Bottommost child
Fixable/Observable            | Other                     | Problem
Fixable/Unobservable          | Informational (very rare) | Informational (Test)
Unfixable                     |                           | Other (very rare)
Informational (Configuration) |                           |
Other (very rare)             |                           |
Table 2: This table describes what types of nodes can be parents, children, or both.
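The constraints in Table 2 can be checked mechanically once a model is written down as a list of arcs. The sketch below is an illustration only: the `ALLOWED` mapping reflects one reading of Table 2 (with the rare cases included), and the function names, the position labels, and the "Brake trouble" grouping node are hypothetical.

```python
# Allowed positions per node type, following one reading of Table 2.
ALLOWED = {
    "fixable/observable": {"topmost"},
    "fixable/unobservable": {"topmost"},
    "unfixable": {"topmost"},
    "informational": {"topmost", "middle", "bottommost"},
    "other": {"topmost", "middle", "bottommost"},
    "problem": {"bottommost"},
}


def position(node, arcs):
    """Classify a node by whether it has parents and/or children."""
    has_parents = any(child == node for _, child in arcs)
    has_children = any(parent == node for parent, _ in arcs)
    if has_parents and has_children:
        return "middle"
    if has_children:
        return "topmost"
    if has_parents:
        return "bottommost"
    return "isolated"


def check(node_types, arcs):
    """Return the nodes whose position is not allowed for their type."""
    return [n for n, t in node_types.items() if position(n, arcs) not in ALLOWED[t]]


node_types = {
    "Brakes need adjusting": "fixable/observable",
    "Brake trouble": "other",          # hypothetical grouping node
    "Bike won't stop": "problem",
}
# Arcs point from parent to child, as in the MSBN tool.
arcs = [
    ("Brakes need adjusting", "Brake trouble"),
    ("Brake trouble", "Bike won't stop"),
]
print(check(node_types, arcs))  # prints []
```

An arc drawn from a problem node back to a cause would make both nodes "middle" nodes, and `check` would report them.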
States
State refers to the condition of a particular node. The set of all states of a node
describes the range of its possible behavior. All nodes have at least two states.
For most nodes, the two states you assess during modeling are normal and abnormal.
The normal state represents the situation in which a system component is known to
be functioning properly. If a given node represents a cause, for example, the normal
state is the condition when that cause is not present. The abnormal state, on the other
hand, indicates that the cause is (or could be) the source of the problem.
There can be many abnormal states for an individual node, but only one normal
state. The normal state always occurs first in the list of states, and internally is indexed
as zero. Abnormal states are numbered sequentially beginning with one.
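The indexing rule is easy to state in code. This is a small sketch with a hypothetical `NodeStates` class and made-up state names; it shows only the ordering convention described above, not how the MSBN tool stores states internally.

```python
class NodeStates:
    """State list for a node: one normal state first, then abnormal states."""

    def __init__(self, normal, abnormal):
        if not abnormal:
            raise ValueError("a node needs at least two states")
        # Index 0 is always the normal state; abnormal states start at 1.
        self.names = [normal, *abnormal]

    def index(self, name):
        return self.names.index(name)

    def is_normal(self, name):
        return self.index(name) == 0


test_print = NodeStates("Test page prints", ["Nothing prints", "Garbled output"])
print(test_print.index("Test page prints"))  # prints 0
print(test_print.index("Garbled output"))    # prints 2
```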
Giving a test one normal state and multiple abnormal states has an implication
when you use the test to eliminate a group of causes. Using multiple abnormal states,
you can influence the order of different parent cause nodes. But all parents will appear
eventually under any abnormal state of a causally independent test node. Only
observing the normal state will prevent parent cause nodes from appearing in a list of
recommendations.
Some components have no notion of normal or abnormal. Configurations,
particularly, have complete sets of evidence, none of which could be considered
abnormal. Whether you’re using a bike that has coaster brakes or side-pull brakes has
no relation to normal and abnormal; if using coaster brakes is normal and all other
brakes are abnormal, that is not a value judgment we make in a troubleshooter. The
question certainly affects which causes may cause our problem, but we cannot say that
having coaster brakes is why our bike won’t stop. In these cases, states are numbered
beginning with zero. (We also use a different type of assessment, but we’ll delve into
that later).
During a session, finished troubleshooters actually use two additional states:
unobserved and unknown. Initially, all nodes in the network are unobserved. The
model contains the definitions of all possible sessions, but cannot know exactly what
state any component is in until it is told.
When the user instantiates a problem, the problem node is set to the abnormal state.
The troubleshooter algorithm then calculates all of the probabilities and costs that
influence that problem node, and generates a list of recommendations. The engine
then displays the first node in the list.
If the node is an informational node, when the user gives evidence, the troubleshooter
sets the informational node to the state that corresponds to that evidence. The
troubleshooter algorithm then recalculates the network, based upon the new evidence,
and generates a new list of recommendations.
The same thing happens if the user observes a cause node as normal. But if the user
observes the cause as abnormal, the troubleshooter determines that it has diagnosed
the cause of the user’s problem, and ends. This is true no matter how many abnormal
states exist for the node. (For this reason, there is no point in assigning more states to
a cause node. Informational nodes, however, can influence the network differently
with different abnormal states.)
We will discuss causal independence later, but for now, it’s important to note that if
you want an informational node to divide causes or groups of causes, rather than just
eliminate them, you must use a standard assessment.
In either an informational or a cause node, the user may elect to “skip” the node.
Doing so sets the state of the node to the unknown state. The algorithm does not
recalculate the network; instead, the troubleshooter engine displays the next node in
the list of recommendations.
Technically, the unknown state is the same as the unobserved state. However, nodes
that have been marked as unknown are tracked. If the troubleshooter algorithm
recommends them in a future inference, the engine skips them.
Pseudo nodes
The troubleshooter engine displays several pages that do not correspond to nodes in
the troubleshooter model. The text for these pages is stored in network properties. We
call these pages pseudo nodes, and finished troubleshooters display them at the end of
a session. Pseudo nodes include the Bye page, the Fail page, the Impossible page, and
the Service page.
If the user observes a cause node as abnormal, troubleshooting ends in success, and
the troubleshooter engine displays the Bye page.
If, on the other hand, all causes are observed as normal, troubleshooting ends in
failure, and the troubleshooter engine displays the Fail page.
If some causes have been skipped, instead of displaying the Fail page, the
troubleshooter engine displays the Service page and allows you to go back to the
unknown nodes.
If the user has given evidence that contradicts the possible outcomes of the
troubleshooter, the troubleshooter engine displays the Impossible page. This
frequently happens when evidence narrows the possible cause of a problem to a group
of causes that the user observes as normal.
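The end-of-session rules above can be sketched in a few lines of Python. This is only an illustration of the Bye, Fail, and Service rules as described here: the State enum and the closing_page function are hypothetical names, and the sketch leaves out the Impossible page and anything the real engine does with costs and probabilities.

```python
from enum import Enum


class State(Enum):
    UNOBSERVED = "unobserved"  # initial state of every node in a session
    UNKNOWN = "unknown"        # the user skipped the node
    NORMAL = "normal"
    ABNORMAL = "abnormal"


def closing_page(cause_states):
    """Return the pseudo node a session would end on, or None to continue.

    cause_states maps a cause name to its State. Hypothetical helper: the
    real engine also weighs costs and probabilities, which are ignored here.
    """
    states = list(cause_states.values())
    if State.ABNORMAL in states:
        return "Bye"      # a cause was observed as abnormal: success
    if State.UNOBSERVED in states:
        return None       # there is still something left to recommend
    if State.UNKNOWN in states:
        return "Service"  # skipped causes remain; the user may go back
    return "Fail"         # every cause was observed as normal
```

For example, a session in which one cause is normal and the other was skipped ends on the Service page, while a session in which both causes are normal ends on the Fail page.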
Costs and probabilities
The heart of Bayesian troubleshooters is determining probabilities and costs. Pure
Bayesian networks use only probabilities. But because troubleshooting with
probabilities alone may result in more work, Microsoft Research developed a way to
account for effort in terms of costs. The troubleshooter algorithm examines all
possible paths to a solution, and then ranks them according to their likelihood of
success and their total costs. It generates a list of recommendations indicating which
nodes, in order, are the best candidates to examine for a quick resolution of the
problem.
Existing troubleshooters vary wildly in how costs and probabilities are assigned. An
earlier version of this document attempted to improve this situation by defining more
specifically how to assign probabilities and costs. This method, while somewhat
cumbersome, has proved to be fairly effective. Implementing a relational database that
gathers the information directly from the writer would simplify determining costs.
Calculating costs
When we think of cost, our first inclination is to think of expense in terms of dollars.
Costs may also be measured in seconds, or work. There are at least five costs to
consider:
• Time
• Difficulty
• Expense
• Risk
• Importance
Each of these should be considered separately.
Time
When figuring time, make an estimate of how long a slow absolute beginner would
take to observe the symptom or perform the resolution. Most tasks will take at least a
minute just to read through and understand. Tasks that involve rebooting may take at
least 5 minutes, just to get back to the appropriate place in the troubleshooter.
Generally, allow 1 minute for each screen the user interacts with, including the
troubleshooter itself. If the instructions take three screens to scroll through and read,
that’s 3 minutes. If the user has to change settings on two pages of a control panel,
that’s another 2 minutes, plus a minute to open the control panel. Such a procedure
would take about 6 minutes.
See table 3 for some example tasks.
Difficulty
How much experience should the user have before attempting
this? If you’re providing steps for a network administrator, even
though the actual time to observe is short, that administrator
has taken years to gain the knowledge to be able to do the task.
For the troubleshooter itself, use the target audience. If all
instructions start with the Start button, it’s written for a novice
user. If, on the other hand, the instructions tell the user to run a
SELECT query with specified parameters, the target audience is
either a power user or administrator.
Expense
Does the resolution involve buying a new part, or a new computer? If so, write down
a rough dollar cost. Estimate low, for expense—the user may have the item on hand,
and he/she can always skip the node and return later.
Difficulty factor
1. Novice
2. Beginner
3. Intermediate
4. Power User
5. Administrator
Risk
The final thing to consider is the consequences of messing up.
Generally, if the user needs to tweak the registry, the risk is
high. If the fix involves changing some system files, the risk is
medium. If the issue is trivial, the risk is low.
Multiply the cost you’ve calculated so far by the risk factor to
determine the final cost.
Importance
The importance category is simply a way of identifying symptoms you must know
before you can do anything else. Sometimes crucial information can save a lot of time
in the long run; gathering this information first will make troubleshooters more
effective.
Calculating cost
First, add up the time of all the tasks in minutes. Then, multiply this number by the
difficulty of the most difficult step. Add any dollar expense to this figure. Then,
multiply this figure by the risk factor. Finally, if a given symptom is crucial to know
ahead of time, divide the final figure by 10. For figuring most costs, generally use
whole numbers. However, don’t feel obligated to do so—for crucial information, it is
quite acceptable to have costs as low as 0.01.
Risk factor
1. Low
2. Medium
3. High
Costs of sample tasks

Task                                                                             | Time | Difficulty | Cost(1) | Expense | Cost(2) | Risk | Final cost
Read one default troubleshooter screen, steps beginning from Start               |    1 |          1 |       1 |       0 |       1 |    1 |          1
Reboot                                                                           |    5 |          1 |       5 |       0 |       5 |    1 |          5
Start Control Panel                                                              |    1 |          1 |       1 |       0 |       1 |    1 |          1
Use Add Printer wizard (5 pages in wizard)                                       |    5 |          1 |       5 |       0 |       5 |    1 |          5
Set printer as default                                                           |    1 |          1 |       1 |       0 |       1 |    1 |          1
Find registry key                                                                |    1 |          4 |       4 |       0 |       4 |    3 |         12
Adjust display properties (2 pages in TS, Control Panel, changes on 2 tabs)      |    5 |          1 |       5 |       0 |       5 |    1 |          5
Start in Safe/VGA mode (2 pages in TS, Reboot, 1 screen)                         |    8 |          2 |      16 |       0 |      16 |    1 |         16
Get updated driver from HCL (3 pages in TS, 3 screens, web connection, download) |   12 |          2 |      24 |       0 |      24 |    1 |         24
Table 3: Sample cost worksheet. Cost (1) and cost (2) are subtotals of the calculations after each column.
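The worksheet arithmetic is easy to drop into a small function. The sketch below follows the steps given under Calculating cost (time times difficulty, plus expense, times risk, divided by 10 for crucial information). The name task_cost and its parameters are made up for illustration and are not part of any authoring tool.

```python
def task_cost(minutes, difficulty, expense=0.0, risk=1, crucial=False):
    """Cost of a symptom or resolution, following the worksheet in table 3.

    minutes    -- total time of all the tasks, in minutes
    difficulty -- difficulty factor of the hardest step (1 = Novice .. 5 = Administrator)
    expense    -- rough dollar cost, if any
    risk       -- risk factor (1 = Low, 2 = Medium, 3 = High)
    crucial    -- True if the symptom must be known before anything else
    """
    cost = minutes * difficulty  # Cost (1)
    cost += expense              # Cost (2)
    cost *= risk                 # final cost
    if crucial:
        cost /= 10               # crucial information is made cheap to ask for
    return cost


# Two rows from table 3:
print(task_cost(1, 4, risk=3))  # Find registry key
print(task_cost(8, 2))          # Start in Safe/VGA mode
```

A spreadsheet does the same job. The point is only that each factor is entered separately and combined in a fixed order.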
Types of Costs
Troubleshooters use three different costs. Use the above procedure to figure costs for
symptoms and resolutions.
Cost to Observe
Figure the cost to observe based upon the steps involved in observing a symptom.
Generally the risks and expense associated with observing symptoms are very low.
When a cost to observe is in an informational node, it generally needs to be very low
compared to other parent nodes, or the node will be skipped. The troubleshooter
engine may decide it’s cheaper to resolve the problem than it is to observe a different
symptom.
When assessing a fixable/observable node, the troubleshooter engine uses the cost to
observe, and generally ignores the cost to fix.
All problem nodes have a cost to observe of 1.0.
Cost to fix
Figure the cost to fix based upon the steps involved in resolving a symptom. The
troubleshooter engine seems to ignore the cost to fix of a fixable/observable node,
except when two nodes are otherwise equally assessed.
Fixable/unobservable nodes, on the other hand, don’t have a cost to observe, so all
cost calculations are done using the cost to fix.
Cost of Service
The cost of service is a network property. The troubleshooter ends with the service
node if the cost to observe or fix of unobserved nodes, accounting for probabilities,
exceeds the cost of service. Set this number substantially higher than the highest cost
in the network. Most existing troubleshooter networks use a cost of service of 2000.
Probabilities
In the next section, we will discuss methods of determining probabilities for causally
independent nodes.
Prior Probability
All topmost nodes have a prior probability. The prior probability represents the
number of times the cause appears in any given sample of the population. It’s
important to note that you can’t measure this number directly—it’s related, but not
necessarily equal, to the number of computers exhibiting a problem.
Bayesian troubleshooters are supposed to model the actual distribution of problems
on all computers, whether they’re showing symptoms or not. Out of all computers
using Office 2000, what percentage are Windows NT? Out of all Windows 2000
computers, how many have TCP/IP configured correctly? Out of all computers with
printers, how many have printer sharing enabled?
The obvious problem here is that there is no way we can know, definitively, the
answers to these questions. The evidence we can gather is not the same as the prior
probability we’re trying to calculate.
Suppose you want to determine the chances of finding a parking space at the
Microsoft campus at 11 a.m. on a work day. The prior probability is the percentage of
workdays that there is a parking space available at 11 a.m., given all work days (perhaps
25%). The link probability is the chance that if there’s an available parking space at 11
am on the day you go, that you will get it (close to a hundred percent, say 90%—10%
of the time somebody cuts you off and beats you to it). The MSBN multiplies these
probabilities to determine the actual chance that you will get a parking place if you go
to campus at 11 a.m. (the actual probability is about 22%).
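As a quick check of the parking example, here is the multiplication written out. The variable names are only for illustration; the two input figures are the ones assumed in the paragraph above.

```python
# Figures from the parking example above.
prior_space_available = 0.25  # prior: a space is free at 11 a.m. on a work day
link_you_get_space = 0.90     # link: you get the space, given that one is free

chance_you_park = prior_space_available * link_you_get_space
print(f"{chance_you_park:.1%}")
```

The product is 22.5%, which the text rounds to about 22%.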
The problem for troubleshooting networks is that we measure the results, not the
prior probabilities. We don’t have somebody standing in the parking lot every day at
11 am, recording whether it’s full or not. The data we have tells us only the number of
times people went to the parking lot and couldn’t park. This is the same number the
MSBN tries to generate, but if we plug this number in as the prior probability,
we have given it irrelevant information. Support call information tells us, for a
given problem (couldn’t park), how often that problem was caused by a given cause (no
space available), as opposed to other causes (my car died).
Early versions of the authoring tool automatically inserted a prior probability of 95%
normal/5% abnormal into all new cause nodes. I tend to exaggerate the actual
probability I assign, so that the percentage of abnormal ranges between 1 and 20%.
Updating prior probabilities
Bayesian networks allow for adjusting prior probabilities based upon
observed results. Unfortunately, current troubleshooters do not
implement the algorithm that makes this adjustment.
Future troubleshooters may well use this technology, however, and the
more accurate our probabilities are up front, the less drastically they will
change with automatic updates.
Link Probability
The link probability defines the relationship between the parent and the child.
Child nodes do not have their own prior probabilities. Instead, a child node’s
probability is calculated from all of its parents, using the link probability.
When you assess a child node, the authoring tool first constructs an evidence table
listing all the possible states of the child node’s parents. One line in the evidence table
is the case where all parent nodes are normal. If all causes are normal, we would
probably expect the problem to be normal. This case is called the “leak” state, because
we generally insert a small amount of abnormal, to cover causes we didn’t think of.
The other cases reflect the effect of an abnormal cause on the problem. If your printer
is out of paper, how often will you not be able to print? If your car is out of gas, will it
fail to start? Quite often, the answers are Boolean; that is, 0% or 100%. An empty tank
will always cause a car to not start. No paper will always cause your printer to fail to
print. We describe assessing a node in this Boolean manner as deterministic. Using
deterministic probabilities in a Bayesian troubleshooter reduces its behavior to
standard logic, if-then-else and hyperlink style.
But often there is a gray area here. Network congestion may sometimes cause a page
not found error, but not always. Water in your gas may sometimes keep your car from
starting, but not always. These shades of probabilities are the meat of Bayesian
troubleshooters—what they can do better than other methods.
This whole section has described assessing child probabilities using causal
independence. Causal independence means that each parent of a child node is
assessed separately from all other parents. The alternative, standard assessment, will be
discussed later.
For all parent/child relationships, you need to assign a probability describing the
distribution of the child’s states given each possible state of the parent. We will explore
this deeper in the next chapter.

Chapter 4
Basic modeling
Getting from here to there; doing the wiring; are your parents related?
As we discussed in Chapter 1, the main job of the knowledge engineer is to interpret
information written for us and build a working model. There are some common
techniques we use to do this, but when models become more complex, different
knowledge engineers will design different solutions.
Before modeling, the knowledge engineer should have as much content as possible
organized into problems, causes, evidence, and resolutions. The crucial part of the
equation is evidence: the more the better. If you’re working on a long troubleshooter
and you have little evidence, squeeze more out of your writer.
Matching Information to Nodes
Figure 3 is a map of how different information components organized by the author
map to troubleshooter models.

Figure 3: Mapping template information to a troubleshooter model
For the next few pages, we’re going to take a look at an example using the new
content template. Assuming the writer has broken the information into effective
blocks, we will have an outline of potential bicycle maintenance problems that may
look like table 4.
The left column is a list of problems applicable to an individual troubleshooter. The
next is a list of causes applying to each problem. Note that there is some duplication
here: Brakes Need Adjusting, Brakes Not Toed In, and Wheels Need Truing all
appear more than once. The right column contains evidence and resolutions applying
to each cause, again duplicated not only across problems, but also across causes. The
resolution Adjust Brakes appears four times. Several pieces of evidence repeat
themselves, too.
Problem         | Cause                   | Evidence                        | Resolution
Bike won't stop | Brake Pads worn out     | The brake pads are old          | -
Bike won't stop | Brake Pads worn out     | Rubber worn past line           | -
Bike won't stop | Brake Pads worn out     | -                               | Replace brake pads
Bike won't stop | Brake Pads worn out     | -                               | True wheel
Bike won't stop | Brakes need adjusting   | Brake levers go to handlebar    | -
Bike won't stop | Brakes need adjusting   | Brake pads don't line up to rim | -
Bike won't stop | Brakes need adjusting   | -                               | Adjust brakes
Bike won't stop | Brakes not toed in      | Brake pads don't line up to rim | -
Bike won't stop | Brakes not toed in      | -                               | Adjust brakes
Bike won't stop | Wheels need truing      | Wheel wobbles when spun         | -
Bike won't stop | Wheels need truing      | -                               | True wheel
Brakes squeal   | Brakes need adjusting   | Brake levers go to handlebar    | -
Brakes squeal   | Brakes need adjusting   | Brake pads don't line up to rim | -
Brakes squeal   | Brakes need adjusting   | -                               | Adjust brakes
Brakes squeal   | Brakes not toed in      | Brake pads don't line up to rim | -
Brakes squeal   | Brakes not toed in      | -                               | Adjust brakes
Chain falls off | Bottom Bracket Loose    | Cranks are loose                | -
Chain falls off | Bottom Bracket Loose    | -                               | Adjust Bottom Bracket
Chain falls off | Bottom Bracket Loose    | -                               | Overhaul Bottom Bracket
Chain falls off | Chain twisted           | Chain is kinked                 | -
Chain falls off | Chain twisted           | -                               | Replace Chain
Chain falls off | Derailleur not adjusted | -                               | Adjust derailleur
Chain falls off | Hub loose               | Wheel wobbles when spun         | -
Chain falls off | Hub loose               | -                               | Overhaul hub
Chain falls off | Hub loose               | -                               | Tune hub
Wheels wobble   | Axle bent               | Axle doesn't fit in dropouts    | -
Wheels wobble   | Axle bent               | -                               | Replace axle
Wheels wobble   | Hub loose               | Wheel wobbles when spun         | -
Wheels wobble   | Hub loose               | -                               | Overhaul hub
Wheels wobble   | Hub loose               | -                               | Tune hub
Wheels wobble   | Wheels need truing      | Wheel wobbles when spun         | -
Wheels wobble   | Wheels need truing      | -                               | True wheel

Table 4: Raw problem outline for bicycle troubleshooters.
The first thing to do is list the resolutions and the problems. There are 4 problems in
table 4. Below is a list of the distinct resolutions, and next to the list, figure 4 shows
how you might put resolutions and problems into the MSBN tool. Note that for now,
the resolutions are all in fixable/unobservable nodes.

Resolutions
• Replace brake pads
• Adjust brakes
• True wheel
• Adjust derailleur
• Replace Chain
• Replace axle
• Tune hub
• Overhaul hub
• Adjust Bottom Bracket
• Overhaul Bottom Bracket

Figure 4: Problems and resolutions for Bike troubleshooter
The next step is to list the resolutions with the appropriate problem. For our example,
we get the following information:

Problem         | Resolutions
Bike won’t stop | True wheel
                | Replace brake pads
                | Adjust brakes
Brakes squeal   | Adjust brakes
Chain falls off | Tune hub
                | Replace Chain
                | Overhaul hub
                | Overhaul Bottom Bracket
                | Adjust derailleur
                | Adjust Bottom Bracket
Wheels wobble   | Tune hub
                | True wheel
                | Replace axle
                | Overhaul hub
Table 5: Resolutions associated with problems.
Figure 5 shows how this will appear in the MSBN:

Figure 5: Resolutions linked to problems.
Now comes the fun part: evaluating symptoms. To create a symptom, we combine the
sets of evidence and phrase them as a question. Take a look at the symptom information
in table 6, and the causes to which the symptoms apply. In this example, all of our
symptoms are tests, because each has only one piece of evidence. If, during a session,
we detect the evidence, we observe the symptom of the node as abnormal. If we don’t
detect this evidence, we observe the symptom as normal.
Symptom                                  | Evidence                        | Causes                | Resolutions
Are cranks loose?                        | Cranks are loose                | Bottom Bracket Loose  | Overhaul Bottom Bracket
                                         |                                 |                       | Adjust Bottom Bracket
Are the brake pads very old?             | Brake pads are old              | Brake Pads worn out   | Replace brake pads
Does the axle fit in the dropouts?       | Axle doesn’t fit in dropouts    | Axle bent             | Replace axle
Do the brake levers go to the handlebar? | Brake levers go to handlebar    | Brakes need adjusting | Adjust brakes
Do the brake pads line up to the rim?    | Brake pads don’t line up to rim | Brakes need adjusting | Adjust brakes
                                         |                                 | Brakes not toed in    | Adjust brakes
Is there a kink in the chain?            | Chain kinked                    | Chain twisted         | Replace Chain
Is the rubber worn past the line?        | Rubber worn past line           | Brake Pads worn out   | Replace brake pads
Does the wheel wobble when spun?         | Wheel wobbles when spun         | Hub loose             | Overhaul hub
                                         |                                 |                       | Tune hub
                                         |                                 | Wheels need truing    | True wheel

Table 6: Symptoms and evidence table.
Let’s take a look at these one at a time. First, Are cranks loose? applies to two
resolutions: Overhaul bottom bracket and Adjust bottom bracket. This sounds
like a test. So we’ll add it as one.
The next three, Are the brake pads very old?, Does axle fit in the dropouts? and
Do the brake levers go to the handlebar? each only identify one resolution, so we
will combine the text for these symptoms with the resolution text, making the nodes
fixable/observable. We will also change the name of each node to the cause.
The next symptom, Do the brake pads line up to the rim?, applies to two causes,
but both causes have the same resolution, Adjust brakes. This is where you will need
to use some judgment. Does the end user need to know the underlying causes? In this
case, probably not—this may be too much information. If the information is
necessary, it still can be listed on the individual node. This symptom, again, only
identifies a single resolution. We can either add the symptom information to the other
symptom we’ve already added to this node, or choose which symptom is most
authoritative and discard the other one.
Is there a kink in the chain? also only identifies a single resolution. We’ll treat that as
above.
Is the rubber worn past the line? again applies to a cause we already have a
symptom for. We will probably add this content to the same node.
Does the wheel wobble when spun? applies to multiple resolutions, so again, we’ll
add them to the model, which now looks like figure 6.
Figure 6: Bike Troubleshooter with observable cause information and tests
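The walk-through above follows a rule of thumb: a symptom that points to a single resolution is folded into a fixable/observable node, and a symptom that points to more than one resolution becomes a test. The sketch below applies that rule to the rows of table 6. The dictionary and the placement function are illustrative names only, and the rule is no substitute for the judgment calls described above.

```python
# Symptom -> resolutions it points to, copied from table 6.
SYMPTOM_RESOLUTIONS = {
    "Are cranks loose?": ["Overhaul Bottom Bracket", "Adjust Bottom Bracket"],
    "Are the brake pads very old?": ["Replace brake pads"],
    "Does the axle fit in the dropouts?": ["Replace axle"],
    "Do the brake levers go to the handlebar?": ["Adjust brakes"],
    # Two causes, but both lead to the same resolution.
    "Do the brake pads line up to the rim?": ["Adjust brakes", "Adjust brakes"],
    "Is there a kink in the chain?": ["Replace Chain"],
    "Is the rubber worn past the line?": ["Replace brake pads"],
    "Does the wheel wobble when spun?": ["Overhaul hub", "Tune hub", "True wheel"],
}


def placement(resolutions):
    """Suggest where a symptom goes, based on how many distinct resolutions it identifies."""
    distinct = set(resolutions)
    return "test" if len(distinct) > 1 else "fixable/observable"


for symptom, resolutions in SYMPTOM_RESOLUTIONS.items():
    print(f"{symptom} -> {placement(resolutions)}")
```

Only Are cranks loose? and Does the wheel wobble when spun? come out as tests, which matches the walk-through.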
Now, the model in figure 6 is fairly effective as it stands, but there are a few things to
point out, before we assign any probabilities or costs. First of all, the Wheels wobble?
test will probably appear when the user troubleshoots the Chain falls off problem,
but may not appear when the user troubleshoots the Bike won’t stop problem,
depending on the costs and probabilities. Generally, a test needs to be substantially
cheaper than the costs of its parents, or it may not be shown.
As you build networks, you need to carefully evaluate every problem to see if the user
will see all crucial information. If not, you may need to pull more symptom
information out of informational nodes and put it in the observable part of a
fixable/observable node.
Entering Costs
After dividing information and putting it in appropriate nodes, the next step is to fill in
the cost information. We’ve already discussed calculating costs in Chapter 3, above.
You can quickly generate costs by dropping the variables into a spreadsheet. A
relational database template can easily generate a cost figure for you, based upon the
writer’s input on each of the factors.
Prior Probabilities
You can find descriptive values for probabilities on the Cause information provided
by the writer. These range from Always to Never (these two particular values should
not appear in a cause node…). Generally, causes should have a high probability of
being normal and a low probability of being abnormal. Most causes should never be
abnormal more than 40% of the time. A general range for the abnormal probability
for causes is between 1% and 40%.
It’s fine for more than one cause to have the same probability.
Link Probabilities
This information isn’t always apparent. Think about the cause and the symptom
you’re considering. Treat problems the same as tests. If a cause is abnormal, does the
symptom always occur? If not, how often does the cause generate the symptom? If
you need guidance, go back to the writer.
You’ll also need to assess whether or not a symptom will appear, if none of its parent
causes are abnormal. This is the Leak State, the condition when all parent causes are
normal. If you’ve covered all possible causes, then the leak state would be zero
percent abnormal and 100% normal. If there could possibly be another responsible
cause covered elsewhere in the troubleshooter, or if there could be another possible
cause that you have not modeled, you should leave a percent or two of uncertainty
here, to allow the troubleshooter to continue.
This leak state can change the result the user sees. If the user indicates the result of a
test is abnormal, but all of its parent causes are normal, and the leak state is 100%
normal, the troubleshooter will end, displaying the Impossible pseudo node. If the
leak state has a slight leak, the troubleshooter will continue displaying other possible
causes until it exhausts all the causes in the model, and then show the Fail pseudo
node.
Summary
This section has covered basic modeling using causally independent nodes, tests,
fixable/observable and unobservable nodes. Next, we will explore standard
assessments and build an asymmetric assessment.
Standard assessments
Now that we’ve covered the easy stuff, we’ll take a closer look at the core of the
Bayesian networks: link probabilities. A child node has no prior probability of its
own; in its place is the set of link probabilities to its parents. You do not assess a
probability for a child node independent from its parents.
The MSBN tool provides two different types of assessment on child nodes. We’ve
already used the first, causal independence, in which the effect of each parent on the
child is assessed independently of all others. The other type, called standard
assessment, must be used when the parent nodes interact with one another. The most
common example is when you use a configuration node to divide cause nodes into
discrete groups.
The first implication of using standard assessment is that the concept of normal and
abnormal is handled differently. When a causally independent test node is normal, or
state zero, it can eliminate any parent nodes that the test definitively identifies. When it
is any abnormal state, it no longer eliminates nodes; it only reorders the list of
recommendations.
Standard nodes, on the other hand, can eliminate different groups of parent nodes,
simply by using one (or more) parent as a switch. The simplest way of doing this is
called an asymmetric assessment.
To illustrate the difference between a causally independent node and a standard node,
let’s look at two different evidence tables, or truth tables. Both of these have four parent
nodes, but as you can see, the table is much bigger in the standard assessment.
Causally independent truth table: Bike won't stop

Brake pads worn out | Brakes out of adjustment | Brakes not toed in | Wheels need truing | Normal | Abnormal
False (Normal)      | False (Normal)           | False (Normal)     | False (Normal)     | 99%    | 1%
True (Abnormal)     | *                        | *                  | *                  | 5%     | 95%
*                   | True (Abnormal)          | *                  | *                  | 3%     | 97%
*                   | *                        | True (Abnormal)    | *                  | 25%    | 75%
*                   | *                        | *                  | True (Abnormal)    | 50%    | 50%
Table 7: Causally independent truth table.
In table 7, the child node is the problem Bike won’t stop. Its parents are Brake pads
worn out, Brakes out of adjustment, Brakes not toed in, and Wheels need
truing. There are five cases we need to assess in this example: the case in which none
of the parents are abnormal, plus one case for each abnormal parent.
Figure 7: Causally independent model assessed in table 7.
We assess the distribution between the states of Bike won’t stop for each case in the
truth table. Each case is one line. All of the states for each case must add up to 100%.
But note that there is no relation between cases. We assess each case independently
of all others.
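Since every case must add up to 100%, a quick check over the table catches typing mistakes before they reach the model. The sketch below uses the figures from table 7. The TABLE_7 layout and the invalid_cases helper are hypothetical and are not how the MSBN tool stores assessments.

```python
# Case -> (percent normal, percent abnormal) for Bike won't stop, from table 7.
TABLE_7 = {
    "All parents normal (leak)": (99, 1),
    "Brake pads worn out": (5, 95),
    "Brakes out of adjustment": (3, 97),
    "Brakes not toed in": (25, 75),
    "Wheels need truing": (50, 50),
}


def invalid_cases(table):
    """Return the names of cases whose state percentages do not total 100."""
    return [case for case, percents in table.items() if sum(percents) != 100]


print(invalid_cases(TABLE_7))
```

Each case is checked on its own, which mirrors the point above: the cases do not constrain one another.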
It is not unusual for each case to be assessed at 0%/100%. Some causes always cause a
problem. This example shows how you can weight certain causes over others when
B A S I C S O F K N O W L E D G E E N G I N E E R I N G

34
3434
34


you’re assessing different symptoms. If a bicycle’s brakes aren’t toed in, in this case,
there’s a 75% chance that you will not be able to stop. But there may be a 100%
chance that your brakes will squeal (some completely different symptom/problem).
Now, taking the same node, let’s swap a parent. Say that we have to troubleshoot
bikes with coaster brakes. Coaster brakes don’t have any brake pads, the pads can’t be
toed in, and it doesn’t matter if the wheels are trued. To keep this table from
becoming enormous, we’ll change the parent nodes a bit.
In this second case, the Type of brake node acts as a switch, with three states. If this
node has the state of Coaster, the state of Bike won’t stop is assessed the same as the
Hub needs grease node. If the Type of brake is Pad, Bike won’t stop is assessed
the same as Brake pads worn out. If Type of brake is Disc, Bike won’t stop is
assessed the same as Brakes out of adjustment.
Compared to the truth table for the causally independent version, this truth table is
huge. Not only do we have to determine the appropriate states of a lot of different
cases, but also the file quickly grows. Each added parent node multiplies the number
of cases by its number of states (usually 2). Each added state adds another multiple of
all the cases of the other nodes’ states.
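To see how quickly the two kinds of table grow, we can count cases from the number of states of each parent. The function names below are made up for illustration. The counting rules are the ones described in this section: one case per abnormal parent state plus the all-normal case for causal independence, and every combination of parent states for a standard assessment.

```python
from math import prod


def causally_independent_cases(parent_state_counts):
    """One case per abnormal state of each parent, plus the all-normal case."""
    return 1 + sum(count - 1 for count in parent_state_counts)


def standard_cases(parent_state_counts):
    """One case for every combination of parent states."""
    return prod(parent_state_counts)


four_binary_parents = [2, 2, 2, 2]   # the parents in table 7
switch_plus_three = [3, 2, 2, 2]     # Type of brake (3 states) plus three causes

print(causally_independent_cases(four_binary_parents))
print(standard_cases(four_binary_parents))
print(standard_cases(switch_plus_three))
```

Four two-state parents need 5 cases under causal independence but 16 under a standard assessment, and the three-state switch with three two-state causes needs 24.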
As complexity of standard assessments grows, the files become huge and
performance becomes sluggish. But by grouping causes together using
Other nodes, you can drastically reduce the complexity of the model.
Just by following this tactic, I reduced an old troubleshooter in size from
114 KB to 69 KB, and fixed some strange behavior by doing so.
Standard Assessment truth table: Bike won't stop

Type of brakes | Brakes out of adjustment | Brake pads worn out | Hub needs grease | Normal | Abnormal
Coaster        | False (Normal)           | False (Normal)      | False (Normal)   | 100%   | 0%
Coaster        | False (Normal)           | False (Normal)      | True (Abnormal)  | 0%     | 100%
Coaster        | False (Normal)           | True (Abnormal)     | False (Normal)   | 100%   | 0%
Coaster        | False (Normal)           | True (Abnormal)     | True (Abnormal)  | 0%     | 100%
Coaster        | True (Abnormal)          | False (Normal)      | False (Normal)   | 100%   | 0%
Coaster        | True (Abnormal)          | False (Normal)      | True (Abnormal)  | 0%     | 100%
Coaster        | True (Abnormal)          | True (Abnormal)     | False (Normal)   | 100%   | 0%
Coaster        | True (Abnormal)          | True (Abnormal)     | True (Abnormal)  | 0%     | 100%
Pads           | False (Normal)           | False (Normal)      | False (Normal)   | 100%   | 0%
Pads           | False (Normal)           | False (Normal)      | True (Abnormal)  | 100%   | 0%
Pads           | False (Normal)           | True (Abnormal)     | False (Normal)   | 0%     | 100%
Pads           | False (Normal)           | True (Abnormal)     | True (Abnormal)  | 0%     | 100%
Pads           | True (Abnormal)          | False (Normal)      | False (Normal)   | 100%   | 0%
Pads           | True (Abnormal)          | False (Normal)      | True (Abnormal)  | 100%   | 0%
Pads           | True (Abnormal)          | True (Abnormal)     | False (Normal)   | 0%     | 100%
Pads           | True (Abnormal)          | True (Abnormal)     | True (Abnormal)  | 0%     | 100%
Disc           | False (Normal)           | False (Normal)      | False (Normal)   | 100%   | 0%
Disc           | False (Normal)           | False (Normal)      | True (Abnormal)  | 100%   | 0%
Disc           | False (Normal)           | True (Abnormal)     | False (Normal)   | 100%   | 0%
Disc           | False (Normal)           | True (Abnormal)     | True (Abnormal)  | 100%   | 0%
Disc           | True (Abnormal)          | False (Normal)      | False (Normal)   | 0%     | 100%
Disc           | True (Abnormal)          | False (Normal)      | True (Abnormal)  | 0%     | 100%
Disc           | True (Abnormal)          | True (Abnormal)     | False (Normal)   | 0%     | 100%
Disc           | True (Abnormal)          | True (Abnormal)     | True (Abnormal)  | 0%     | 100%
Table 8: Standard Assessment truth table.
Note
As of August 1999, our current tool has a bug that prevents the actual
truth table shown above from working. You can only assess about half
of the table. When you scroll down to lower cases, you’ll get a “subscript
out of range” error.
The only workaround for this bug is to do the assessment in Notepad.
The last part of the file contains lists of probabilities. Percentages are
expressed as decimals, totaling 1. Future versions of the tool will fix this
bug.
In short, Causal Independence and Standard Assessments are completely different
ways of assigning probability. You cannot substitute one for the other. While standard
assessments are inherently more powerful, they are also more complex for both you
and the modeling engine.
• Causally independent nodes have one case for each abnormal state of each
parent, plus one case for when all parents are normal. For example, if all parents
have two states, the number of cases for the child node is 1 + the number of
parents. For 5 parents, this is 1 plus 5, or 6.
• To calculate the number of cases for a standard node, multiply together the
number of states of every parent. For example, if all parent nodes have 2
states, the number of cases for the child node is 2 to the power of the number
of parents. For 5 parents, this is 2 to the fifth power, or 32.
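The two counting rules above can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes each parent has exactly one normal state, so a parent with n states has n - 1 abnormal states.

```python
from math import prod

def causally_independent_cases(parent_state_counts):
    """One case per abnormal state of each parent, plus the all-normal case."""
    return 1 + sum(n - 1 for n in parent_state_counts)

def standard_cases(parent_state_counts):
    """One case for every combination of parent states."""
    return prod(parent_state_counts)

five_binary_parents = [2, 2, 2, 2, 2]
print(causally_independent_cases(five_binary_parents))  # 1 + 5
print(standard_cases(five_binary_parents))              # 2 ** 5

# Parents of Bike won't stop in table 8: a 3-state switch plus three 2-state causes.
table_8_parents = [3, 2, 2, 2]
print(standard_cases(table_8_parents))
```

The first count grows by addition and the second by multiplication, which is why standard assessments become unwieldy so much faster as parents are added.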
Building an asymmetric assessment
The Bike #2 Troubleshooter Summary (see appendix) shows early output from the
content template of a problem that uses Standard Assessment, as shown in the
previous section. These are a set of tables showing different organizations of content
you might receive from a writer. In this section, we will organize and build this model,
using the Asymmetric Assessment Wizard to set up our standard assessment.
First, create our problems and resolutions, as outlined above. There is one problem:
Bike won’t stop. There are three resolutions, attached to the causes Hub needs
grease, Brakes out of adjustment, and Brake pads worn out.
From the Symptom Summary table, we can see that the causes Brakes out of
adjustment and Brake pads worn out each have distinct symptoms, so we will make
them fixable/observable. We put the text and cost to observe from Are pads lined
up? into Brakes out of adjustment. We put the text and cost to observe from Pads
worn? into Brake pads worn out. We add Type of brakes? as an informational
node.
From the Resolution Summary table, we note that each resolution has a single, distinct
cause. We can therefore map each resolution to a cause, using the Cost to Fix and text
for the appropriate cause.
On the Cause table, we see that the frequency of Brake pads worn out is Often, so
we’ll assign this a prior probability of 70% normal, 30% abnormal. The frequency of
Brakes out of adjustment is Sometimes, so we’ll set its prior probability to 90%
normal, 10% abnormal. The frequency of Hub needs grease is Rarely, so we’ll set
its prior probability to 98% normal, 2% abnormal.
We also see on this page that each cause only applies to one type of brake. In this case,
the Type of brakes? informational node provides configuration information, not test
information. There is no normal or abnormal type of brake. There are just three types
of brakes we’re considering in this example. In this case, the informational node is
used as a parent node, to divide the possibilities into different options. Figure 8 shows
the model as we now construct it:

Figure 8: Standard Assessment model assessed in tables 8 and 9.
We’ve already seen a standard assessment of this example in the previous section.
However, we can build this assessment more quickly by using the Asymmetric
Assessment wizard. This wizard allows you to group cases hierarchically,
and then assess the resulting states. Figure 9 shows the hierarchy we construct using
the wizard, based upon the information in the Cause table in our example.

Figure 9: Hierarchical view of standard assessment using Asymmetric Assessment wizard.
The Asymmetric Assessment wizard is a shortcut that collapses the multitude of states
to a reduced set. You must still assess the probabilities on this reduced set, but the
truth table gets reduced to the cases shown in table 9: Six cases to assess, instead of 24.

Asymmetric Assessment truth table for Bike won't stop

Type of  Brakes out of    Brake pads       Hub needs
brakes   adjustment       worn out         grease           Normal  Abnormal
Coaster  *                *                False (Normal)   100%    0%
Coaster  *                *                True (Abnormal)  0%      100%
Pads     *                False (Normal)   *                100%    0%
Pads     *                True (Abnormal)  *                0%      100%
Disc     False (Normal)   *                *                100%    0%
Disc     True (Abnormal)  *                *                0%      100%

Table 9: After arranging the hierarchy as shown in figure 9, you must still assess the leaf states. The truth table
becomes greatly reduced, but yields the same result as the full truth table shown in table 8.
Note that using the Asymmetric Assessment wizard only simplifies the truth table so
that you have fewer cases to assess. It does not reduce the actual complexity of the
network, or even get saved with the network. The reduced truth table in table 9
contains exactly the same assessment as table 8 in the previous section.
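To see why the reduced table carries the same assessment, the following Python sketch expands each wildcard row of table 9 into the full cases it stands for. It illustrates the idea only and is not how the wizard is implemented; the tuple encoding and the names DOMAINS, ASYMMETRIC, and expand are assumptions made for the sketch (True means Abnormal).

```python
from itertools import product

BOOL = (False, True)
# Parent order: Type of brakes, Brakes out of adjustment, Brake pads worn out, Hub needs grease
DOMAINS = (("Coaster", "Pads", "Disc"), BOOL, BOOL, BOOL)

# Table 9: each pattern maps to whether Bike won't stop is Abnormal. "*" matches any state.
ASYMMETRIC = [
    (("Coaster", "*", "*", False), False),
    (("Coaster", "*", "*", True), True),
    (("Pads", "*", False, "*"), False),
    (("Pads", "*", True, "*"), True),
    (("Disc", False, "*", "*"), False),
    (("Disc", True, "*", "*"), True),
]

def expand(pattern):
    """Yield every fully specified case covered by a wildcard pattern."""
    choices = [dom if value == "*" else (value,) for value, dom in zip(pattern, DOMAINS)]
    return product(*choices)

full = {}
for pattern, abnormal in ASYMMETRIC:
    for case in expand(pattern):
        assert case not in full, "patterns must not overlap"
        full[case] = abnormal

print(len(ASYMMETRIC), "assessed rows expand to", len(full), "cases")
```

Every one of the expanded cases carries the same Normal/Abnormal value as the corresponding row of table 8, so only the amount of manual assessment changes.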
In this example, we’ve used the actual problem node to do our assessment. This is
generally not a good practice, because if we want to add other causes to this problem,
we end up doubling the work we must do for each one. A better way to organize this
information is to group the assessment using an Other type of node, and make this
Other node a parent of the problem, as shown in figure 10.

Figure 10: Using “Other” node to group configuration-specific causes.
So the question is, what do you do if you have several causes that apply to a single
configuration? Or if a cause applies to more than one configuration?
Let’s change our example above, adding a few new causes. First, we’ll decide that
Brakes out of adjustment also applies to Pad brakes, not just Disc brakes. Then
we’ll add a cause Disc warped to the Disc brake configuration, and Rim worn out
to the Pad brake configuration. Figure 11 shows how we would now organize this
model.

Notice that we used Other nodes to group together all causes applying to a given state
of the information node we’re using to switch between configurations. Also, notice
that these Other nodes are causally independent. Let’s discuss this further.

Figure 11: Asymmetric Assessment with groups of configured causes.
The causally independent nodes Disc brakes and Pad brakes are nodes used to
group specific causes applying to a configuration. What we’re assessing in each of
these nodes is whether or not its parents apply to that configuration. The values are
almost always deterministic, either Yes or No.
In almost all cases, assess grouping nodes applying to a configuration with 100%
normal for the leak state, and 100% abnormal for all other states.
Assessing grouping nodes this way essentially eliminates them from the
assessment, allowing each of the parent nodes to be assessed as if it were directly
attached to the grouping node’s children.
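A small Python sketch shows why a deterministically assessed grouping node drops out of the assessment. It assumes that a causally independent node combines its parents like a noisy-OR; that is an assumption made for illustration, not a statement about how the modeling engine is implemented. The function name and the choice of the Pad brakes parents are likewise illustrative.

```python
def p_abnormal(link_probs, leak_abnormal, parent_abnormal):
    """Noisy-OR sketch: the child stays normal only if the leak and every
    abnormal parent all fail to make it abnormal."""
    p_normal = 1.0 - leak_abnormal
    for name, is_abnormal in parent_abnormal.items():
        if is_abnormal:
            p_normal *= 1.0 - link_probs[name]
    return 1.0 - p_normal

# Deterministic assessment: leak state 100% normal, every other state 100% abnormal.
links = {
    "Brake pads worn out": 1.0,
    "Brakes out of adjustment": 1.0,
    "Rim worn out": 1.0,
}

all_normal = {name: False for name in links}
one_abnormal = dict(all_normal, **{"Rim worn out": True})

print(p_abnormal(links, 0.0, all_normal))    # grouping node is certainly normal
print(p_abnormal(links, 0.0, one_abnormal))  # grouping node is certainly abnormal
```

With these values the grouping node is abnormal exactly when at least one of its parents is abnormal, so it adds no uncertainty of its own.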
Now that I’ve given that warning, I’ll contradict myself by saying there is
one reason to assign shades of probabilities to these intermediate
grouping nodes. We will discuss this more later, in the Advanced
Grouping chapter.
Notice in the model shown in figure 11, the cause Brakes out of adjustment applies
to two different configurations. This is perfectly acceptable. By setting up the
relationships in this way, we’re saying that this cause applies to both of these
configurations, but not to the third. If you have multiple configurations, the model will
recommend all parent nodes of each configuration separately from other
configuration nodes.
If you’ve kept up so far, you should now be able to create most basic troubleshooter
networks. The remaining chapters cover more advanced techniques, and tips for
dealing with specific situations.

Intermediate modeling
Too many choices
This section adds a few more concepts to your repertoire. These concepts mainly
provide you with different ways of interpreting and modeling causes. Like evidence,
causes can have a range of specificity. You can define a broad cause, and then drill
down to finer and finer causes until you arrive at the one that ultimately solves the
problem.
Grouping causes
To some extent, we’ve removed the cause information from cause nodes, putting it in
resolutions instead. So why gather cause information at all?
The answer is that we can fine-tune models and improve their performance by using
causes to group information more accurately.
Like symptoms, causes can have varying degrees of generality. They can be extremely
specific, or they can be more general, representing groups of more specific causes.
We’ve already learned how to create and assess specific causes. This chapter will cover
two ways of assessing groups of causes.
Cause as Other node
The troubleshooter team has created several printing troubleshooters. One obvious
general cause of printing problems is lack of connectivity between the printer and the
computer. Specific causes include Bad cable, Bad switching device, Bad printer,
and Bad printer port. The general symptom identifying lack of connectivity between
the printer and the computer involves having the user attempt to print from a
command prompt. If the user can print from a command prompt, we know there is
basic connectivity. If not, then we know the problem is caused by one of these four
specific causes. Some of these causes have more specific tests to observe identifying
symptoms.
Using the procedures we’ve already learned, we would model this situation as in figure
12.

Figure 12: Usual way of organizing lack of connectivity between computer and printer.
However, as we learned in the last chapter, we can use Other nodes to group causes
together into a more general case. For configuration types of information, we grouped
all nodes applying to a particular configuration to a single Other node representing
that configuration.
In this example, we can group all causes applying to a single, more general cause to a
single Other node representing that cause. Then we can test for that cause directly.
After we do our grouping, the network looks like figure 13.

Figure 13: Grouping connectivity causes into more general cause.
Modeling the problem this way may lead to the same recommendations as in figure
12, but we have one fewer case to assess, even though we’ve added a node.
However, if we move away from deterministic assessments, there are a couple of
subtle differences between these two different models.
In figure 12, we assess the link probabilities from each of the causes to the symptom
Print from command prompt? We assess the probability that each cause results in
the symptom. We can put a shade of gray in here. If, for example, printing from a
command prompt doesn’t definitively identify a faulty printer, we can make the link
probability for that case less than 100% abnormal, say 95% abnormal. Perhaps the
printer is still causing the problem of not being able to print, but in some cases it can
be abnormal and the user can still print from a command prompt.
These shades of gray can be very useful for making recommendations come up in
different orders. By removing the certainty that a faulty printer always causes you to be
unable to print from a command prompt, we can assign some degree of uncertainty to
our test.
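The effect of such a shade of gray can be worked through with Bayes' rule. The Python sketch below is a deliberately simplified, single-cause version of the situation: it ignores the other causes in the model, and the 10% prior for Bad printer is a made-up value used only for illustration.

```python
def posterior_cause_abnormal(prior_abnormal, link_abnormal):
    """P(cause abnormal | test observed Normal), for one cause and a leak
    state assessed 100% normal."""
    p_normal_given_bad = 1.0 - link_abnormal
    p_normal_given_good = 1.0
    numerator = p_normal_given_bad * prior_abnormal
    evidence = numerator + p_normal_given_good * (1.0 - prior_abnormal)
    return numerator / evidence

prior = 0.10  # hypothetical prior for Bad printer

deterministic = posterior_cause_abnormal(prior, 1.00)
shaded = posterior_cause_abnormal(prior, 0.95)

print(f"100% link: {deterministic:.4f}")
print(f" 95% link: {shaded:.4f}")
```

With a 100% link, a Normal test result rules the cause out completely. With a 95% link, a small probability remains, so the cause can still be recommended later in the session.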
Let’s take a look at the recommendations made by the MSBN for each of the different
cases. For each of these, I’ve added two other cause nodes linked directly to the
problem. (Without adding other cause nodes, the test won’t come up.)
Case 1:
Figure 12, deterministically assessed. If we assess all of the link probabilities
deterministically (meaning each of the child nodes is assessed as 100% normal in the
leak state and 100% abnormal for all other cases), we get the following two
progressions (the state following each question is the one I give during evaluation):

1. Print From Command Prompt? Normal.
2. Other Cause 1? Normal.
3. Other Cause 2? Normal.
-End-

1. Print From Command Prompt? Abnormal.
2. Bad Cable? Normal.
3. Bad Switching Device? Normal.
4. Bad Printer? Normal.
5. Bad Printer Port? Normal.
-Error (Impossible pseudo node: contradictory evidence)-

Case 2:
Figure 13, deterministically assessed. The result of case 2 is identical to case 1. I won’t
repeat it here.
Case 3:
Figure 12, with Bad printer Abnormal case assessed 95% abnormal in Print from
command prompt?. By changing the probability of this one case, the
recommendations go as follows:

1. Print From Command Prompt? Normal.
2. Other Cause 1? Normal.
3. Other Cause 2? Normal.
4. Bad Printer? Normal.
-End-

1. Print From Command Prompt? Abnormal.
2. Bad Cable? Normal.
3. Bad Switching Device? Normal.
4. Bad Printer? Normal.
5. Bad Printer Port? Normal.
-Error (Impossible pseudo node: contradictory evidence)-

Case 4:
Figure 13, with Bad printer Abnormal case assessed 95% abnormal in No
connectivity. In this case, the recommendations are identical to Case 1 and Case 2.
The reason? By grouping all the causes into a single Other node, and then
deterministically assessing this node with a child test node, we give the troubleshooter
evidence that either this node and all of its parents are normal, or this node and one
of its parents are abnormal. Even if one of its parents (in this case, Bad printer) could
be abnormal when No connectivity is normal, none of its parents will be considered
once No connectivity is observed to be normal.
Group, or not?
So the decision becomes whether this behavior is appropriate or not. It may be
appropriate for some problems but not others. For example, if the problem is that you
can’t print text, then we may not need to consider other ways the printer could be at
fault. But if the printer has some problem printing graphics, it may very well be
appropriate to link the cause directly to the problem, instead of going through the
grouped cause.
In figure 14, the Bad printer cause applies directly to the Can’t print graphics
problem, but indirectly to the Can’t print text problem, via the No connectivity
grouping node.

Figure 14: Some causes apply directly to one problem, and indirectly to another.
If we use the same assessments we did for Case 4, above, the behavior of Can’t print