Notes for CS3310
Part 7: Reasoning under uncertainty
Prof. Neil C. Rowe
Naval Postgraduate School
Version of January 2006
Rules with probabilities
Probability arguments in predicates indicate their degree of
truth on a scale of 0 to 1.
a(0.7) :- b.
("a is true with probability 0.7 if b is true."
"a is true 70% of the time that b is true.")
(0.7 is a "rule strength probability", or conditional probability
of "a" given "b"; mathematicians call this p(a|b).)
a(P) :- b(P).
("a is true with probability P if b is too.")
(P in b(P) is an "evidence probability".)
a(0.7) :- b(P), P > 0.5.
("a is true with probability 0.7 if b is
more than 50% certain.")
(0.5 is an "inference probability threshold".)
a(P) :- b(P2), P is P2*0.5.
("a is true half the time when b is true.")
(0.5 is another form of rule strength probability.)
a(P) :- b(P2), c(P3), P is P2*P3.
("a is true with probability
the product of the probabilities of b and c.")
(P is the "andcombine" of P2 and P3;
the probability of "a"
decreases with a decrease in either "b" or "c".)
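The and-combine in the last rule above is just a product. A minimal sketch in Python rather than the course's Prolog (the function name is mine):

```python
# Sketch of and-combining evidence probabilities, as in the rule
# a(P) :- b(P2), c(P3), P is P2*P3 (independence assumed).
def andcombine_independent(probs):
    """Multiply evidence probabilities together."""
    result = 1.0
    for p in probs:
        result *= p
    return result

p_a = andcombine_independent([0.8, 0.5])  # evidence b = 0.8, c = 0.5 gives 0.4
```

Note that adding more pieces of uncertain evidence can only lower the result, matching the remark above.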
Probability examples for car repair
Many useful expert
system rules in the form of
"cause if effect” require probabilities.
Probabilities are always the last argument.
You wouldn't use all of the following in an expert
system, just the one or two most correct.
More probability examples for car repair
battery(dead,P) :- electrical_problem(P2), P is P2*0.5.
battery(dead,P) :- age(battery,old,P2), P is P2*0.1.
battery(dead,P) :- electrical_problem(P2), age(battery,old,P3), P is P2*P3*0.57.
A fourth rule computes a weighted sum of three evidence probabilities P2, P3, and P4:
L is (0.5*P2)+(0.1*P3)+(0.6*P4)
(L in the last rule is a "likelihood", not a probability, since it can
be more than 1.0. This last rule is like an artificial neuron.)
Combining rule strengths with evidence probabilities
But suppose electrical_problem is itself uncertain.
Maybe the lights don't work and the radio won't
play in a car, but there could be other causes.
Let's say the probability of an electrical problem is
0.7. How do we combine that with 0.6? It should
decrease the overall probability somehow.
Let "H" be "battery dead" (i.e., the "hypothesis"),
and "E" be "electrical problem" (the "evidence").
Then from probability theory:
p ( H ∧ E ) = p ( H | E ) p ( E )
(“The probability of H and E is the product of the
probability of H given E and the probability of E”)
Combining probabilities, cont.
The left side is what we want to know, the
probability that both the battery is dead and that
we have an electrical problem.
The first thing on the right side is the conditional
probability of H given E, or 0.6 in the example.
The last thing is the a priori probability of E, or
0.7 in the example.
So the answer is 0.42. And in general:
battery(dead,P) :- electrical_problem(P2), P is P2*0.6.
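The arithmetic above can be checked in a few lines (Python used as a calculator; the variable names are mine):

```python
p_H_given_E = 0.6   # rule strength: battery dead given an electrical problem
p_E = 0.7           # evidence probability of the electrical problem
p_H_and_E = p_H_given_E * p_E   # p(H ∧ E) = p(H|E) * p(E) = 0.42
```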
Combining disjunctive evidence
battery(dead,P) :- notworking(radio,P2), P is P2*0.3.
battery(dead,P) :- notworking(radio,P2), notworking(lights,P3),
  andcombine([P2,P3],P4), P is P4*0.8.
battery(dead,P) :- notworking(radio,P2), notworking(lights,1.0), P is P2*0.4.
almost_simultaneous(T1,T2,P) :- P is 1/(1+((T1-T2)*(T1-T2))).
Notes on the previous page
The rule with T1 and T2 is a "fuzzy" routine; it computes a
probability that two times (measured in seconds) are close
enough together to be "almost simultaneous" for an application.
bagof collects the probabilities of rules that succeed in
concluding that the fuse is in some state. It's built into
Prolog. Its 3 arguments are a variable, a predicate
expression containing the variable, and the list of possible
bindings of that variable.
orcombine([P],P).
orcombine([P1|L],P) :- orcombine(L,P2), P is P1+P2-(P1*P2).
andcombine([P],P).
andcombine([P1|L],P) :- andcombine(L,P2), P is P1*P2.
Examples of Prolog’s built-in bagof predicate
Given the database:
(“Make a list of all X such that
(“Make a list of all X,Y pairs
for which f(X,Y) succeeds.”)
(“Bind Y to something and
make a list of the X values
for which f(X,Y) succeeds.”)
(“Find all X for which there
exists a Y such that f(X,Y)
succeeds.”)
Zero probabilities are not the same as negations
Suppose we want to add probabilities and a rule
strength of 0.8 to the rule:
a if b and not c.
What if "c" has a probability of 0.001? Does that
count as "not"?
If evidence has probabilities, we should avoid
negating it; just invert the associated probability.
Assuming "b" and "c" are independent, the example becomes:
a(P) if b(P2) and c(P3) and P is P2*(1-P3)*0.8.
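A sketch of that inverted-probability idea in Python (function and parameter names are mine; independence assumed as above):

```python
def a_prob(p_b, p_c, strength=0.8):
    """p(a) = strength * p(b) * (1 - p(c)): invert p(c) instead of negating c."""
    return strength * p_b * (1.0 - p_c)

# c is nearly false (0.001), so "not c" is nearly certain:
p = a_prob(1.0, 0.001)   # 0.7992, close to the full rule strength of 0.8
```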
Practice in writing probability rules
(Use predicates of 2 arguments.)
1. "If it's definitely foggy today, it will be foggy tomorrow
with probability 0.75”
2. "If it's definitely humid and not unusually warm, it will
be foggy tomorrow with probability 0.9".
3. Rewrite (1) assuming fogginess today has a degree of
likelihood (like if you're indoors).
Practice in writing probability rules, cont.
4. Rewrite (2) assuming that being humid and being
unusually warm have degrees of likelihood and are independent.
5. What is the probability that it will be foggy
tomorrow, assuming that it is certainly foggy
today, it is humid with probability 0.5, and it is
unusually warm with probability 0.2? Assume all
probabilities are independent.
Probabilities from statistics on a population
Suppose 16 cases (cars) appear in the repair shop today.
(Diagram of the 16 cases: 5 are B, 8 are S, and 4 are both; case 9 is one of the B cases.)
B = battery is dead, S = car won’t start
Notice: p ( B ∨ S ) = p ( B ) + p ( S ) - p ( B ∧ S ),
since 9/16 = 5/16 + 8/16 - 4/16.
But the formula holds for any B and S.
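The inclusion-exclusion check can be verified exactly with rational arithmetic (the 4/16 overlap is inferred from the 9/16 = 5/16 + 8/16 arithmetic):

```python
from fractions import Fraction

p_B = Fraction(5, 16)        # battery dead
p_S = Fraction(8, 16)        # car won't start
p_B_and_S = Fraction(4, 16)  # both, inferred from 9/16 = 5/16 + 8/16 - x
p_B_or_S = p_B + p_S - p_B_and_S   # inclusion-exclusion
```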
Classic probability combination formulae
Given probabilities "p" and "q" to combine.
These formulae are commutative and associative.
With three to combine, apply the formula to any two,
then combine that result with a third; etc.
Independence means the presence of one event does not
change the probability of the other event. Conservative
and liberal assumptions are appropriate if an event implies
the presence or absence of another.
Exercise with the combination methods
Given three pieces of evidence supporting the same
conclusion with probabilities 0.8, 0.6, and 0.5 each.
(Note: the conservative "and" always gives the smallest value and the
liberal "and" the largest, with independence in between; similarly for "or".)
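A sketch of the three combination methods as I read them (the "or" forms match the worked dead-battery example later in these notes; the "and" forms are the standard probability bounds; all function names are mine), applied to the exercise values:

```python
from functools import reduce

def orcombine(p, q, method="independence"):
    """Combine two supporting probabilities for the same conclusion."""
    if method == "conservative":
        return max(p, q)                 # smallest defensible value
    if method == "liberal":
        return min(1.0, p + q)           # largest defensible value
    return 1.0 - (1.0 - p) * (1.0 - q)   # independence assumption

def andcombine(p, q, method="independence"):
    """Combine two probabilities that must both hold."""
    if method == "conservative":
        return max(0.0, p + q - 1.0)
    if method == "liberal":
        return min(p, q)
    return p * q

evidence = [0.8, 0.6, 0.5]   # the three pieces of evidence in the exercise
for m in ("conservative", "liberal", "independence"):
    print(m, reduce(lambda a, b: orcombine(a, b, m), evidence))
```

Because the formulae are commutative and associative, the `reduce` order does not matter.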
A "fuzzy" input means that some input is numeric instead of true/false. We must
usually convert it to a probability.
Examples: speed of a missile S, for threat assessment; a patient's
temperature T, for a medical expert system.
We could compute f(T) = | T - 98.6 |, and the larger this number is,
the sicker the patient is. Problem: this can be > 1, so isn't a
probability, and can't be orcombined or andcombined.
We could compute g(T) = 1 - (1/(1+((T-98.6)*(T-98.6)))). This will
be 0 when T = 98.6, and approaches 1 if T is very high or very
low. But the steepness of the curve is not adjustable.
We could compute h(T) = 1 - (1/(1+(((T-98.6)/K)*((T-98.6)/K)))), and K
can be adjusted.
We could compute i(T) = 1 - exp(-(((T-98.6)/K)*((T-98.6)/K))), where
"exp(x)" means e to the x power. This uses the normal distribution
(hence has sound theory behind it), and is adjustable.
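The adjustable mappings can be sketched as follows (K = 2 degrees is an assumed sample value, not from the notes):

```python
import math

NORMAL = 98.6   # normal body temperature in Fahrenheit

def h(T, K=2.0):
    """1 - 1/(1 + ((T-98.6)/K)^2): 0 at normal temperature, toward 1 far away."""
    d = (T - NORMAL) / K
    return 1.0 - 1.0 / (1.0 + d * d)

def i(T, K=2.0):
    """1 - exp(-((T-98.6)/K)^2): the normal-distribution-shaped version."""
    d = (T - NORMAL) / K
    return 1.0 - math.exp(-d * d)
```

Both stay within [0, 1], so unlike |T - 98.6| they can be orcombined or andcombined with other probabilities.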
There are also ways to handle fuzziness without converting to a
probability, such as fuzzy set theory.
Bayes' Rule for uncertainty handling
Let H be some hypotheses or conclusion; let E be
some collection of evidence. The laws of
probability give the following theorem:
p ( H | E ) = p ( E | H ) p ( H ) / p ( E )
(p(H|E) means the probability of H given E.) This
allows us to reason "backwards" from evidence to
causes (or "hypotheses"); the real world moves
from causes to evidence.
If E = E1 ∧ E2 ∧ ..., the needed probabilities are
harder to get. But then we may be able to assume
independence of some of the factors, and multiply
them. This idea is used in "Bayesian networks"
which illustrate what factors affect which others.
Examples of Bayes' Rule to get rule strengths
E1 (evidence 1) is car won't start; E2 is radio won't play; E3 is
headlights don't shine; E = E1 ∧ E2 ∧ E3; H (hypothesis) is
battery is dead.
Assume p ( E1 ) = 0.05, p ( E2 ) = 0.04, p ( E3 ) = 0.08, and p(H) =
0.03. Then by Bayes' Rule, assuming dead battery implies all
the evidence: p ( H | E1 ) = p ( E1 | H ) p ( H ) / p ( E1 ) =
1*0.03/0.05 = 0.6; p ( H | E2 ) = 1*0.03/0.04 = 0.75;
p ( H | E3 ) = 1*0.03/0.08 = 0.375.
Now suppose all three pieces of evidence are present:
By conservative assumption: max(0.6 ,0.75) = 0.75,
max(0.75,0.375) = 0.75
By liberal assumption: min(1 ,0.6+0.75) = 1, min(1,1+0.375)=1.
By independence assumption: p ( H | E1 ∧ E2 ∧ E3 ) =
1 - ( 1 - 0.6 ) ( 1 - 0.75 ) ( 1 - 0.375 ) = 0.9375.
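Recomputing that example end to end (Python as a calculator; the dict keys are mine):

```python
p_E = {"E1": 0.05, "E2": 0.04, "E3": 0.08}   # a priori evidence probabilities
p_H = 0.03                                   # a priori probability of dead battery

# Dead battery implies each piece of evidence, so p(Ei|H) = 1, and
# Bayes' Rule gives the rule strength p(H|Ei) = 1 * p(H) / p(Ei):
strengths = {e: 1.0 * p_H / p for e, p in p_E.items()}   # 0.6, 0.75, 0.375

# Independence-assumption orcombine of the three rule strengths:
miss = 1.0
for s in strengths.values():
    miss *= 1.0 - s
p_combined = 1.0 - miss   # 0.9375
```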
Naïve Bayes reasoning
Suppose E1 and E2 are two pieces of evidence for hypothesis H.
p(H | (E1 ∧ E2)) = p((E1 ∧ E2)|H) p(H) / p(E1 ∧ E2)
If we assume E1 and E2 are “conditionally independent” of one
another with respect to H:
p(H | (E1 ∧ E2)) = p(E1|H) p(E2|H) p(H) / p(E1 ∧ E2)
Apply Bayes' Rule twice, and this is equal to:
p(H|E1) p(E1) p(H|E2) p(E2) p(H) / (p(E1 ∧ E2) p(H) p(H))
Also if E1 and E2 are conditionally independent:
p(~H | (E1 ∧ E2)) = p(E1|~H) p(E2|~H) p(~H) / p(E1 ∧ E2)
= p(~H|E1) p(E1) p(~H|E2) p(E2) p(~H)
/ (p(E1 ∧ E2) p(~H) p(~H))
Setting the ratio of left sides equal to the ratio of right sides, the
p(E1), p(E2), and p(E1 ∧ E2) cancel out and we have:
p(H | (E1 ∧ E2)) / p(~H | (E1 ∧ E2)) =
[p(H|E1)/p(~H|E1)] * [p(H|E2)/p(~H|E2)] * [p(~H)/p(H)]
Naïve Bayes reasoning, cont.
Define odds as o(X) = p(X)/p(~X) = p(X)/(1 - p(X)).
Then p(X) = o(X) / (1+o(X)).
Then the equation becomes:
o ( H | (E1 ∧ E2) ) =
[o(H | E1) / o(H)] * [o(H | E2) / o(H)] * o(H)
This is the odds form of “Naïve Bayes”.
With more than two pieces of evidence:
o ( H | (E1 ∧ ... ∧ En) ) =
[o(H | E1) / o(H)] * [o(H | E2) / o(H)] * ...
* [o(H | En) / o(H)] * o(H)
So positive evidence increases the odds and negative evidence decreases them.
To use, convert probabilities to odds; apply the above
formula; convert odds back to probabilities.
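That three-step recipe (convert to odds, apply the formula, convert back) can be sketched as follows (function names are mine):

```python
def odds(p):
    """o(X) = p(X) / (1 - p(X))."""
    return p / (1.0 - p)

def prob(o):
    """Inverse conversion: p(X) = o(X) / (1 + o(X))."""
    return o / (1.0 + o)

def naive_bayes_odds(p_H, p_H_given_each):
    """o(H | E1 ∧ ... ∧ En) = [o(H|E1)/o(H)] * ... * [o(H|En)/o(H)] * o(H)."""
    o_H = odds(p_H)
    result = o_H
    for p in p_H_given_each:
        result *= odds(p) / o_H
    return result
```

With a single piece of evidence the formula reduces to o(H|E1), as it should: `naive_bayes_odds(0.2, [0.6])` gives odds(0.6) = 1.5, i.e. probability 0.6 after converting back.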
Bayesian reasoning for air defense
A ship would like to assign probabilities of
hostility to objects observed on radar.
Factors that can be used: speed, altitude, use
of abrupt turns, whether in an airlane,
source airport, and apparent destination.
Keep statistics on objects observed in some
area of the world, and correlate this to the
eventual identities discovered for the
objects. Use these to derive odds of hostility.
Odds for each factor can be learned from
experience, though an adversary could try to exploit this.
Bayesian reasoning for naval air defense
Stochastic grammar rules to generate behavior
Another way to use probabilities is to use them to
generate behavior. For instance, attach them to rules of
a “context-free grammar” to generate random strings
like random error messages:
Prob. 0.4: msg :- write('Fatal error at '), number, write(': '), fault.
Prob. 0.6: msg :- write('Error in '), number, write(': '), fault.
Prob. 0.5: number :- digit, digit, digit, digit, digit.
Prob. 0.5: number :- digit, digit, digit, digit.
Prob. 0.1: digit :- write(0). (and similarly, probability 0.1 each, for the digits up to 9)
Prob. 0.5: fault :- ...
Prob. 0.3: fault :- ...
Prob. 0.4: fault :- write('File not found').
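A sketch of the same stochastic grammar in Python (only "File not found" survives in the notes, so the other fault message below is an invented placeholder):

```python
import random

def choose(options):
    """Pick one (probability, thunk) pair by cumulative probability and run it."""
    r = random.random()
    total = 0.0
    for p, thunk in options:
        total += p
        if r < total:
            return thunk()
    return options[-1][1]()   # guard against floating-point rounding

def digit():
    return str(random.randint(0, 9))   # each digit with probability 0.1

def number():
    return choose([(0.5, lambda: "".join(digit() for _ in range(5))),
                   (0.5, lambda: "".join(digit() for _ in range(4)))])

def fault():
    return choose([(0.6, lambda: "File not found"),
                   (0.4, lambda: "Divide by zero")])   # second message invented

def msg():
    return choose([(0.4, lambda: "Fatal error at " + number() + ": " + fault()),
                   (0.6, lambda: "Error in " + number() + ": " + fault())])
```

Each call to `msg()` produces one random error message, with the alternatives weighted by their rule probabilities.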
Artificial neural net example
Suppose you want to classify shapes in photographs
Construct something like an inference network (an and-or-not
graph) but where probabilities are
combined, not boolean operations computed: a neural network.
For China Lake photographs, shapes in the image
can be "sky", "dirt", "aircraft", and "person".
Useful evidence is "blueness", "redness", "has
many right angles", and "color uniformity". Good
initial weights must be estimated by intuition.
Example artificial neural network
Equation relating outputs to inputs: each output is a nonlinear
function g of a weighted sum of its inputs, for example
o = g( w1*i1 + w2*i2 + w5*m ), where m = g( w3*i3 + w4*i4 ) is the
output of an intermediate neuron and the i's are the input evidence
values (blueness, redness, number of right angles, color uniformity).
Like inference networks, but probabilities are computed
instead of just "true" or "false". Two alternatives:
(1) The probabilities are expressed digitally. Then
"and” and "or" gates are special
circuits computing the formulae.
(2) The probabilities are analog voltages, and the gates
are analog integrated circuits.
Neural nets can "learn" by adjusting rule
probabilities to improve performance.
However, there are many ways an AI system can learn by
itself besides using a neural net: caching, indexing,
reorganizing, and generalizing. Neural nets are not the
only way to make an AI system learn.
The artificial neuron
The most common way is a device or program that computes:
f = h( w1*g(i1) + w2*g(i2) + ... + wn*g(in) )
The i's are inputs; f is the output probability; and the w
sub i are adjustable constants ("weights"). The g and the h
represent some nonlinear monotonic function, often
1/(1+exp(-x)) or (exp(x)-exp(-x))/(exp(x)+exp(-x)) or x*x/(1+(x*x)),
or 1 minus these (the second is called the "hyperbolic tangent").
Increasing inputs for g should mean increasing outputs, but
as the inputs get large, the increase in f slows down. This
is like neurons in the brain. It helps prevent information overload.
This is also like a liberal orcombine, but with included rule
strengths on each input.
(Figure: an artificial neural network.)
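A sketch of such a neuron (Python; here g defaults to the identity and h to the first nonlinear function listed above, the sigmoid 1/(1+exp(-x)) — one choice among those the notes mention):

```python
import math

def sigmoid(x):
    """1/(1+exp(-x)): nonlinear, monotonic, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, g=lambda x: x, h=sigmoid):
    """f = h(w1*g(i1) + ... + wn*g(in)); g and h are the nonlinear functions."""
    return h(sum(w * g(x) for w, x in zip(weights, inputs)))
```

With all-zero inputs the sigmoid output is exactly 0.5, and larger weighted sums push it monotonically toward 1 while never leaving (0, 1), matching the "liberal orcombine with rule strengths" analogy above.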
More about artificial neural networks
You can have multiple levels (“layers”), with
neuron outputs as inputs of other neurons.
If your “g” function is linear, the artificial neurons in successive
layers collapse into a single linear layer, so the nonlinearity matters.
Inputs can be booleans (represented as 0 or 1),
weighted and combined just like probabilities.
You have one output neuron for every final conclusion.
Output of a neuron can be compared to a
threshold; then you get a boolean, and can use
logical reasoning from then on.
Most neural networks are multilayer.
"Backward propagation" or "backpropagation" is the
most popular way that these networks learn.
It works by estimating the partial derivative with
respect to each weight of an incorrect output
value, then uses that to determine how much to adjust each weight.
The weight connecting concept i at layer j to concept
k at layer j+1 is the same as the weight connecting
concept k back to i.
This is much like assuming p(A|B) = p(B|A), rarely
true. Nonetheless, it often works!
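The gradient-adjustment idea can be sketched for the one-neuron special case (the delta rule; the learning rate and iteration count below are arbitrary choices of mine, and full backpropagation extends this through multiple layers):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(weights, inputs, target, rate=0.5):
    """Move each weight down the squared-error gradient for one example."""
    out = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
    delta = (out - target) * out * (1.0 - out)   # d(error)/d(weighted sum)
    return [w - rate * delta * x for w, x in zip(weights, inputs)]

w = [0.0, 0.0]
for _ in range(200):
    w = train_step(w, [1.0, 1.0], 1.0)   # output drifts toward the target 1
```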
Short practice questions on list processing
What is the first answer of a Prolog
interpreter to each of the following
queries, assuming the definitions of list-
processing predicates given in the notes.
Review on rule-cycle hybrid chaining
b, u, not(t).
Practice question on probabilities
1. 10 times out of 50, when a patient complains of
chest pains, it's a heart problem.
2. 35 times out of 50, when a patient complains of
chest pains, it's indigestion.
3. 6 times out of 10, if a patient has a family
history of heart problems, then they have a heart problem.
4. Tom complains of chest pains and has a family
history of heart problems.
5. 1 of 10 patients has a heart problem.
Practice question on probabilities, cont.
(a) Write the above information as rules and facts.
(b) What is the probability that Tom has a heart
problem, using the independence assumption for combining evidence?
(c) Write a rule for the probability someone has a
heart problem given they both have chest pains
and a family history; again assume independence.
(d) Rewrite the last rule assuming chest pains and
family history have a degree of uncertainty and
they are independent of one another.
Practice question on probabilities, cont. (2)
(e) Using that last rule, what is the probability that
Tom has a heart problem if he thinks with 0.7
probability he has chest pains and experts would
say with 0.8 probability that he has a family
history of heart problems?
(f) Using the Naïve Bayes odds formula for the rules
and facts without evidence uncertainty, what is the
probability that Tom has a heart problem?
(g) Using a perceptron neuron with equal weights of
1 on the two factors of chest pains and family
history, no input nonlinear function, and
x*x/(1+(x*x)) as the nonlinear output function,
what is the probability that Tom has a heart
problem with the evidence probabilities in (e)?
Predicate-defining practice questions
1. Define a Prolog function predicate
that returns the third item from the
end of a list.
2. Define in Prolog a function predicate
that returns a
list of all the pairs of items that occur in a list. The first
item in each pair must have occurred before the second
item of the pair in the original list argument.
3. A Prolog rule to find the middle N items (that is, N items
exactly in the center) of a list L is:
length(M,N), length(L2,Q), length(L3,Q).
length(M,N), length(L2,Q), length(L3,Q).
length(M,Q), length(L2,Q), length(L3,N).
length(M,Q), length(L2,Q), length(L3,N).
Another probability practice question
Suppose that previous experience says that:
There's a 50% chance of a bug in 100 lines of Ada;
There's a 60% chance of a bug in 100 lines of C++;
Tom has a 30% chance of finding a bug in 100 lines
if there is one;
Dick has a 40% chance of finding a bug in 100 lines
if there is one.
Suppose a program contains 100 lines of C++ code
and 100 lines of Ada code. Tom debugs the C++
and Dick debugs the Ada. What is the probability
a bug will be found, assuming probabilities are
independent? Show your arithmetic.