Notes for CS3310
Artificial Intelligence
Part 7: Reasoning under uncertainty
Prof. Neil C. Rowe
Naval Postgraduate School
Version of January 2006
Rules with probabilities
Probability arguments in predicates indicate their degree of
truth on a scale of 0 to 1.
a(0.7) :- b(1.0).
("a is true with probability 0.7 if b is true."
or
"a is true 70% of the time that b is true.")
(0.7 is a "rule strength probability", or the conditional probability of
"a" given "b"; mathematicians call this p(a|b).)
a(P) :- b(P).
("a is true with probability P if b is too.")
(P in b(P) is an "evidence probability".)
a(0.7) :- b(P), P > 0.5.
("a is true with probability 0.7 if b is
more than 50% certain.")
(0.5 is an "inference probability threshold".)
a(P) :- b(P2), P is P2*0.5.
("a is true half the time when b is
true.")
(0.5 is another form of rule strength probability.)
a(P) :- b(P2), c(P3), P is P2*P3.
("a is true with probability
the product of the probabilities of b and c.")
(P is the "andcombine" of P2 and P3: the probability of "a"
decreases with a decrease in either "b" or "c".)
Probability examples for car repair
• Many useful expert-system rules in the form of
"cause if effect" require probabilities.
• Probabilities are always the last argument.
• You wouldn't use all of the following in an expert
system, just the one or two most correct.
battery(dead,0.6) :- ignition(wont_start,1.0).
battery(dead,P) :- voltmeter(battery_terminals,abnormal,P).
battery(dead,P) :- voltmeter(battery_terminals,abnormal,P),
    P > 0.1.
More probability examples for car repair
battery(dead,P) :- electrical_problem(P2),
    P is P2*0.5.
battery(dead,P) :- age(battery,old,P2), P is P2*0.1.
battery(dead,P) :- electrical_problem(P2),
    age(battery,old,P3), P is P2*P3*0.57.
battery(dead,L) :- electrical_problem(P2),
    age(battery,old,P3), problem_yesterday(P4),
    expensive(battery,P5),
    L is (0.5*P2)+(0.1*P3)+(0.6*P4)-(0.1*P5).
L in the last rule is a "likelihood", not a probability: it can
be more than 1.0. This last rule is like an artificial
neuron.
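The likelihood rule can be tried out with a few evidence facts. This is a minimal sketch assuming SWI-Prolog; the four evidence values are hypothetical, chosen only to show that the weighted sum can exceed 1.0, and the expensive-battery term is assumed to be subtracted.

```prolog
% Hypothetical evidence probabilities (not from the notes).
electrical_problem(1.0).
age(battery, old, 1.0).
problem_yesterday(1.0).
expensive(battery, 0.0).

% Weighted sum of evidence: a likelihood, not a probability.
battery(dead, L) :-
    electrical_problem(P2),
    age(battery, old, P3),
    problem_yesterday(P4),
    expensive(battery, P5),
    L is (0.5*P2) + (0.1*P3) + (0.6*P4) - (0.1*P5).
```

The query ?- battery(dead, L). then gives a value of about 1.2, which is why L cannot be passed to andcombine or orcombine as if it were a probability.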
Combining rule strengths with evidence probabilities
Given:
battery(dead,0.6) :- electrical_problem.
But suppose electrical_problem is itself uncertain.
Maybe the lights don't work and the radio won't
play in a car, but there could be other causes.
Let's say the probability of an electrical problem is
0.7. How do we combine that with 0.6? It should
decrease the overall probability somehow.
Let "H" be "battery dead" (i.e., the "hypothesis"),
and "E" be "electrical problem" (the "evidence").
Then from probability theory:
p(H ∧ E) = p(H | E) p(E)
("The probability of H and E is the product of the
probability of H given E and the probability of E.")
Combining probabilities, cont.
The left side is what we want to know: the
probability that both the battery is dead and
we have an electrical problem.
• The first thing on the right side is the conditional
probability of H given E, or 0.6 in the example.
• The last thing is the a priori probability of E, or
0.7 in the example.
So the answer is 0.42. And in general:
battery(dead,P) :- electrical_problem(P2),
    P is P2*0.6.
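As a quick check of the arithmetic, here is a minimal runnable sketch, assuming SWI-Prolog, with the evidence probability 0.7 from the example stored as a fact.

```prolog
% Evidence probability from the example: p(E) = 0.7.
electrical_problem(0.7).

% Rule strength p(H | E) = 0.6 scales the evidence probability:
% p(H and E) = p(H | E) * p(E).
battery(dead, P) :-
    electrical_problem(P2),
    P is P2 * 0.6.
```

The query ?- battery(dead, P). binds P to approximately 0.42, matching the hand calculation.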
Combining disjunctive evidence
total_fuse(State,P) :- bagof(X,fuse(State,X),XL),
    orcombine(XL,P).
fuse(blown,P) :- cord(frayed,P2), notworking(1.0),
    P is P2*0.3.
fuse(blown,P) :- sound(pop,T,P2),
    event(plug_in_device,T2),
    almost_simultaneous(T,T2,P3),
    andcombine([P2,P3],P4), P is P4*0.8.
cord(break_in_cord,P) :- cord(frayed,P2),
    notworking(1.0), P is P2*0.4.
almost_simultaneous(T1,T2,P) :-
    P is 1/(1+((T1-T2)*(T1-T2))).
Notes on the previous page
Almost_simultaneous is a "fuzzy" routine; it computes a
probability that two times (measured in seconds) are close
enough together to be "almost simultaneous" for an
average person.
Bagof collects the probabilities of rules that succeed in
concluding that the fuse is in some state. It's built in to
Prolog. Its 3 arguments are a variable, a predicate
expression containing the variable, and the list of
bindings of that variable for which the expression succeeds.
Use:
orcombine([P],P).
orcombine([P1|L],P) :- orcombine(L,P2),
    P is P1+P2-(P1*P2).
andcombine([P],P).
andcombine([P1|L],P) :- andcombine(L,P2), P is P1*P2.
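To see bagof and orcombine working together, here is a small self-contained sketch assuming SWI-Prolog. The two fuse facts are hypothetical stand-ins for the conclusions of the two fuse rules; their probabilities are made up for illustration.

```prolog
% Independence-assumption combiners, as defined in the notes.
orcombine([P], P).
orcombine([P1|L], P) :-
    orcombine(L, P2),
    P is P1 + P2 - (P1*P2).

andcombine([P], P).
andcombine([P1|L], P) :-
    andcombine(L, P2),
    P is P1 * P2.

% Hypothetical results of two rules that each conclude "blown".
fuse(blown, 0.24).
fuse(blown, 0.5).

% Collect every probability for State and or-combine them.
total_fuse(State, P) :-
    bagof(X, fuse(State, X), XL),
    orcombine(XL, P).
```

Here ?- total_fuse(blown, P). collects [0.24, 0.5] and gives about 0.62, which is higher than either piece of evidence alone, as expected of an "or".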
Examples of Prolog's built-in "bagof"
Given the database:
f(a,c).
f(a,e).
f(b,c).
f(d,e).
Then:
?- bagof(X,f(X,c),L).
L=[a,b]
("Make a list of all X such that
f(X,c) succeeds.")
?- bagof([X,Y],f(X,Y),L).
L=[[a,c],[a,e],[b,c],[d,e]]
("Make a list of all X,Y pairs
for which f(X,Y) succeeds.")
?- bagof(X,f(X,Y),L).
L=[a,b] Y=c
L=[a,d] Y=e
("Bind Y to something and
make a list of the X values
for which f(X,Y)
succeeds.")
?- bagof(X,Y^f(X,Y),L).
L=[a,a,b,d]
("Find all X for which there
exists a Y such that f(X,Y)
succeeds." In standard Prolog the Y^ goes on the goal, and bagof
keeps duplicates; setof would give [a,b,d].)
Zero probabilities are not the same as negations
Suppose we want to add probabilities and a rule
strength of 0.8 to the rule:
a if b and not c.
What if c has a probability of 0.001? Does that
count as "not"?
If evidence has probabilities, we should avoid
negating it, and instead invert the associated probability.
If b and c are independent, the example becomes:
a(P) :- b(P2), c(P3), P is P2*(1-P3)*0.8.
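A minimal sketch of the inverted-probability rule, assuming SWI-Prolog. The evidence value for b is hypothetical; the 0.001 for c is the one discussed above.

```prolog
b(0.9).      % hypothetical evidence probability for b
c(0.001).    % c is very unlikely, but not impossible

% "a if b and not c" with rule strength 0.8:
% p(not c) is 1 - p(c), and-combined with p(b) under independence.
a(P) :-
    b(P2),
    c(P3),
    P is P2 * (1 - P3) * 0.8.
```

The query ?- a(P). gives roughly 0.719. A plain not(c(_)) would instead fail outright, because a fact c(0.001) exists, which is the trap the slide warns about.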
Practice in writing probability rules
(Use diagnosis and symptom predicates of 2 arguments.)
1. "If it's definitely foggy today, it will be foggy tomorrow
with probability 0.75."
2. "If it's definitely humid and not unusually warm, it will
be foggy tomorrow with probability 0.9."
3. Rewrite (1) assuming fogginess today has a degree of
likelihood (as when you're indoors).
Practice in writing probability rules, cont.
4. Rewrite (2) assuming that being humid and being
unusually warm have degrees of likelihood and
are independent.
5. What is the probability that it will be foggy
tomorrow, assuming that it is certainly foggy
today, it is humid with probability 0.5, and it is
unusually warm with probability 0.2? Assume all
probabilities are independent.
Probabilities from statistics on a population
Suppose 16 cases (cars) appear in the repair shop today.
Case 1 (B,S)   Case 2 (B,S)   Case 3         Case 4
Case 5 (B,S)   Case 6 (B,S)   Case 7         Case 8
Case 9 (B)     Case 10 (S)    Case 11 (S)    Case 12 (S)
Case 13        Case 14        Case 15 (S)    Case 16
B = battery is dead, S = car won't start
Notice: p(B ∨ S) = p(B) + p(S) - p(B ∧ S)
since 9/16 = 5/16 + 8/16 - 4/16.
But the formula holds for any B and S.
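The counts behind those fractions can be reproduced directly from the 16 cases. This is a sketch assuming SWI-Prolog; the predicate names case/2, holds/2, and count/2 are illustrative, not from the notes.

```prolog
% case(Number, Properties): b = battery is dead, s = car won't start.
case(1, [b,s]).  case(2, [b,s]).  case(3, []).    case(4, []).
case(5, [b,s]).  case(6, [b,s]).  case(7, []).    case(8, []).
case(9, [b]).    case(10, [s]).   case(11, [s]).  case(12, [s]).
case(13, []).    case(14, []).    case(15, [s]).  case(16, []).

% holds(Condition, Properties): Condition is b, s, and(X,Y), or or(X,Y).
holds(and(X, Y), Ps) :- !, holds(X, Ps), holds(Y, Ps).
holds(or(X, Y), Ps)  :- !, ( holds(X, Ps) -> true ; holds(Y, Ps) ).
holds(Prop, Ps)      :- memberchk(Prop, Ps).

% count(Condition, N): N of the cases satisfy Condition.
count(Cond, N) :-
    findall(C, (case(C, Ps), holds(Cond, Ps)), Cs),
    length(Cs, N).
```

Querying count(b,N), count(s,N), count(and(b,s),N), and count(or(b,s),N) gives 5, 8, 4, and 9, so 9 = 5 + 8 - 4 as the formula requires.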
Classic probability combination formulae
Given probabilities "p" and "q" to combine.
These formulae are commutative and associative.
With three to combine, apply the formula to any two,
then combine that result with a third; etc.
Independence means the presence of one event does not
change the probability of the other event. Conservative
and liberal assumptions are appropriate if an event implies
the presence or absence of another.
                             Andcombine      Orcombine
Lower bound (conservative)   max(0,p+q-1)    max(p,q)
Independence                 p*q             p+q-(p*q)
Upper bound (liberal)        min(p,q)        min(1,p+q)
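The six formulae can be written as two-input Prolog predicates. This is a sketch assuming SWI-Prolog; the four-argument andcombine/4 and orcombine/4, with the assumption named in the first argument, are hypothetical variants introduced here for illustration.

```prolog
% andcombine(Assumption, P, Q, R): R is the "and" of P and Q.
andcombine(conservative, P, Q, R) :- R is max(0, P + Q - 1).
andcombine(independent,  P, Q, R) :- R is P * Q.
andcombine(liberal,      P, Q, R) :- R is min(P, Q).

% orcombine(Assumption, P, Q, R): R is the "or" of P and Q.
orcombine(conservative, P, Q, R) :- R is max(P, Q).
orcombine(independent,  P, Q, R) :- R is P + Q - (P * Q).
orcombine(liberal,      P, Q, R) :- R is min(1, P + Q).
```

With p = 0.7 and q = 0.5, the "and" results are about 0.2, 0.35, and 0.5, and the "or" results are 0.7, about 0.85, and 1. To combine three values, feed the result of one call in as an input of the next.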
Exercise with the combination methods
Given three pieces of evidence supporting the same
conclusion with probabilities 0.8, 0.6, and 0.5 each.
Andcombine:
conservative:
independence:
liberal:
Orcombine:
conservative:
independence:
liberal:
Note always: conservative "and" ≤ independence "and"
≤ liberal "and" ≤ conservative "or"
≤ independence "or" ≤ liberal "or"
Fuzziness
It means that some input is numeric instead of true/false. We must
usually convert it to a probability.
Examples: speed of a missile S, for threat assessment; a patient's
temperature T, for a medical expert system.
We could compute f(T) = |T - 98.6|, and the larger this number is,
the sicker the patient is. Problem: this can be > 1, so it isn't a
probability, and can't be orcombined or andcombined.
We could compute g(T) = 1 - (1/(1+((T-98.6)(T-98.6)))). This will
be 0 when T = 98.6, and approaches 1 if T is very high or very
low. But the steepness of the curve is not adjustable.
We could compute h(T) = 1 - (1/(1+((T-98.6)(T-98.6)/K))), and K
can be adjusted.
We could compute i(T) = 1 - exp(-(T-98.6)(T-98.6)/K), where
"exp(x)" means e to the x power. This uses the normal distribution
(hence has sound theory behind it), and is adjustable.
There are also ways to handle fuzziness without converting to a
probability, such as fuzzy set theory.
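The three bounded conversions can be sketched in Prolog as follows, assuming SWI-Prolog. The predicate names fuzzy_g, fuzzy_h, and fuzzy_i are illustrative, and any particular choice of K is an assumption that would need tuning for a real application.

```prolog
% g(T): 0 at normal temperature, approaching 1 far from it; fixed steepness.
fuzzy_g(T, P) :-
    D is T - 98.6,
    P is 1 - 1/(1 + D*D).

% h(T): same shape, with adjustable steepness K (larger K = gentler curve).
fuzzy_h(T, K, P) :-
    D is T - 98.6,
    P is 1 - 1/(1 + D*D/K).

% i(T): based on the normal-distribution curve, also adjustable by K.
fuzzy_i(T, K, P) :-
    D is T - 98.6,
    X is D*D/K,
    P is 1 - exp(-X).
```

For example, ?- fuzzy_h(100.6, 4, P). gives about 0.5: a fever two degrees above normal counts as "half sick" when K is 4. All three return 0 at exactly 98.6.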
Bayes' Rule for uncertainty handling
Let H be some hypothesis or conclusion; let E be
some collection of evidence. The laws of
probability give the following theorem:
p(H | E) = p(E | H) p(H) / p(E)
(p(H | E) means the probability of H given E.) This
allows us to reason "backwards" from evidence to
causes (or "hypotheses"); the real world moves
from causes to evidence.
If E = E1 ∧ E2 ∧ ..., the needed probabilities are
harder to get. But then we may be able to assume
independence of some of the factors, and multiply
them. This idea is used in "Bayesian networks"
which illustrate what factors affect which others.
Examples of Bayes' Rule to get rule strengths
E1 (evidence 1) is car won't start; E2 is radio won't play; E3 is
headlights don't shine; E = E1 ∧ E2 ∧ E3; H (hypothesis) is
battery is dead.
Assume p(E1) = 0.05, p(E2) = 0.04, p(E3) = 0.08, and p(H) =
0.03. Then by Bayes' Rule, assuming dead battery implies all
the evidence: p(H | E1) = p(E1 | H) p(H) / p(E1) =
1*0.03/0.05 = 0.6; p(H | E2) = 1*0.03/0.04 = 0.75;
p(H | E3) = 1*0.03/0.08 = 0.375.
Now suppose all three pieces of evidence are present:
• By conservative assumption: max(0.6,0.75) = 0.75,
max(0.75,0.375) = 0.75.
• By liberal assumption: min(1,0.6+0.75) = 1, min(1,1+0.375) = 1.
• By independence assumption: p(H | E1 ∧ E2 ∧ E3) =
1 - (1-0.6)(1-0.75)(1-0.375) = 0.9375.
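The same calculation can be run as a short program. This sketch assumes SWI-Prolog; the predicate names prior/1, evidence/2, strength/2, and combined/1 and the evidence labels are illustrative, while the numbers are the ones given above.

```prolog
prior(0.03).                  % p(H): battery is dead
evidence(wont_start, 0.05).   % p(E1)
evidence(no_radio,   0.04).   % p(E2)
evidence(no_lights,  0.08).   % p(E3)

% Bayes' Rule with p(E | H) = 1 (a dead battery implies the evidence).
strength(E, S) :-
    prior(PH),
    evidence(E, PE),
    S is 1 * PH / PE.

% Independence-assumption orcombine from the notes.
orcombine([P], P).
orcombine([P1|L], P) :-
    orcombine(L, P2),
    P is P1 + P2 - (P1*P2).

% Or-combine the three rule strengths.
combined(P) :-
    findall(S, strength(_, S), Ss),
    orcombine(Ss, P).
```

The query ?- combined(P). gives approximately 0.9375, and ?- strength(no_radio, S). gives about 0.75.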
Naïve Bayes reasoning
Suppose E1 and E2 are two pieces of evidence for hypothesis H.
Then by Bayes' Rule:
p(H | (E1 ∧ E2)) = p((E1 ∧ E2) | H) p(H) / p(E1 ∧ E2)
If we assume E1 and E2 are "conditionally independent" of one
another with respect to H:
p(H | (E1 ∧ E2)) = p(E1 | H) p(E2 | H) p(H) / p(E1 ∧ E2)
Use Bayes' Rule twice, and this is equal to:
p(H | E1) p(E1) p(H | E2) p(E2) p(H) / (p(E1 ∧ E2) p(H) p(H))
Also if E1 and E2 are conditionally independent with respect to ~H:
p(~H | (E1 ∧ E2)) = p(E1 | ~H) p(E2 | ~H) p(~H) / p(E1 ∧ E2)
= p(~H | E1) p(E1) p(~H | E2) p(E2) p(~H)
/ (p(E1 ∧ E2) p(~H) p(~H))
Setting the ratio of left sides equal to the ratio of right sides, the
p(E1), p(E2), and p(E1 ∧ E2) cancel out and we have:
p(H | (E1 ∧ E2)) / p(~H | (E1 ∧ E2)) =
[p(H | E1) p(~H) / (p(~H | E1) p(H))] *
[p(H | E2) p(~H) / (p(~H | E2) p(H))] *
[p(H) / p(~H)]
Naïve Bayes reasoning, cont.
Define odds as o(X) = p(X)/p(~X) = p(X)/(1 - p(X)).
Then p(X) = o(X) / (1+o(X)).
Then the equation becomes:
o(H | (E1 ∧ E2)) =
[o(H | E1) / o(H)] * [o(H | E2) / o(H)] * o(H)
This is the odds form of "Naïve Bayes inference".
With more than two pieces of evidence:
o(H | (E1 ∧ E2 ∧ ... ∧ En)) =
[o(H | E1) / o(H)] * [o(H | E2) / o(H)] * ...
* [o(H | En) / o(H)] * o(H)
So positive evidence increases odds and negative evidence
decreases odds.
To use, convert probabilities to odds; apply the above
formula; convert odds back to probabilities.
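The three steps (probabilities to odds, multiply, odds back to probability) can be sketched as below, assuming SWI-Prolog. The predicate names odds/2, prob/2, and naive_bayes/3 are illustrative, and the numbers in the usage note are hypothetical.

```prolog
% Conversions between probability and odds.
odds(P, O) :- O is P / (1 - P).
prob(O, P) :- P is O / (1 + O).

% naive_bayes(Prior, Posteriors, P): Prior is p(H); Posteriors is the
% list of p(H | Ei); P is p(H | E1 and ... and En) by the odds formula.
naive_bayes(Prior, Posteriors, P) :-
    odds(Prior, O0),
    odds_product(Posteriors, O0, O0, O),
    prob(O, P).

% Multiply the running odds by o(H | Ei) / o(H) for each Ei.
odds_product([], _, O, O).
odds_product([Pi|Ps], O0, Acc, O) :-
    odds(Pi, Oi),
    Acc1 is Acc * Oi / O0,
    odds_product(Ps, O0, Acc1, O).
```

With a prior of 0.2 and two pieces of evidence giving p(H | E1) = 0.5 and p(H | E2) = 0.4, ?- naive_bayes(0.2, [0.5, 0.4], P). yields about 0.727. Both pieces of evidence are "positive" (each raises the odds above the prior odds of 0.25), so the result is higher than either alone. Probabilities of exactly 0 or 1 would need special handling because they lead to division by zero.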
Bayesian reasoning for air defense
A ship would like to assign probabilities of
hostility to objects observed on radar.
Factors that can be used: speed, altitude, use
of abrupt turns, whether in an airlane,
source airport, and apparent destination.
Keep statistics on objects observed in some
area of the world, and correlate this to the
eventual identities discovered for the
objects. Use these to derive odds of
hostility.
Odds for each factor can be learned from
experience, though an adversary could try to
fool you.
Bayesian reasoning for naval air defense
Stochastic grammar rules to generate behavior
Another way to use probabilities is to use them to
generate behavior. For instance, attach them to rules of
a "context-free grammar" to generate random strings,
like random error messages:
Prob. 0.4: msg :- write('Fatal error at '), number, write(': '), fault.
Prob. 0.6: msg :- write('Error in '), number, write(': '), fault.
Prob. 0.5: number :- digit, digit, digit, digit, digit.
Prob. 0.5: number :- digit, digit, digit, digit.
Prob. 0.1: digit :- write('0').
Prob. 0.1: digit :- write('1').
….
Prob. 0.5: fault :- write('Segmentation fault').
Prob. 0.3: fault :- write('Protection violation').
Prob. 0.4: fault :- write('File not found').
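One way to make such a grammar executable is to store each rule as a fact and choose among alternatives with a random number. This is only a sketch, assuming SWI-Prolog and its library(random); the rule/3 representation and the predicates expand/2, pick/3, and generate/1 are hypothetical, and the ten digit rules are replaced by a uniform random choice of 0 to 9.

```prolog
:- use_module(library(random)).

% rule(Nonterminal, Probability, Body): Body lists terminals and nonterminals.
rule(msg,    0.4, ['Fatal error at ', number, ': ', fault]).
rule(msg,    0.6, ['Error in ', number, ': ', fault]).
rule(number, 0.5, [digit, digit, digit, digit, digit]).
rule(number, 0.5, [digit, digit, digit, digit]).
rule(fault,  0.5, ['Segmentation fault']).
rule(fault,  0.3, ['Protection violation']).
rule(fault,  0.4, ['File not found']).

% expand(Symbol, Atoms): randomly expand Symbol into a list of terminal atoms.
expand(digit, [D]) :- !,
    random_between(0, 9, N),       % ten equally likely digits
    atom_number(D, N).
expand(Symbol, Atoms) :-
    findall(P-Body, rule(Symbol, P, Body), Choices),
    Choices \== [], !,
    random(R),
    pick(Choices, R, Body),
    maplist(expand, Body, Parts),
    append(Parts, Atoms).
expand(Terminal, [Terminal]).      % anything without rules is a terminal

% pick(Choices, R, Body): walk the cumulative probabilities; the last
% alternative takes whatever probability mass remains.
pick([_-Body], _, Body) :- !.
pick([P-Body|_], R, Body) :- R < P, !.
pick([P-_|Rest], R, Body) :- R1 is R - P, pick(Rest, R1, Body).

generate(Message) :-
    expand(msg, Atoms),
    atomic_list_concat(Atoms, Message).
```

Each call to ?- generate(M). binds M to a different random message beginning with either 'Fatal error at ' or 'Error in '. Because the last alternative absorbs the remaining mass, the listed weights of the final rule in each group matter only when the earlier ones sum to less than 1.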
Artificial neural net example
Suppose you want to classify shapes in photographs.
Construct something like an inference network (and-or-not
graph) but where probabilities are
combined, not boolean operations computed: a
neural network.
For China Lake photographs, shapes in the image
can be "sky", "dirt", "aircraft", and "person".
Useful evidence is "blueness", "redness", "has
many right angles", and "color uniformity". Good
initial weights must be estimated by intuition.
Example artificial neural network
[Figure: network with inputs blueness, redness, # right angles, and uniformity; intermediate node manmade-ness; outputs sky, dirt, aircraft, and person.]
Equation relating outputs to inputs:
$$o_k = g(w_{k1} i_1 + w_{k2} i_2 + w_{k5} m), \quad m = g(w_3 i_3 + w_4 i_4), \quad k = 1,2,3,4$$
Neural nets
Like inference networks, but probabilities are computed
instead of just "true" or "false". Two alternatives:
(1) The probabilities are expressed digitally. Then
"and" and "or" gates are special-purpose integrated
circuits computing the formulae.
(2) The probabilities are analog voltages, and the gates
are analog integrated circuits.
Neural nets can "learn" by adjusting rule-strength
probabilities to improve performance.
However, there are many ways an AI system can learn by
itself besides using a neural net: caching, indexing,
reorganizing, and generalizing. Neural nets are not the
only way to make an AI system learn.
The artificial neuron
The most common way is a device or program that computes:
$$f(i_1, i_2, i_3, \ldots) = g(w_1 h(i_1) + w_2 h(i_2) + w_3 h(i_3) + \cdots)$$
The $i_1, i_2, \ldots$ are inputs; f is the output probability; and the
$w_i$ are adjustable constants ("weights"). The g and the h
represent some nonlinear monotonic function, often
$x^2/(1+x^2)$ or $(e^x - e^{-x})/(e^x + e^{-x})$, or
1 minus these (the second is called the "hyperbolic tangent
function").
Increasing inputs for g should mean increasing outputs, but
as the inputs get large, the increase in f slows down. This
is like neurons in the brain. It helps prevent information
overload.
This is also like a liberal orcombine, but with included rule
strengths on each input.
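A single artificial neuron of this kind is easy to sketch in Prolog. The version below assumes SWI-Prolog, takes h to be the identity (no input nonlinearity), and uses x*x/(1+x*x) for g; the predicate names and the example weights and inputs are hypothetical.

```prolog
% g(X, Y): squashing function x^2/(1+x^2), which stays below 1.
g(X, Y) :- Y is X*X / (1 + X*X).

% weighted_sum(Weights, Inputs, Sum): Sum is w1*i1 + w2*i2 + ...
weighted_sum([], [], 0).
weighted_sum([W|Ws], [I|Is], S) :-
    weighted_sum(Ws, Is, S0),
    S is S0 + W*I.

% neuron(Weights, Inputs, F): F = g(w1*i1 + w2*i2 + ...).
neuron(Weights, Inputs, F) :-
    weighted_sum(Weights, Inputs, S),
    g(S, F).
```

For instance, ?- neuron([1,1], [0.5,0.5], F). gives F = 0.5. Doubling both inputs to 1.0 raises the output only to 0.8, which illustrates the slowdown for large inputs.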
General 3-input 4-output 2-layer artificial neural network
[Figure: network diagram with Input 1, Input 2, and Input 3.]
More about artificial neural networks
• You can have multiple levels ("layers"), with
neuron outputs as inputs of other neurons.
• If your "g" function is linear, the artificial neurons
are "perceptrons".
• Inputs can be booleans (represented as 0 or 1),
weighted and combined just like probabilities.
• You have one output neuron for every final
conclusion.
• Output of a neuron can be compared to a
threshold; then you get a boolean, and can use
logical reasoning from then on.
Backward propagation
Most neural networks are multilayer.
"Backward propagation" or "backpropagation" is the
most popular way that these networks learn.
It works by estimating the partial derivative with
respect to each weight of an incorrect output
value, then using that to determine how much to
change each weight.
It assumes:
The weight connecting concept i at layer j to concept
k at layer j+1 is the same as the weight connecting
concept k back to i.
This is much like assuming p(A | B) = p(B | A), which is rarely
true. Nonetheless, it often works!
Short practice questions on list processing
What is the first answer of a Prolog
interpreter to each of the following
queries, assuming the definitions of
list-processing predicates given in the
notes?
?- member(foo,[bar,foo,baz]).
?- member('foo',[bar,foo,baz]).
?- member(Foo,[bar,foo,baz]).
?- member('Foo',[bar,foo,baz]).
?- member(foo,[[bar,foo]]).
?- member([X,Y],[[bar,foo]]).
?- delete(bar,[foo,bar,bag,bar,bag],Ans).
?- delete(bar,[[foo,bar],[bag,bar,bag]],Ans).
?- delete(bar,[foo,bar],[foo]).
?- length([foo,bar,baz],Count).
?- length([foo,bar],3).
?- first([foo,bar,baz],First).
?- last([foo,bar,baz],Last).
?- append([foo,bar],[foo,bar],X).
?- append([foo,bar],X,[foo,bar,baz,bag]).
?- append(X,Y,[foo,bar,baz,bag]).
?- append(X,X,[foo,bar,baz,bag]).
?- append([X],[Y],[foo,bar,baz,bag]).
?- append([X|Y],Z,[foo,bar,baz]).
?- append([X,Y],Z,[foo,bar,baz]).
?- append(X,[bar|Y],[foo,bar,baz]).
Review on rule-cycle hybrid chaining
a :- v, t.
a :- b, u, not(t).
m(X) :- n(X), b.
b :- c.
t :- r, s.
u :- v, r.
r.
v.
c.
n(12).
Practice question on probabilities
Given:
1. 10 times out of 50, when a patient complains of
chest pains, it's a heart problem.
2. 35 times out of 50, when a patient complains of
chest pains, it's indigestion.
3. 6 times out of 10, if a patient has a family
history of heart problems, then they have a
heart problem.
4. Tom complains of chest pains and has a family
history of heart problems.
5. 1 in 10 patients has heart problems.
Practice question on probabilities, cont.
(a) Write the above information as rules and facts.
(b) What is the probability that Tom has a heart
problem, using the independence assumption for
combining probabilities?
(c) Write a rule for the probability someone has a
heart problem given they both have chest pains
and a family history; again assume independence.
(d) Rewrite the last rule assuming chest pains and
family history have a degree of uncertainty and
they are independent of one another.
Practice question on probabilities, cont. (2)
(e) Using that last rule, what is the probability that
Tom has a heart problem if he thinks with 0.7
probability he has chest pains and experts would
say with 0.8 probability that he has a family
history of heart problems?
(f) Using the Naïve Bayes odds formula for the rules
and facts without evidence uncertainty, what is the
probability that Tom has a heart problem?
(g) Using a perceptron neuron with equal weights of
1 on the two factors of chest pains and family
history, no input nonlinear function, and
x*x/(1+(x*x)) as the nonlinear output function,
what is the probability that Tom has a heart
problem with the evidence probabilities in (e)?
List-defining practice questions
1. Using append, define a Prolog function predicate
third_last(List,Item) that returns the third item from the
end of a list.
2. Define in Prolog a function predicate allpairs that returns a
list of all the pairs of items that occur in a list. The first
item in each pair must have occurred before the second
item of the pair in the original list argument.
3. A Prolog rule to find the middle N items (that is, N items
exactly in the center) of a list L is:
(i) midn(L,N,M) :- append(L1,L2,L), append(L3,M,L1),
    length(M,N), length(L2,Q), length(L3,Q).
(ii) midn(L,N,M) :- append(L1,L2,L), append(M,L3,L1),
    length(M,N), length(L2,Q), length(L3,Q).
(iii) midn(L,N,M) :- append(L1,L2,L), append(L3,M,L1),
    length(M,Q), length(L2,Q), length(L3,N).
(iv) midn(L,N,M) :- append(L1,L2,L), append(M,L3,L1),
    length(M,Q), length(L2,Q), length(L3,N).
Another probability practice question
Suppose that previous experience says that in
general:
There's a 50% chance of a bug in 100 lines of Ada
code;
There's a 60% chance of a bug in 100 lines of C++
code;
Tom has a 30% chance of finding a bug in 100 lines
if there is one;
Dick has a 40% chance of finding a bug in 100 lines
if there is one.
Suppose a program contains 100 lines of C++ code
and 100 lines of Ada code. Tom debugs the C++
and Dick debugs the Ada. What is the probability
a bug will be found, assuming probabilities are
independent? Show your arithmetic.