Author: Dr. M. Maris, University of Amsterdam


March 2007 (v5)

Bayesian Networks




1 Introduction

Bayesian Networks (BNs) are a methodology to model and simulate the behavior of discrete-event systems under uncertainty. The basic assumption of the Bayes methodology is that the state variables in the system to be modeled and simulated can be represented by probability functions (discrete variables) or probability density functions (continuous variables). The network can draw conclusions and do inference about the future system state using Bayes' theorem. In this introduction, we will only treat discrete BNs.


A-priori knowledge and assumptions of the modeler concerning causal relationships, as well as likelihoods of state variables in the modeled system, can be integrated with observed data (evidence) to compute the a-posteriori degree of belief in a hypothesis. Each observation of an event can incrementally increase or decrease the degree of belief in a hypothesis or a model. Therefore, the Bayes methodology is significantly more flexible than rule-based algorithms, which reject a hypothesis completely when it is inconsistent with observations. New observations can be classified step-by-step by combining weighted probability measures of alternative hypotheses. Therefore, the Bayes methodology can also be used nicely in real-time classification systems.


A BN is created by exploiting causal relationships that exist between phenomena in the world. By connecting the variables in a network, a DAG (Directed Acyclic Graph) should be created. Hence, there are no circular dependencies between nodes.


Applications of BNs are found in fault diagnosis for complex systems with multiple state variables and multiple causal dependencies. Furthermore, they are used for classification of objects under uncertainty using noisy sensor data.


There are two meanings of probability. Probabilities can describe frequencies of outcomes in random experiments, e.g. the probability of heads when tossing a fair coin in repeated trials. They can also be used to describe degrees of belief in propositions that do not necessarily involve random experiments, e.g. the probability that a certain production machine will fail, given the evidence of a poor surface quality of the workpiece.



A Simple Bayesian Network

Between two variables, a causal relation is encoded. For example, in Figure 1, the variable Y causally depends on X.




Figure 1. Causal relation between X and Y.


2 Bayes rule

The Bayes theorem goes back to the seminal work of the English reverend Thomas Bayes in the 18th century on games of chance. The Bayes theorem can be written as:


P(X|Y) = P(Y|X)·P(X) / P(Y)    (1)


Where:

P(X): a-priori probability of a hypothesis X, representing the initial degree of belief;
P(Y|X): likelihood of the data Y given the hypothesis X;
P(X|Y): a-posteriori probability of X given the data in Y, representing the degree of belief in the hypothesis after having observed the data in Y;
P(Y): marginal likelihood of the observations Y.



3 Basic Rules

Notation

P(x) = P(X=true)
P(¬x) = P(X=false)


Conditional Probability Table (CPT)

The causal relation between two variables is encoded in a conditional probability table, or CPT.

State variables

Each variable has a number of states. The most common set for discrete networks is the two states {true, false}. In principle, n states are possible.


Marginal likelihood

The probability distribution for P(X) is known a-priori. The probability distribution for P(Y) can be computed from the network using P(X) and the CPT values. P(Y) is also called the marginal likelihood.


Range

0 ≤ P(X) ≤ 1

Probabilities sum up to 1


Assuming mutually exclusive state variables:

Σ_i P(X_i) = 1, where i denotes the index of the state variables.    (2)


Product Rule

P(Y,X) = P(Y|X)·P(X) = P(X|Y)·P(Y)    (3)


Independence

If X and Y are independent, then:

P(X,Y) = P(X)·P(Y)    (4)


Joint probability

The joint probability is the total distribution of the probabilities in the network, typically computed using the product rule:

P(X_1, …, X_n) = Π_{i=1}^{n} P(X_i | parents(X_i))    (5)


Factorization

A factorization of a joint probability is a list of factors from which one can construct the joint probability. The factors are in fact all the functions P(X_i | parents(X_i)) in the joint probability.

For example, the joint probability P(W,X,Y,Z) can be factorized as a set of conditional independence relations, as:

P(W,X,Y,Z) = P(W)·P(X)·P(Y|W)·P(Z|X,Y)
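A factorization like this can be sketched in a few lines of Python. The CPT numbers below are made up purely for illustration; only the structure P(W,X,Y,Z) = P(W)·P(X)·P(Y|W)·P(Z|X,Y) comes from the text.

```python
from itertools import product

# Hypothetical CPTs for the factorization P(W,X,Y,Z) = P(W).P(X).P(Y|W).P(Z|X,Y).
# All numbers are invented for illustration; True/False are the two states.
P_W = {True: 0.3, False: 0.7}
P_X = {True: 0.6, False: 0.4}
P_Y_given_W = {True: 0.8, False: 0.1}                      # P(Y=true | W)
P_Z_given_XY = {(True, True): 0.9, (True, False): 0.5,
                (False, True): 0.4, (False, False): 0.05}  # P(Z=true | X, Y)

def joint(w, x, y, z):
    """P(W=w, X=x, Y=y, Z=z) as the product of the factors."""
    py = P_Y_given_W[w]
    pz = P_Z_given_XY[(x, y)]
    return P_W[w] * P_X[x] * (py if y else 1 - py) * (pz if z else 1 - pz)

# The joint distribution must sum to 1 over all 2^4 state combinations.
total = sum(joint(w, x, y, z) for w, x, y, z in product([True, False], repeat=4))
print(round(total, 10))  # 1.0
```

Any marginal or conditional probability in the network can then be obtained by summing this joint over the unwanted variables.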



Sum Rule

P(Y) = Σ_i P(Y|X_i)·P(X_i)    (6)


Bayesian rule expanded

Combining (1) and (6), we get:

P(X|Y) = P(Y|X)·P(X) / Σ_i P(Y|X_i)·P(X_i)    (7)
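Equation (7) can be checked with a minimal Python sketch. The prior and likelihood values below are assumptions chosen for illustration (they happen to match the two-node example further on); only the formula itself comes from the text.

```python
# Posterior P(X_i | Y) via equation (7): prior times likelihood, normalized
# by the marginal likelihood P(Y) from the sum rule (6).
prior = {'x': 0.3, 'not_x': 0.7}          # P(X_i), made-up values
likelihood = {'x': 0.2, 'not_x': 0.6}     # P(Y=true | X_i), made-up values

marginal = sum(likelihood[s] * prior[s] for s in prior)              # eq. (6)
posterior = {s: likelihood[s] * prior[s] / marginal for s in prior}  # eq. (7)

print(round(marginal, 3))        # 0.48
print(round(posterior['x'], 3))  # 0.125
```

Note that the posterior always sums to 1 over the states of X, because the denominator is exactly the sum of the numerators.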



Augmented Bayesian network

A Bayesian network in which a node is used that represents the belief about a relative frequency.


Numerical example 1

In Figure 2, a small BN is shown with two nodes. The a-priori probability values are given for P(X), and P(Y) is computed. The CPT is shown on the right.

The joint distribution is:

P(X,Y) = P(X)·P(Y|X)

Note that X does not have parents, so P(X|parents) = P(X).







Figure 2. Example BN with two nodes and CPT (made in Hugin).

Each variable has two states (State1 and State2). We assume that they are {true, false}. For example, the probability of Y=true given X=false is: P(Y|¬X) = 0.6.



Using (6), we compute P(Y):

P(Y=true) = Σ_i P(Y|X_i)·P(X_i) = 0.2·0.3 + 0.6·0.7 = 0.48

P(Y=false) = 1 − P(Y=true) = 0.52
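The same computation in a short Python sketch (the numbers are Figure 2's; the dictionary layout is our own choice):

```python
# Marginal likelihood P(Y) by the sum rule (6), using the values of Figure 2:
# P(X=true) = 0.3 and the CPT column P(Y=true | X).
p_x = {'true': 0.3, 'false': 0.7}
p_y_given_x = {'true': 0.2, 'false': 0.6}  # P(Y=true | X=state)

p_y_true = sum(p_y_given_x[s] * p_x[s] for s in p_x)
print(round(p_y_true, 2))      # 0.48
print(round(1 - p_y_true, 2))  # 0.52
```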



Numerical example 2, encoding beliefs:

Suppose we have the network shown in Figure 3, encoding the relationship between Rain and Wet. Here, the belief is encoded that it will rain. There is one observable symptom, namely Wet. We want to compute the marginal likelihood distribution P(Wet), or P(W).







Figure 3. A simple Bayesian network encoding the causal relationship between being wet and rain. Here, it shows that Wet is an observable symptom of Rain. The CPT is shown on the right. For example, P(Wet | ¬Rain) = 0.4.



P(W) = Σ_i P(W|R_i)·P(R_i)

P(w) = P(W|R)·P(R) + P(W|¬R)·P(¬R)
     = 0.7·0.4 + 0.4·0.6
     = 0.28 + 0.24 = 0.52

and,

P(¬w) = 1 − 0.52 = 0.48

So, the marginal likelihood distribution P(W) = {0.52, 0.48}.


Numerical example 3, two parents:

The network of Figure 3 is modified by adding an additional node Snow as parent to Wet. Again, we want to compute the marginal likelihood distribution P(W).






Figure 4. Simple converging Bayesian network encoding the causal relationship between being wet and rain or snow. Here, it shows that Wet is an observable symptom of Rain and Snow. The CPT is shown on the right. For example, P(Wet | ¬Rain, Snow) = 0.6.


P(W) = Σ_{i,j} P(W | R_i, S_j)·P(R_i)·P(S_j)

P(W|R,S)·P(R)·P(S)     = 0.7·0.4·0.1 = 0.028
P(W|¬R,S)·P(¬R)·P(S)   = 0.6·0.6·0.1 = 0.036
P(W|R,¬S)·P(R)·P(¬S)   = 0.6·0.4·0.9 = 0.216
P(W|¬R,¬S)·P(¬R)·P(¬S) = 0.1·0.6·0.9 = 0.054

Total = 0.334

So, the marginal likelihood distribution P(W) = {0.33, 0.67}.
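The double sum over both parents can be written compactly in Python. The CPT values are those used in the terms above; the code structure is our own sketch.

```python
from itertools import product

# Marginal likelihood P(Wet) with two parents (Figure 4):
# sum over both parents' states of P(W|R,S).P(R).P(S).
p_rain = {True: 0.4, False: 0.6}
p_snow = {True: 0.1, False: 0.9}
p_wet_given = {(True, True): 0.7, (False, True): 0.6,
               (True, False): 0.6, (False, False): 0.1}  # P(Wet=true | R, S)

p_wet = sum(p_wet_given[(r, s)] * p_rain[r] * p_snow[s]
            for r, s in product([True, False], repeat=2))
print(round(p_wet, 3))  # 0.334
```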




Numerical example 4:

This example shows a slightly more complicated network. We want to compute P(z).







Figure 5. Bayesian network with three nodes and CPTs.


For the joint distribution in Figure 5, we get:

P(X,Y,Z) = P(X)·P(Y|X)·P(Z|X,Y)

Now we can compute any marginal probability. For example P(Z=true), or P(z):

P(z) = Σ_{i,j} { P(Z | X_i, Y_j)·P(X_i)·P(Y_j | X_i) }

     = 0.2·0.4·0.1 + 0.4·0.6·0.8 + 0.5·0.4·0.9 + 0.1·0.6·0.2
     = 0.008 + 0.192 + 0.18 + 0.012
     = 0.392
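The same marginalization as a Python sketch, reading the CPT values off the terms above (the variable names and dictionary layout are ours):

```python
from itertools import product

# P(Z=true) in the three-node network of Figure 5, marginalizing over X and Y:
# P(z) = sum_{i,j} P(Z|X_i,Y_j).P(X_i).P(Y_j|X_i).
p_x = {True: 0.4, False: 0.6}
p_y_given_x = {True: 0.1, False: 0.8}                   # P(Y=true | X)
p_z_given_xy = {(True, True): 0.2, (False, True): 0.4,
                (True, False): 0.5, (False, False): 0.1}  # P(Z=true | X, Y)

p_z = sum(p_z_given_xy[(x, y)] * p_x[x]
          * (p_y_given_x[x] if y else 1 - p_y_given_x[x])
          for x, y in product([True, False], repeat=2))
print(round(p_z, 3))  # 0.392
```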



4 Evidence and Inference

Now, let's add evidence to the observable node Wet. This means that it is observed (e.g. by a sensor) that it is indeed wet. In the small BN below, we can use this evidence Wet=true. How is P(Rain) then computed?






Figure 6. Showing evidence in a BN.


From Bayes' rule it follows:

P(w|r)·P(r) = P(r|w)·P(w)
0.7·0.4 = P(r|w)·0.52

So,

P(r|w) = 0.28/0.52 = 0.54
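This inversion of Bayes' rule is a one-liner in Python; the numbers are the ones used above.

```python
# Posterior P(Rain=true | Wet=true) via Bayes' rule, using the values from
# the text: P(w|r) = 0.7, P(r) = 0.4, marginal P(w) = 0.52.
p_w_given_r = 0.7
p_r = 0.4
p_w = 0.52

p_r_given_w = p_w_given_r * p_r / p_w
print(round(p_r_given_w, 2))  # 0.54
```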


Evidence in a serial network

Below, a simple serial network is shown. Let's compute P(z), then add evidence and infer P(w) given this evidence, or P(w|z).








Figure 7. Simple serial network.


Joint distribution: P(W,Y,Z) = P(W)·P(Y|W)·P(Z|Y)

P(z) = Σ_{i,j} P(z | Y_j)·P(Y_j | W_i)·P(W_i)

     = 0.1·0.3·0.4 + 0.3·0.7·0.4 + 0.1·0.4·0.6 + 0.3·0.6·0.6
     = 0.012 + 0.084 + 0.024 + 0.108
     = 0.228


Suppose there is evidence that Z=true. What is the probability P(w|z)?





Figure
8
. Serial network with evidence



We have to compute P(W=true | Z=true).

From the formula for P(z) and (7) it follows that:

P(w|z) = P(z|w)·P(w) / Σ_{i,j} P(z | Y_j)·P(Y_j | W_i)·P(W_i)

It holds that:

P(z|w)·P(w) = Σ_i P(z | Y_i)·P(Y_i | w)·P(w)

So:

P(w|z) = Σ_i P(z | Y_i)·P(Y_i | w)·P(w) / Σ_{i,j} P(z | Y_j)·P(Y_j | W_i)·P(W_i)

       = (0.1·0.3·0.4 + 0.3·0.7·0.4) / 0.228
       = 0.421
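The whole serial-network inference can be sketched in Python, with the CPT values read off the terms above (names and layout are ours):

```python
from itertools import product

# P(W=true | Z=true) in the serial network W -> Y -> Z (Figure 8).
p_w = {True: 0.4, False: 0.6}
p_y_given_w = {True: 0.3, False: 0.4}   # P(Y=true | W)
p_z_given_y = {True: 0.1, False: 0.3}   # P(Z=true | Y)

def p_y(w, y):
    p = p_y_given_w[w]
    return p if y else 1 - p

# Denominator: marginal P(z) over all W, Y combinations.
p_z = sum(p_z_given_y[y] * p_y(w, y) * p_w[w]
          for w, y in product([True, False], repeat=2))
# Numerator: only the terms with W=true.
num = sum(p_z_given_y[y] * p_y(True, y) * p_w[True] for y in [True, False])

print(round(p_z, 3))        # 0.228
print(round(num / p_z, 3))  # 0.421
```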



5 D-separation


Converging network

Below, a converging network is shown. If there is evidence that Wet is true, then this observation influences the a-posteriori probabilities for Rain and Snow. If there is also evidence that Rain=true, then the probability for Snow also changes. However, in case there is evidence that Rain=true but not that Wet=true, then P(Snow) remains unchanged, see case (c).

This is called d-separation. In other words, parent nodes are conditionally dependent, depending on the existence of evidence in the child node. If there is no evidence in the child node, the parents are d-separated.
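This behavior can be checked numerically on the converging network of Figure 4 (Rain and Snow both pointing at Wet). The sketch below enumerates the full joint: without evidence on Wet, conditioning on Rain leaves P(Snow) at its prior, while once Wet=true is observed, evidence on Rain changes the belief in Snow (the classic "explaining away" effect). The helper function is our own illustration.

```python
from itertools import product

# Converging network from Figure 4: Rain -> Wet <- Snow.
p_rain = {True: 0.4, False: 0.6}
p_snow = {True: 0.1, False: 0.9}
p_wet_given = {(True, True): 0.7, (False, True): 0.6,
               (True, False): 0.6, (False, False): 0.1}  # P(Wet=true | R, S)

def joint(r, s, w):
    pw = p_wet_given[(r, s)]
    return p_rain[r] * p_snow[s] * (pw if w else 1 - pw)

def cond_p_snow(**evidence):
    """P(Snow=true | evidence) by enumerating the joint distribution."""
    num = den = 0.0
    for r, s, w in product([True, False], repeat=3):
        state = {'Rain': r, 'Snow': s, 'Wet': w}
        if all(state[k] == v for k, v in evidence.items()):
            den += joint(r, s, w)
            if s:
                num += joint(r, s, w)
    return num / den

print(round(cond_p_snow(Rain=True), 3))            # 0.1: equals the prior, d-separated
print(round(cond_p_snow(Wet=True), 3))             # 0.192: Wet raises belief in Snow
print(round(cond_p_snow(Wet=True, Rain=True), 3))  # 0.115: Rain explains Wet away
```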




(a)  (b)  (c)  (d)

Figure 9. A converging network demonstrating d-separation.





Diverging network

D-separation also occurs in a diverging network. In the figure below, such a network is shown. There is one parent and two child nodes. It shows that evidence in one child node only influences the other child node if there is no evidence in the parent node, see case (d).

Hence, in a diverging network, the child nodes are d-separated if there is evidence in the parent node.


(a)  (b)  (c)  (d)

Figure 10. A diverging network demonstrating d-separation.


Serial network

In a serial network, d-separation is observed in case the middle node has evidence. In other words, the parent and the child of a certain node in a serial network are d-separated in case the middle node has evidence, see the figure below, cases (c) and (d).



(a)  (b)  (c)  (d)


Figure 11. A serial network demonstrating d-separation.




The advantages of d-separation

Generally speaking, d-separation determines whether or not any two nodes in a DAG are independent if they are not connected by an edge. Using this notion, it is possible to compute different parts of the network separately. This is advantageous for fast computation, as well as for distributing the computational burden over a number of computers (as, for example, in a multi-agent system).






6 Beta Density Function (this is not part of the MASDAI class)


If we want to represent the belief in an event, we have to find a way to encode this in a function. If, for example, we toss a coin and want to calculate the probability for the number of times that heads shows up (or the belief in heads), relative to the total number of tosses, we typically use the normal distribution. Hence, the relative frequency of heads is 0.5.

However, for most beliefs, the normal distribution is not adequate. If we throw a die, the relative frequency of the number 5 will not follow a normal distribution. If we ask a group of people whether they eat breakfast in the morning, again a different distribution function will be needed. To easily create dedicated distributions for a particular relative frequency or belief, the beta density function can be used. It is defined as:


ρ(f) = Γ(N) / (Γ(a)·Γ(b)) · f^(a−1) · (1−f)^(b−1);  N = a+b    (8)


Here Γ(x) is the gamma function, defined as:

Γ(x) = ∫₀^∞ t^(x−1) · e^(−t) dt

The gamma function is a generalization of the factorial function. Namely, it can be shown that if x is an integer ≥ 1, it holds that:


Γ(x) = (x−1)!    (9)


The beta density function is usually written as beta(f; a,b).







Figure 12. Two beta density functions. On the left, beta(f; 3,3) is shown, while on the right beta(f; 10,5) is depicted.




Why beta density functions?

We want to be able to capture our belief in certain phenomena in a function. The best way to do this is to encode this belief in an appropriate function. So, rather than assigning just probability values to the states of a variable, we want to assign a complete function to compute the probability of a state. Beta density functions are very suitable for this task.


For example, in the figure below, we want to compute the probability that X=1, so P(X=1). For this, a beta density function is attached to compute this probability. This can be written as:

P(X=1 | F=f) = f

In other words, the relative frequency (or probability, or belief) that X=1 (given that we know f) equals f.




Figure 13. Simple augmented Bayesian network. The probability distribution of F represents our belief concerning the relative frequency with which X equals 1.



This can be rewritten as the estimate of the relative frequency (or expected value), defined as:

P(X=1) = E(f)    (10)


It can be shown that:

E(f) = a/N    (11)

In other words, for the probability P(X=1):

P(X=1) = a/N

Learning the relative frequency f

If we do not believe that one state is preferred over another, the density function F will be the uniform density function. It can be interpreted as all beliefs in all possibly occurring relative frequencies being the same, namely 1.



Figure 14. The uniform density function, or beta(f; 1,1).