Author: Dr. M. Maris, University of Amsterdam
March 2007 (v5)
Bayesian Networks
1 Introduction
Bayesian Networks (BNs) are a methodology to model and simulate the behavior of discrete-event systems under uncertainty. The basic assumption of the Bayes methodology is that the state variables in the system to be modeled and simulated can be represented by probability functions (discrete variables) or probability density functions (continuous variables). The network can draw conclusions and do inference about the future system state using Bayes' theorem. In this introduction, we will only treat discrete BNs.
A-priori knowledge and assumptions of the modeler concerning causal relationships, as well as likelihoods of state variables in the modeled system, can be integrated with observed data (evidence) to compute the a-posteriori degree of belief in a hypothesis. Each observation of an event can incrementally increase or decrease the degree of belief in a hypothesis or a model. The Bayes methodology is therefore significantly more flexible than rule-based algorithms, which reject a hypothesis completely when it is inconsistent with observations. New observations can be classified step by step by combining weighted probability measures of alternative hypotheses. Therefore, the Bayes methodology can also be used nicely in real-time classification systems.
A BN is created by exploiting causal relationships that exist between phenomena in the world. By connecting the variables in a network, a DAG (Directed Acyclic Graph) should be created. Hence, there are no circular dependencies between nodes.

Applications of BNs are found in fault diagnosis for complex systems with multiple state variables and multiple causal dependencies. Furthermore, they are used for classification of objects under uncertainty using noisy sensor data.
There are two meanings of probability. Probabilities can describe frequencies of outcomes in random experiments, e.g. the probability of a head when tossing a fair coin in repeated trials. They can also be used to describe degrees of belief in propositions that do not necessarily involve random experiments, e.g. the probability that a certain production machine will fail, given the evidence of a poor surface quality of the workpiece.
A Simple Bayesian Network
Between two variables a causal relation is encoded. For example, in Figure 1, the variable Y causally depends on X.
Figure 1. Causal relation between X and Y.
2 Bayes rule
The Bayes theorem goes back to the seminal work of the English reverend Thomas Bayes in the 18th century on games of chance. The Bayes theorem can be written as:
P(X|Y) = P(Y|X).P(X) / P(Y)   (1)
Where:
P(X): a-priori probability of a hypothesis X, representing the initial degree of belief;
P(Y|X): likelihood of the data Y given the hypothesis X;
P(X|Y): a-posteriori probability of X given the data in Y, representing the degree of belief in the hypothesis after having observed the data in Y;
P(Y): marginal likelihood of the observations Y.
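As a minimal sketch, eq. (1) maps directly onto a small function. The numeric values below are illustrative assumptions, not taken from a figure:

```python
def bayes(p_x, p_y_given_x, p_y):
    """Posterior P(X|Y) via Bayes' theorem: P(Y|X).P(X) / P(Y)."""
    return p_y_given_x * p_x / p_y

# Illustrative values: prior 0.4, likelihood 0.7, marginal likelihood 0.52
posterior = bayes(0.4, 0.7, 0.52)
print(round(posterior, 2))  # 0.54
```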
3 Basic Rules
Notation
P(x) = P(X=true)
P(¬x) = P(X=false)
Conditional Probability Table (CPT)
The causal relation between two variables is encoded in a conditional probability table,
or CPT.
State variables
Each variable has a number of states. The most common set for discrete networks is two states {true, false}. In principle, n states are possible.
Marginal likelihood
The probability distribution for P(X) is known a-priori. The probability distribution for P(Y) can be computed from the network using P(X) and the CPT values. P(Y) is also called the marginal likelihood.
Range
0 ≤ P(X) ≤ 1
Probabilities sum up to 1.
Assuming mutually exclusive state variables:
Σi P(Xi) = 1, where i denotes the index for the state variables.   (2)
Product Rule
P(X,Y) = P(Y|X).P(X) = P(X|Y).P(Y)   (3)
Independence
If X and Y are independent then:
P(X,Y) = P(X).P(Y)   (4)
Joint probability
The joint probability is the total distribution of the probabilities in the network, typically computed using the product rule:
P(X1,...,Xn) = Πi P(Xi | parents(Xi))   (5)
Factorization
A factorization of a joint probability is a list of factors from which one can construct the joint probability. The factors are in fact all the functions P(Xi | parents(Xi)) in the joint probability. The joint probability P(W,X,Y,Z) can be factorized as a set of conditional independence relations, as:
P(W,X,Y,Z) = P(W).P(X).P(Y|W).P(Z|X,Y)
Sum Rule
P(Y) = Σi P(Y|Xi).P(Xi)   (6)
Bayesian rule expanded
Combining (1) and (6), we get:
P(X|Y) = P(Y|X).P(X) / Σi P(Y|Xi).P(Xi)   (7)
Augmented Bayesian network
A Bayesian network in which a node is used that represents the belief about a relative frequency.
Numerical example 1
In Figure 2, a small BN is shown with two nodes. The a-priori probability values are given for P(X) and P(Y) is computed. The CPT is shown on the right. The joint distribution is:
P(X,Y) = P(X).P(Y|X)
Note that X does not have parents, so P(X | parents) = P(X).
Figure 2. Example BN with two nodes and CPT (made in Hugin).
Each variable has two states (State1 and State2). We assume that they are {true, false}. For example, the probability of Y=true given X=false: P(y|¬x) = 0.6. Using (6) we compute P(Y):
P(Y=true) = Σi P(Y|Xi).P(Xi) = 0.2·0.3 + 0.6·0.7 = 0.48
P(Y=false) = 1 − P(Y=true) = 0.52
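The computation above can be checked with a short script. It uses the CPT values given in the text, with the two states represented as Python booleans:

```python
# Marginal P(Y) for the two-node BN of Figure 2, via the sum rule (6).
p_x = {True: 0.3, False: 0.7}          # a-priori P(X)
p_y_given_x = {True: 0.2, False: 0.6}  # CPT: P(Y=true | X)

p_y = sum(p_y_given_x[x] * p_x[x] for x in (True, False))
print(round(p_y, 2))      # P(Y=true)  -> 0.48
print(round(1 - p_y, 2))  # P(Y=false) -> 0.52
```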
Numerical example 2, encoding beliefs:
Suppose we have the network shown in Figure 3, encoding the relationship between Rain and Wet. Here, the belief is encoded that it will rain. There is one observable symptom, namely Wet. We want to compute the marginal likelihood distribution P(Wet), or P(W).
Figure 3. A simple Bayesian network encoding the causal relationship between being wet and rain. Here, it shows that Wet is an observable symptom of Rain. The CPT is shown on the right. For example, P(Wet | not Rain) = 0.4.
P(W) = Σi P(W|Ri).P(Ri)
P(w) = P(w|r).P(r) + P(w|¬r).P(¬r)
= 0.7·0.4 + 0.4·0.6
= 0.28 + 0.24 = 0.52
and,
P(¬w) = 1 − 0.52 = 0.48
So, the marginal likelihood distribution P(W) = {0.52, 0.48}.
Numerical example 3, two parents:
The network of Figure 3 is modified by adding an additional node Snow as parent to Wet. Again, we want to compute the marginal likelihood distribution P(W).
Figure 4. Simple converging Bayesian network encoding the causal relationship between being wet and rain or snow. Here, it shows that Wet is an observable symptom of Rain and Snow. The CPT is shown on the right. For example, P(Wet | Rain, Snow) = 0.6.
P(W) = Σi,j P(W|Ri,Sj).P(Ri).P(Sj)

P(W) = P(w|r,s).P(r).P(s) = 0.7·0.4·0.1 = 0.028
+ P(w|¬r,s).P(¬r).P(s) = 0.6·0.6·0.1 = 0.036
+ P(w|r,¬s).P(r).P(¬s) = 0.6·0.4·0.9 = 0.216
+ P(w|¬r,¬s).P(¬r).P(¬s) = 0.1·0.6·0.9 = 0.054
Total = 0.334
So, the marginal likelihood distribution P(W) = {0.33, 0.67}.
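The two-parent marginalization can be sketched in code. The CPT values are those appearing in the terms of the sum above; which term belongs to which (Rain, Snow) configuration is read off the order of that sum and is therefore an assumption:

```python
from itertools import product

# Marginal P(Wet=true) for the converging network of Figure 4.
p_r = {True: 0.4, False: 0.6}   # P(Rain)
p_s = {True: 0.1, False: 0.9}   # P(Snow)
# CPT: P(Wet=true | Rain, Snow)
p_w_given = {(True, True): 0.7, (False, True): 0.6,
             (True, False): 0.6, (False, False): 0.1}

p_w = sum(p_w_given[(r, s)] * p_r[r] * p_s[s]
          for r, s in product((True, False), repeat=2))
print(round(p_w, 3))  # 0.334
```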
Numerical example 4:
This example shows a slightly more complicated network. We want to compute P(z).
Figure 5. Bayesian network with three nodes and CPTs.
For the joint distribution in Figure 5, we get:
P(X,Y,Z) = P(X).P(Y|X).P(Z|X,Y)
Now we can compute any marginal probability. For example P(Z=true), or P(z):
P(z) = Σi,j { P(z|Xi,Yj).P(Xi).P(Yj|Xi) }
= 0.2·0.4·0.1 + 0.4·0.6·0.8 + 0.5·0.4·0.9 + 0.1·0.6·0.2
= 0.008 + 0.192 + 0.18 + 0.012
= 0.392
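The same marginal can be obtained by summing the full joint distribution over X and Y. The CPT values below are read off the terms of the sum above; which term belongs to which (X, Y) configuration is an assumption:

```python
from itertools import product

# P(Z=true) for the network of Figure 5, summing the joint
# P(X,Y,Z) = P(X).P(Y|X).P(Z|X,Y) over all values of X and Y.
p_x = {True: 0.4, False: 0.6}
p_y_given_x = {True: 0.1, False: 0.8}                  # P(Y=true | X)
p_z_given = {(True, True): 0.2, (True, False): 0.5,    # P(Z=true | X, Y)
             (False, True): 0.4, (False, False): 0.1}

p_z = sum(p_z_given[(x, y)] * p_x[x]
          * (p_y_given_x[x] if y else 1 - p_y_given_x[x])
          for x, y in product((True, False), repeat=2))
print(round(p_z, 3))  # 0.392
```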
4 Evidence and Inference
Now, let's add evidence to the observable node Wet. This means that it is observed (e.g. by a sensor) that it is indeed wet. In the small BN below, we can use this evidence Wet=true. How is P(Rain) then computed?
Figure 6. Showing evidence in a BN.
From Bayes' rule follows:
P(w|r).P(r) = P(r|w).P(w)
0.7·0.4 = P(r|w)·0.52
So, P(r|w) = 0.28/0.52 = 0.54
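This inversion step can be sketched directly, using the numbers from the Rain/Wet network above:

```python
# Posterior P(Rain=true | Wet=true) for the network of Figure 6:
# P(r|w) = P(w|r).P(r) / P(w)
p_r = 0.4
p_w_given_r = 0.7
p_w = 0.52  # marginal likelihood, computed in numerical example 2

p_r_given_w = p_w_given_r * p_r / p_w
print(round(p_r_given_w, 2))  # 0.54
```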
Evidence in a serial network
Below, a simple serial network is shown. Let's compute P(z), then add evidence and infer P(w) given this evidence, or P(w|z).
Figure 7. Simple serial network.
Joint distribution:
P(W,Y,Z) = P(W).P(Y|W).P(Z|Y)
P(z) = Σi,j P(z|Yj).P(Yj|Wi).P(Wi)
= 0.1·0.3·0.4 + 0.3·0.7·0.4 + 0.1·0.4·0.6 + 0.3·0.6·0.6
= 0.012 + 0.084 + 0.024 + 0.108
= 0.228
Suppose there is evidence that Z=true. What is the probability P(w|z)?
Figure 8. Serial network with evidence.
We have to compute P(W=true | Z=true). From the formula for P(z) and (7) it follows that:
P(w|z) = P(z|w).P(w) / Σi,j P(z|Yj).P(Yj|Wi).P(Wi)
It holds that:
P(z|w).P(w) = Σi P(z|Yi).P(Yi|w).P(w)
So:
P(w|z) = Σi P(z|Yi).P(Yi|w).P(w) / Σi,j P(z|Yj).P(Yj|Wi).P(Wi)
= (0.1·0.3·0.4 + 0.3·0.7·0.4) / 0.228
= 0.421
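The serial-network inference can be sketched by enumerating the joint distribution. The CPT values are read off the terms of the sum for P(z) above; which term belongs to which configuration is an assumption:

```python
from itertools import product

# Serial network W -> Y -> Z of Figures 7/8: compute P(z), then the
# posterior P(w|z) given evidence Z=true.
p_w = {True: 0.4, False: 0.6}
p_y_given_w = {True: 0.3, False: 0.4}  # P(Y=true | W)
p_z_given_y = {True: 0.1, False: 0.3}  # P(Z=true | Y)

def joint_z(w, y):
    """P(W=w, Y=y, Z=true) = P(W).P(Y|W).P(Z=true|Y)."""
    py = p_y_given_w[w] if y else 1 - p_y_given_w[w]
    return p_w[w] * py * p_z_given_y[y]

p_z = sum(joint_z(w, y) for w, y in product((True, False), repeat=2))
p_w_given_z = sum(joint_z(True, y) for y in (True, False)) / p_z
print(round(p_z, 3))          # 0.228
print(round(p_w_given_z, 3))  # 0.421
```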
5 D-separation
Converging network
Below is shown a converging network. If there is evidence that Wet is true, then this observation influences the a-posteriori probabilities for Rain and Snow. If there is also evidence that Rain=true, then the probability for Snow also changes. However, in case there is evidence that Rain=true but not that Wet=true, then P(Snow) remains unchanged, see case (c). This is called d-separation. In other words, parent nodes are conditionally dependent, depending on the existence of evidence in the child node. If there is no evidence in the child node, the parents are d-separated.
(a) (b) (c) (d)
Figure 9. A converging network demonstrating d-separation.
Diverging network
D-separation also occurs in a diverging network. In the figure below, such a network is shown. There is one parent and there are two child nodes. It shows that evidence in one child node only influences evidence in the other child node if there is no evidence in the parent node, see case (d). Hence, in a diverging network, the child nodes are d-separated if there is evidence in the parent node.
(a) (b) (c) (d)
Figure 10. A diverging network demonstrating d-separation.
Serial network
In a serial network, d-separation is observed in case the middle node has evidence. In other words, the parent and the child node of a certain node in a serial network are d-separated in case the middle node has evidence, see figure below, cases (c) and (d).
(a) (b)
(c) (d)
Figure 11. A serial network demonstrating d-separation.
The advantages of d-separation
Generally speaking, d-separation determines whether or not any two nodes in a DAG are independent, if they are not connected by an edge. Using this notion, it is possible to compute different parts of the network separately. This is advantageous for fast computation as well as for distributing the computational burden over a number of computers (as, for example, in a multi-agent system).
6 Beta Density Function
(this is not part of the MASDAI class)
If we want to represent the belief in an event, we have to find a way to encode this in a function. If, for example, we toss a coin and want to calculate the probability that heads shows up (or the belief in heads), relative to the total number of tosses, we typically use the normal distribution. Hence, the relative frequency of heads is 0.5.
However, for most beliefs, the normal distribution is not adequate. If we throw a die, the relative frequency of the number 5 will not follow a normal distribution. If we ask a group of people whether they have breakfast in the morning, again a different distribution function will be needed. To easily create dedicated distributions for a particular relative frequency or belief, the beta density function can be used. It is defined as:
ρ(f) = Γ(N) / (Γ(a)·Γ(b)) · f^(a−1) · (1−f)^(b−1) ;  N = a+b   (8)
Here Γ(x) is the gamma function, defined as:
Γ(x) = ∫₀^∞ t^(x−1) e^(−t) dt
The gamma function is a generalization of the factorial function. Namely, it can be shown that if x is an integer ≥ 1, it holds that:
Γ(x) = (x−1)!   (9)
The beta density function is usually written as beta(f; a,b).
Figure 12. Two Beta density functions. On the left, beta(f; 3,3) is shown, while on the right beta(f; 10,5) is depicted.
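Assuming Python's standard-library `math.gamma` for Γ, eq. (8) can be evaluated directly:

```python
import math

def beta_density(f, a, b):
    """Beta density of eq. (8): Gamma(N)/(Gamma(a)Gamma(b)) f^(a-1)(1-f)^(b-1)."""
    n = a + b
    coef = math.gamma(n) / (math.gamma(a) * math.gamma(b))
    return coef * f ** (a - 1) * (1 - f) ** (b - 1)

# beta(f; 3,3) is symmetric around f = 0.5; beta(f; 1,1) is the uniform density
print(round(beta_density(0.5, 3, 3), 3))  # 1.875
print(round(beta_density(0.5, 1, 1), 3))  # 1.0
```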
Why beta density functions?
We want to be able to capture our belief in certain phenomena in a function. The best way to do this is to encode this belief in an appropriate function. So, rather than assigning just probability values to the states of a variable, we want to assign a complete function to compute the probability of a state. Beta density functions are very suitable for this task.
For example, in the figure below, we want to compute the probability that X=1, so P(X=1). For this, a beta density function is attached to compute this probability. This can be written as:
P(X=1 | F=f) = f
In other words, the relative frequency (or probability, or belief) that X=1 (given that we know f) equals f.
Figure 13. Simple Augmented Bayesian Network. The probability distribution of F represents our belief concerning the relative frequency with which X equals 1.
This can be rewritten as the estimate of the relative frequency (or expected value), defined as:
P(X=1) = E(f)   (10)
It can be shown that:
E(f) = a/N   (11)
In other words, for the probability P(X=1):
P(X=1) = a/N
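Eq. (11) can be checked numerically against the density of eq. (8). The midpoint-rule integration below is an illustration, not part of the text:

```python
import math

def beta_density(f, a, b):
    """Beta density of eq. (8)."""
    return (math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
            * f ** (a - 1) * (1 - f) ** (b - 1))

def expected_f(a, b):
    """E(f) = a/N with N = a+b, eq. (11)."""
    return a / (a + b)

# Check eq. (11) for beta(f; 10,5) by integrating f.rho(f) over [0, 1]
a, b = 10, 5
steps = 100000
integral = sum(f * beta_density(f, a, b)
               for f in ((i + 0.5) / steps for i in range(steps))) / steps
print(round(expected_f(a, b), 3))  # 0.667
print(round(integral, 3))          # 0.667
```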
Learning the relative frequency f
If we do not believe that one state is preferred over another, the density function F will be the uniform density function. It can be interpreted as all beliefs in all possibly occurring relative frequencies being the same, namely 1.
Figure 14. The uniform density function, or beta(f; 1,1).