Bayesian Networks
Introduction
A problem domain is modeled by a list of variables X_1, …, X_n
Knowledge about the problem domain is represented by the joint probability P(X_1, …, X_n)
Introduction
Example: Alarm
The story: in LA, burglaries and earthquakes are not uncommon. Either one can set off the alarm. When the alarm sounds, two neighbors, John and Mary, may call.
Problem: estimate the probability of a burglary based on who has or has not called
Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M)
Knowledge required to solve the problem: P(B, E, A, J, M)
Introduction
What is the probability of burglary given that Mary called, P(B = y | M = y)?
Compute the marginal probability:
P(B, M) = Σ_{E, A, J} P(B, E, A, J, M)
Then use the definition of conditional probability:
P(B = y | M = y) = P(B = y, M = y) / P(M = y)
Introduction
Difficulty: complexity in model construction and inference
In the Alarm example:
– 31 numbers needed (2^5 – 1)
– Computing P(B = y | M = y) takes 29 additions
In general:
– P(X_1, …, X_n) needs at least 2^n – 1 numbers to specify the joint probability
– Exponential storage and inference
Conditional Independence
Overcome the problem of exponential size by exploiting conditional independence
The chain rule of probabilities:
P(X_1, …, X_n) = P(X_1) P(X_2 | X_1) ⋯ P(X_n | X_1, …, X_{n–1})
Conditional Independence
Conditional independence in the problem domain:
The domain usually allows us to identify a subset pa(X_i) ⊆ {X_1, …, X_{i–1}} such that, given pa(X_i), X_i is independent of all variables in {X_1, …, X_{i–1}} \ pa(X_i), i.e.
P(X_i | X_1, …, X_{i–1}) = P(X_i | pa(X_i))
Then
P(X_1, …, X_n) = Π_i P(X_i | pa(X_i))
Conditional Independence
As a result, the joint probability P(X_1, …, X_n) can be represented by the conditional probabilities P(X_i | pa(X_i))
Example continued:
P(B, E, A, J, M)
= P(B) P(E | B) P(A | B, E) P(J | A, B, E) P(M | B, E, A, J)
= P(B) P(E) P(A | B, E) P(J | A) P(M | A)
pa(B) = {}, pa(E) = {}, pa(A) = {B, E}, pa(J) = {A}, pa(M) = {A}
Conditional probability tables specify: P(B), P(E), P(A | B, E), P(J | A), P(M | A)
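The factorization above can be sketched in Python. The CPT numbers below are illustrative assumptions, not values given in the text; only the factored structure matches the example.

```python
# Factored Alarm model: P(B,E,A,J,M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A).
# All probability values are assumed for illustration.
P_B = 0.001                      # P(B = y)
P_E = 0.002                      # P(E = y)
P_A = {                          # P(A = y | B, E)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}
P_J = {True: 0.90, False: 0.05}  # P(J = y | A)
P_M = {True: 0.70, False: 0.01}  # P(M = y | A)

def joint(b, e, a, j, m):
    """Joint probability as a product of the five conditionals."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# Only 1 + 1 + 4 + 2 + 2 = 10 numbers specify the model, instead of 31.
```

Because the conditionals are locally normalized, the product automatically sums to 1 over all 32 joint assignments.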
Conditional Independence
As a result:
Model size is reduced (10 numbers instead of 31 in the Alarm example)
Model construction is easier
Inference is easier
Graphical Representation
To graphically represent the conditional independence relationships, construct a directed graph by drawing an arc from X_j to X_i iff X_j ∈ pa(X_i)
pa(B) = {}, pa(E) = {}, pa(A) = {B, E}, pa(J) = {A}, pa(M) = {A}
[Figure: DAG with arcs B → A, E → A, A → J, A → M]
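The arc-drawing rule can be sketched as a one-liner over the parent sets (a minimal Python illustration of the rule, not part of the original slides):

```python
# Arc set of the DAG: draw Xj -> Xi iff Xj is in pa(Xi).
PA = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
arcs = [(xj, xi) for xi, parents in PA.items() for xj in parents]
print(sorted(arcs))  # [('A', 'J'), ('A', 'M'), ('B', 'A'), ('E', 'A')]
```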
Graphical Representation
We also attach the conditional probability table P(X_i | pa(X_i)) to node X_i
The result: a Bayesian network
[Figure: the DAG above with P(B) at B, P(E) at E, P(A | B, E) at A, P(J | A) at J, P(M | A) at M]
Formal Definition
A Bayesian network is:
A directed acyclic graph (DAG), where
each node represents a random variable
and is associated with the conditional probability of the node given its parents
Intuition
A BN can be understood as a DAG where arcs represent direct probabilistic dependence
Absence of an arc indicates probabilistic independence: a variable is conditionally independent of all its nondescendants given its parents
From the graph: B ⊥ E, J ⊥ B | A, J ⊥ E | A
[Figure: the Alarm DAG]
Construction
Procedure for constructing a BN:
Choose a set of variables describing the application domain
Choose an ordering of the variables
Start with an empty network and add variables to the network one by one according to the ordering
Construction
To add the i-th variable X_i:
– Determine pa(X_i) among the variables already in the network (X_1, …, X_{i–1}) such that
P(X_i | X_1, …, X_{i–1}) = P(X_i | pa(X_i))
(domain knowledge is needed here)
– Draw an arc from each variable in pa(X_i) to X_i
Example
Order: B, E, A, J, M
– pa(B) = pa(E) = {}, pa(A) = {B, E}, pa(J) = {A}, pa(M) = {A}
Order: M, J, A, B, E
– pa(M) = {}, pa(J) = {M}, pa(A) = {M, J}, pa(B) = {A}, pa(E) = {A, B}
Order: M, J, E, B, A
– Fully connected graph
[Figure: the three resulting DAGs, one per ordering]
Construction
Which variable order?
Naturalness of probability assessment:
– M, J, E, B, A is bad because P(B | J, M, E) is not natural to assess
Minimize the number of arcs:
– M, J, E, B, A is bad (too many arcs); the first ordering is good
Use causal relationships: causes come before their effects:
– M, J, E, B, A is bad because M and J are effects of A but come before A
[Figure: the sparse causal DAG vs. the densely connected DAG from a bad ordering]
Causal Bayesian Networks
A causal Bayesian network, or simply a causal network, is a Bayesian network whose arcs are interpreted as indicating cause-effect relationships
Build a causal network:
– Choose a set of variables that describes the domain
– Draw an arc to a variable from each of its direct causes (domain knowledge required)
Example
[Figure: a causal network with nodes Visit Africa, Tuberculosis, Smoking, Lung Cancer, Bronchitis, Tuberculosis or Lung Cancer, X-Ray, Dyspnea]
Causal BN
Causality is not a well-understood concept.
– No widely accepted definition.
– No consensus on whether it is a property of the world or a concept in our minds
Sometimes causal relations are obvious:
– Alarm causes people to leave the building.
– Lung cancer causes a mass on the chest X-ray.
At other times, they are not that clear. Doctors believe smoking causes lung cancer, but the tobacco industry has a different story:
[Figure: S → C (Surgeon General, 1964) vs. a hidden common cause * of both S and C (Tobacco Industry)]
Inference
Posterior queries to a BN:
– We have observed the values of some variables
– What are the posterior probability distributions of the other variables?
Example: both John and Mary reported the alarm
– What is the probability of burglary, P(B | J = y, M = y)?
Inference
General form of query: P(Q | E = e) = ?
Q is a list of query variables
E is a list of evidence variables
e denotes the observed values
Inference Types
Diagnostic inference: P(B | M = y)
Predictive/causal inference: P(M | B = y)
Intercausal inference (between causes of a common effect): P(B | A = y, E = y)
Mixed inference (combining two or more of the above): P(A | J = y, E = y) (diagnostic and causal)
All types are handled in the same way
Naïve Inference
Naïve algorithm for solving P(Q | E = e) in a BN:
Get the probability distribution P(X) over all variables X by multiplying the conditional probabilities
The BN structure is not used; for many variables the algorithm is not practical
In general, exact inference is NP-hard
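The naïve algorithm can be sketched as follows: multiply the conditionals to get the full joint, then answer a query by summing over the unobserved variables. The Alarm CPT numbers here are illustrative assumptions, not values given in the text.

```python
from itertools import product

# Naive inference on the Alarm network (illustrative CPT numbers).
pB, pE = 0.001, 0.002
pA = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1|B,E)
pJ = {1: 0.90, 0: 0.05}  # P(J=1|A)
pM = {1: 0.70, 0: 0.01}  # P(M=1|A)

def joint(b, e, a, j, m):
    """P(B,E,A,J,M) as the product of all conditional probabilities."""
    p = (pB if b else 1 - pB) * (pE if e else 1 - pE)
    p *= pA[(b, e)] if a else 1 - pA[(b, e)]
    p *= pJ[a] if j else 1 - pJ[a]
    p *= pM[a] if m else 1 - pM[a]
    return p

# P(B=y | M=y): sum the joint over the hidden variables E, A, J,
# then divide by the marginal P(M=y).  29 additions, as on the slide.
num = sum(joint(1, e, a, j, 1) for e, a, j in product((0, 1), repeat=3))
den = sum(joint(b, e, a, j, 1) for b, e, a, j in product((0, 1), repeat=4))
print(num / den)  # posterior probability of burglary given Mary called
```

Every query type on the previous slide reduces to the same sum-and-divide pattern, which is exactly why the naïve method scales exponentially.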
Inference
Though exact inference is NP-hard in general, in some cases the problem is tractable, e.g. if the BN has a (poly)tree structure an efficient algorithm exists
(a polytree is a directed acyclic graph in which no two nodes have more than one path between them)
Another practical approach: stochastic simulation
A general sampling algorithm
For i = 1 to n:
1. Find the parents of X_i: (X_p(i,1), …, X_p(i,m))
2. Recall the values that those parents were randomly given
3. Look up the table for P(X_i | X_p(i,1) = x_p(i,1), …, X_p(i,m) = x_p(i,m))
4. Randomly set x_i according to this probability
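The loop above (forward sampling in topological order, so that parents are always sampled before their children) can be sketched in Python for the Alarm network. The CPT numbers are illustrative assumptions.

```python
import random

# Each entry maps a variable to (parent tuple, table of P(X=1 | parents)).
# Probability values are assumed for illustration.
CPTS = {
    "B": ((), {(): 0.001}),
    "E": ((), {(): 0.002}),
    "A": (("B", "E"), {(1, 1): 0.95, (1, 0): 0.94,
                       (0, 1): 0.29, (0, 0): 0.001}),
    "J": (("A",), {(1,): 0.90, (0,): 0.05}),
    "M": (("A",), {(1,): 0.70, (0,): 0.01}),
}
ORDER = ["B", "E", "A", "J", "M"]  # topological order of the DAG

def sample_once(rng=random):
    values = {}
    for x in ORDER:
        parents, table = CPTS[x]
        # Steps 1-2: recall the values already drawn for the parents.
        key = tuple(values[p] for p in parents)
        # Step 3: look up P(X = 1 | parents = key).
        p1 = table[key]
        # Step 4: randomly set x_i according to this probability.
        values[x] = 1 if rng.random() < p1 else 0
    return values
```

Each call returns one complete joint sample; repeated calls feed the counting estimates on the next slide.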
Stochastic Simulation
We want to know P(Q = q | E = e)
Do many random samplings and count:
– N_c: number of samples in which E = e
– N_s: number of samples in which Q = q and E = e
– N: total number of random samples
If N is big enough:
– N_c / N is a good estimate of P(E = e)
– N_s / N is a good estimate of P(Q = q, E = e)
– N_s / N_c is then a good estimate of P(Q = q | E = e)
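The counting scheme above can be sketched with a toy sampler. The two-variable model (Q → Ev) and its probabilities are assumed purely for illustration, standing in for a full network sampler.

```python
import random

def sample(rng):
    """Draw one (Q, Ev) pair from a toy model with assumed probabilities."""
    q = 1 if rng.random() < 0.3 else 0   # P(Q = 1) = 0.3 (assumed)
    p_ev = 0.9 if q else 0.2             # P(Ev = 1 | Q) (assumed)
    ev = 1 if rng.random() < p_ev else 0
    return q, ev

def estimate(n_samples, seed=0):
    """Estimate P(Q = 1 | Ev = 1) as N_s / N_c."""
    rng = random.Random(seed)
    n_c = n_s = 0
    for _ in range(n_samples):
        q, ev = sample(rng)
        if ev == 1:            # N_c: samples matching the evidence
            n_c += 1
            if q == 1:         # N_s: samples also matching the query
                n_s += 1
    return n_s / n_c if n_c else float("nan")

# Exact answer: 0.3*0.9 / (0.3*0.9 + 0.7*0.2) = 27/41, about 0.659.
print(estimate(100_000))
```

Note that samples failing the evidence are simply discarded, so rare evidence makes this estimator expensive.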
Parameter Learning
Example:
– given a BN structure
– and a dataset
– estimate the conditional probabilities P(X_i | pa(X_i))
[Figure: a BN over X_1, …, X_5]
Dataset ("?" means a missing value):
X_1 X_2 X_3 X_4 X_5
0   0   1   1   0
1   0   0   1   0
0   ?   0   0   ?
…   …   …   …   …
Parameter Learning
We consider the case of complete data
Use the maximum likelihood (ML) algorithm and Bayesian estimation
Modes of learning:
– Sequential learning
– Batch learning
Bayesian estimation is suitable for both sequential and batch learning
ML is suitable only for batch learning
ML in BN with Complete Data
n variables X_1, …, X_n
Number of states of X_i: r_i = |X_i|
Number of configurations of the parents of X_i: q_i = |pa(X_i)|
Parameters to be estimated:
θ_ijk = P(X_i = j | pa(X_i) = k), i = 1, …, n; j = 1, …, r_i; k = 1, …, q_i
ML in BN with Complete Data
Example: consider the BN with arcs X_1 → X_3 and X_2 → X_3. Assume all variables are binary, taking values 1, 2.
θ_ijk = P(X_i = j | pa(X_i) = k)
k indexes the parent configurations; for X_3 there are q_3 = 4 of them
ML in BN with Complete Data
A complete case D_l is a vector of values, one for each variable (all data is known).
Example: D_l = (X_1 = 1, X_2 = 2, X_3 = 2)
Given: a set of complete cases D = {D_1, …, D_m}
Find: the ML estimate of the parameters θ
ML in BN with Complete Data
Loglikelihood:
l(θ | D) = log L(θ | D) = log P(D | θ) = log Π_l P(D_l | θ) = Σ_l log P(D_l | θ)
The term log P(D_l | θ), e.g. for D_4 = (1, 2, 2):
log P(D_4 | θ) = log P(X_1 = 1, X_2 = 2, X_3 = 2 | θ)
= log P(X_1 = 1 | θ) P(X_2 = 2 | θ) P(X_3 = 2 | X_1 = 1, X_2 = 2, θ)
= log θ_111 + log θ_221 + log θ_322
Recall: θ = {θ_111, θ_121, θ_211, θ_221, θ_311, θ_312, θ_313, θ_314, θ_321, θ_322, θ_323, θ_324}
[Figure: the network X_1 → X_3 ← X_2]
ML in BN with Complete Data
Define the characteristic function of D_l:
χ(i, j, k : D_l) = 1 if X_i = j and pa(X_i) = k in D_l, and 0 otherwise
When l = 4, D_4 = (1, 2, 2):
χ(1, 1, 1 : D_4) = χ(2, 2, 1 : D_4) = χ(3, 2, 2 : D_4) = 1,
χ(i, j, k : D_4) = 0 for all other i, j, k
So log P(D_4 | θ) = Σ_{ijk} χ(i, j, k : D_4) log θ_ijk
In general, log P(D_l | θ) = Σ_{ijk} χ(i, j, k : D_l) log θ_ijk
ML in BN with Complete Data
Define: m_ijk = Σ_l χ(i, j, k : D_l),
the number of data cases where X_i = j and pa(X_i) = k
Then:
l(θ | D) = Σ_l log P(D_l | θ)
= Σ_l Σ_{i,j,k} χ(i, j, k : D_l) log θ_ijk
= Σ_{i,j,k} Σ_l χ(i, j, k : D_l) log θ_ijk
= Σ_{i,j,k} m_ijk log θ_ijk
= Σ_{i,k} Σ_j m_ijk log θ_ijk
ML in BN with Complete Data
We want to find:
argmax_θ l(θ | D) = argmax_θ Σ_{i,k} Σ_j m_ijk log θ_ijk
Assume that θ_ijk = P(X_i = j | pa(X_i) = k) is not related to θ_i'j'k' provided that i ≠ i' or k ≠ k'
Consequently we can maximize each term in the summation over i, k separately:
argmax_{θ_ijk} Σ_j m_ijk log θ_ijk
ML in BN with Complete Data
As a result we have:
θ_ijk = m_ijk / Σ_j' m_ij'k
In words, the ML estimate of θ_ijk = P(X_i = j | pa(X_i) = k) is
(number of cases where X_i = j and pa(X_i) = k) / (number of cases where pa(X_i) = k)
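The counting estimate can be sketched in Python. The network X_1 → X_3 ← X_2 matches the running example; the four data cases are assumed purely for illustration.

```python
from collections import Counter

# ML estimation by counting: theta_ijk = m_ijk / sum over j' of m_ij'k.
PARENTS = {"X1": (), "X2": (), "X3": ("X1", "X2")}
DATA = [  # complete cases, each value in {1, 2}; assumed for illustration
    {"X1": 1, "X2": 2, "X3": 2},
    {"X1": 1, "X2": 1, "X3": 1},
    {"X1": 1, "X2": 2, "X3": 2},
    {"X1": 2, "X2": 2, "X3": 1},
]

def ml_estimates(data, parents):
    # m_ijk: count of cases with X_i = j and pa(X_i) in configuration k.
    counts = Counter()
    for case in data:
        for var, pa in parents.items():
            k = tuple(case[p] for p in pa)
            counts[(var, case[var], k)] += 1
    # Normalize within each (variable, parent-configuration) block.
    theta = {}
    for (var, j, k), m in counts.items():
        total = sum(m2 for (v2, _, k2), m2 in counts.items()
                    if v2 == var and k2 == k)
        theta[(var, j, k)] = m / total
    return theta

theta = ml_estimates(DATA, PARENTS)
print(theta[("X3", 2, (1, 2))])  # both cases with X1=1, X2=2 have X3=2
```

Parent configurations never seen in the data simply get no estimate here; in practice that is one motivation for the Bayesian (smoothed) estimates mentioned earlier.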
More to do with BN
Learning parameters when some values are missing
Learning the structure of a BN from training data
Many more…
References
Pearl, Judea, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, CA, 1988.
Heckerman, David, "A Tutorial on Learning with Bayesian Networks," Technical Report MSR-TR-95-06, Microsoft Research, 1995.
R. G. Cowell, A. P. Dawid, S. L. Lauritzen and D. J. Spiegelhalter, Probabilistic Networks and Expert Systems, Springer-Verlag, 1999.
www.ai.mit.edu/~murphyk/Software
http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
http://www.ets.org/research/conferences/almond2004.html#software