Bayesian Networks
Dan Witzner Hansen
Outline
Bayes' theorem
Maximum likelihood (ML) hypothesis
Maximum a posteriori (MAP) hypothesis
Bayesian belief networks
Time-dependent models (HMM)
Basic Formulas for Probabilities
• Product rule: probability P(A,B) of a conjunction of two events A and B:
P(A,B) = P(A|B) P(B) = P(B|A) P(A)
• Sum rule: probability of a disjunction of two events A and B:
P(A+B) = P(A) + P(B) − P(A,B)
• Theorem of total probability: if events A1, …, An are mutually exclusive with Σi P(Ai) = 1, then
P(B) = Σi P(B|Ai) P(Ai)
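The theorem of total probability is easy to check numerically. A minimal sketch, with made-up numbers for three mutually exclusive events:

```python
# Illustrative numbers (not from the slides): three mutually exclusive
# events A1..A3 whose priors P(Ai) sum to one, and likelihoods P(B|Ai).
priors = {"A1": 0.7, "A2": 0.2, "A3": 0.1}       # P(Ai)
likelihood = {"A1": 0.1, "A2": 0.5, "A3": 0.9}   # P(B | Ai)

# Theorem of total probability: P(B) = sum_i P(B|Ai) P(Ai)
p_b = sum(likelihood[a] * priors[a] for a in priors)
print(p_b)  # 0.7*0.1 + 0.2*0.5 + 0.1*0.9 = 0.26
```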
Basic Approach
Bayes rule:
P(h|D) = P(D|h) P(h) / P(D)
P(h) = prior probability of hypothesis h
P(D) = prior probability of training data D (evidence)
P(h|D) = probability of h given D (posterior density)
P(D|h) = probability of D given h (likelihood of D given h)
Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)
posterior = likelihood × prior / evidence
By observing the data D we can convert the prior probability P(h) into the a posteriori probability (posterior) P(h|D).
The posterior is the probability that h holds after the data D has been observed.
The evidence P(D) can be viewed merely as a scale factor that guarantees that the posterior probabilities sum to one.
Choosing Hypotheses
P(h|D) = P(D|h) P(h) / P(D)
We often need the most probable hypothesis given the training data (maximum a posteriori hypothesis).
Maximum a posteriori hypothesis hMAP:
hMAP = argmax h∈H P(h|D)
     = argmax h∈H P(D|h) P(h) / P(D)
     = argmax h∈H P(D|h) P(h)
If the hypothesis priors are equally likely, P(hi) = P(hj), then one can choose the maximum likelihood (ML) hypothesis:
hML = argmax h∈H P(D|h)
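A small sketch of the difference between hML and hMAP, using made-up numbers: three hypotheses about a coin's bias, with data D = 8 heads in 10 flips (all priors and biases below are illustrative assumptions, not from the slides):

```python
from math import comb

hypotheses = {"fair": 0.5, "biased": 0.8, "very_biased": 0.95}  # P(heads) under h
priors = {"fair": 0.9, "biased": 0.08, "very_biased": 0.02}      # P(h)

def likelihood(p_heads, heads=8, n=10):
    """P(D | h): binomial probability of observing `heads` in `n` flips."""
    return comb(n, heads) * p_heads**heads * (1 - p_heads)**(n - heads)

# hML maximizes P(D|h); hMAP maximizes P(D|h) P(h) (P(D) is a common factor)
h_ml = max(hypotheses, key=lambda h: likelihood(hypotheses[h]))
h_map = max(hypotheses, key=lambda h: likelihood(hypotheses[h]) * priors[h])
print(h_ml, h_map)
```

With these numbers the likelihood alone favors "biased", but the strong prior on "fair" flips the MAP choice: hML and hMAP can disagree whenever the priors are not uniform.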
MAP vs. Bayes Method
The maximum a posteriori hypothesis estimates a point hMAP in the hypothesis space H.
The Bayes method instead estimates and uses a complete distribution P(h|D).
The difference appears when MAP or the Bayes method is used for inference on unseen instances and one compares the distributions P(x|D):
MAP: P(x|D) = hMAP(x), with hMAP = argmax h∈H P(h|D)
Bayes: P(x|D) = Σ hi∈H P(x|hi) P(hi|D)
For reasonable prior distributions P(h), the MAP and Bayes solutions are equivalent in the asymptotic limit of infinite training data D.
Basic Formula for Probabilities
Product rule: P(A∧B) = P(A|B) P(B)
Sum rule: P(A∨B) = P(A) + P(B) − P(A∧B)
Theorem of total probability: if A1, A2, …, An are mutually exclusive events with Σi P(Ai) = 1, then
P(B) = Σi P(B|Ai) P(Ai)
An Example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.
P(cancer) = 0.008, P(¬cancer) = 0.992
P(+|cancer) = 0.98, P(−|cancer) = 0.02
P(+|¬cancer) = 0.03, P(−|¬cancer) = 0.97
P(cancer|+) = P(+|cancer) P(cancer) / P(+)
P(¬cancer|+) = P(+|¬cancer) P(¬cancer) / P(+)
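Working the slide's numbers through Bayes rule, with P(+) obtained by total probability:

```python
# The slide's numbers for the cancer test example.
p_cancer = 0.008
p_pos_given_cancer = 0.98        # P(+ | cancer)
p_pos_given_no_cancer = 0.03     # P(+ | ¬cancer)

# Evidence by total probability: P(+) = P(+|c)P(c) + P(+|¬c)P(¬c)
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_no_cancer * (1 - p_cancer))

p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 4))  # ≈ 0.2085
```

Despite the positive test, the posterior is only about 21%: the low prior dominates, which is exactly why P(h) matters.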
The independence hypothesis…
…makes computation possible
…yields optimal classifiers when satisfied
…but is seldom satisfied in practice, as attributes (variables) are often correlated.
Attempts to overcome this limitation:
Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
Decision trees, which reason on one attribute at a time, considering the most important attributes first
Bayesian Networks
Learning BN
Inference in BN
• Exact inference
• Approximate inference
Joint Probability Distribution (JPD)
P(A, B): JPD, the probability of both A and B.
P(A,B) ≤ P(A|B)
P(A|B): conditional probability, the probability of A given that B has already happened.
Conditional Independence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z, that is, if
∀ xi, yj, zk: P(X=xi | Y=yj, Z=zk) = P(X=xi | Z=zk)
or more compactly P(X|Y,Z) = P(X|Z)
Example: Thunder is conditionally independent of Rain given Lightning:
P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Notice: P(Thunder | Rain) ≠ P(Thunder)
Naïve Bayes uses conditional independence to justify:
P(X,Y|Z) = P(X|Y,Z) P(Y|Z) = P(X|Z) P(Y|Z)
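The factorization P(X,Y|Z) = P(X|Z) P(Y|Z) can be verified numerically on a toy joint distribution that is built to make X and Y independent given Z (all numbers below are illustrative assumptions):

```python
from itertools import product

p_z = {0: 0.4, 1: 0.6}
p_x1_given_z = {0: 0.2, 1: 0.7}   # P(X=1 | Z)
p_y1_given_z = {0: 0.5, 1: 0.1}   # P(Y=1 | Z)

def joint(x, y, z):
    """P(X=x, Y=y, Z=z) = P(Z) P(X|Z) P(Y|Z) by construction."""
    px = p_x1_given_z[z] if x else 1 - p_x1_given_z[z]
    py = p_y1_given_z[z] if y else 1 - p_y1_given_z[z]
    return p_z[z] * px * py

ok = True
for x, y, z in product([0, 1], repeat=3):
    p_xy_given_z = joint(x, y, z) / p_z[z]            # P(X,Y|Z) by definition
    p_x_given_z = sum(joint(x, yy, z) for yy in [0, 1]) / p_z[z]
    p_y_given_z = sum(joint(xx, y, z) for xx in [0, 1]) / p_z[z]
    ok &= abs(p_xy_given_z - p_x_given_z * p_y_given_z) < 1e-12
print(ok)  # True: the factorization holds for all 8 assignments
```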
Example
I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
Joint Distributions

         A0                  A1
         J0        J1        J0       J1
M0   0.987525  0.987525  0.00075  0.00075
M1   0.009975  0.009975  0.00175  0.00175

Given Alarm, John's call is independent of Mary's call (redundancy of the data).
Considering other variables E, B, the table becomes larger (efficiency).
3 variables A, B, C need a table of 8 values.
100 variables need 2^100 values. Too big to be acceptable.
Warm Ups
Imagine the following set of rules:
If it is raining or sprinklers are on then the street is wet.
If it is raining or sprinklers are on then the lawn is wet.
If the lawn is wet then the soil is moist.
If the soil is moist then the roses are OK.
Bayesian (Belief) Networks
Naïve assumption of conditional independence is too restrictive
Full probability distribution intractable due to lack of data
Bayesian networks describe conditional independence among subsets of variables (attributes), combining prior knowledge about dependencies among variables with observed training data.
Bayesian net:
Node = variable
Arc = dependency
DAG, with direction on arcs representing causality
Allows combining prior knowledge about causal relationships among variables with observed data
Dependencies & Independencies
Dependencies
Intuitive: two connected nodes influence each other. Symmetric.
Independencies
Example: P(J; M|A), P(B; E)
Others? P(B; E|A)?
Warm Ups
Consider a more complex tree/network:
If an event E at a leaf node happens (say, M) and we wish to know whether this supports A, we need to 'chain' our Bayes rule as follows:
P(A,C,F,M) = P(A|C,F,M) * P(C|F,M) * P(F|M) * P(M)
General product rule:
P(x1, …, xn | M) = Π i=1..n P(xi | Pai, M),  where Pai = parents(xi)
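The general product rule can be sketched on the burglary network: the product of the local CPTs defines a proper joint distribution. The CPT values below are the commonly used textbook numbers for this example (an assumption — the slides do not list them):

```python
from itertools import product

P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=true | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=true | A)

def joint(b, e, a, j, m):
    """P(B,E,A,J,M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A)."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# All 2^5 entries of the factored joint sum to 1.
total = sum(joint(*assign) for assign in product([True, False], repeat=5))
print(total)
```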
Compactness
A CPT for Boolean Xi with k Boolean parents has 2^k rows for the combinations of parent values.
Each row requires one number p for Xi = true (the number for Xi = false is just 1 − p).
If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers.
I.e., grows linearly with n, vs. O(2^n) for the full joint distribution.
For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31).
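The parameter count above can be reproduced mechanically from the graph structure alone:

```python
# Each Boolean node with k Boolean parents needs 2**k independent numbers.
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}

cpt_numbers = sum(2 ** len(ps) for ps in parents.values())
full_joint = 2 ** len(parents) - 1   # independent entries of the full JPD

print(cpt_numbers, full_joint)  # 10 vs. 31
```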
A more complex example
A BN with 37 (binary) nodes would otherwise need a JPD table with 2^37 = 137,438,953,472 values.
A 448 nodes example
BN formal definition (1/2)
NonDescendants(Xi): denotes the variables in the graph that are not descendants of Xi; NonDescendants(A) = {B, E}.
I(M; B | A): denotes that, given A, variables M and B are independent; P(B) = P(B|E).
Pa(Xi): denotes the parents of Xi; Pa(A) = {B, E}.
BN formal definition (2/2)
A Bayesian network structure G is a directed acyclic graph whose nodes represent random variables X1…Xn. Then G encodes the following set of conditional independence assumptions:
For each variable Xi, we have I(Xi; NonDescendants(Xi) | Pa(Xi)),
i.e., Xi is independent of its non-descendants given its parents.
From Daphne Koller, "Representing Complex Distributions".
Example: I(M; J | A).
Constructing Bayesian networks
1. Choose an ordering of variables X1, …, Xn
2. For i = 1 to n:
   add Xi to the network
   select parents from X1, …, Xi−1 such that
   P(Xi | Parents(Xi)) = P(Xi | X1, …, Xi−1)
This choice of parents guarantees:
P(X1, …, Xn) = Π i=1..n P(Xi | X1, …, Xi−1)   (chain rule)
             = Π i=1..n P(Xi | Parents(Xi))   (by construction)
We can construct conditional probabilities for each (binary) attribute to reflect our knowledge of the world.
Example
I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects "causal" knowledge:
A burglar can set the alarm off
An earthquake can set the alarm off
The alarm can cause Mary to call
The alarm can cause John to call
An Example
Conditional probability distributions in the tables.
P(B,E,A,J,M) = P(B) * P(E) * P(A|B,E) * P(J|A) * P(M|A)
Example
Suppose we choose the ordering M, J, A, B, E
P(J | M) = P(J)? No
P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)? Yes
P(B | A, J, M) = P(B)? No
P(E | B, A, J, M) = P(E | A)? No
P(E | B, A, J, M) = P(E | A, B)? Yes
Example contd.
Deciding conditional independence is hard in non-causal directions
Network is less compact
Learning Bayesian Networks
Suppose the structure is known and variables are partially observable
Similar to training a neural network with hidden units
Learn conditional probability tables using gradient ascent
Converge to the network h that (locally) maximizes P(D|h)
Inference in Bayesian Networks
How can one infer the probabilities of values of one or more network variables given observed values of others?
A Bayes net contains all the information needed for this inference.
If only one variable has an unknown value, it is easy to infer.
In the general case, inference in Bayesian networks is NP-hard.
In practice, exact inference methods work well for some network structures; otherwise, approximate inference is used.
Query
Interesting information from joint probabilities:
What is the probability that both Mary calls and John calls if a burglary happens? P(M, J | B)
What is the most probable explanation of Mary's call?
Queries can be answered by inference in the BN:
P(M,J|B) = P(B,M,J) / P(B) = Σ A,E P(B,E,A,M,J) / P(B)
The reasons for using BN
Whatever you want can be obtained from it:
Diagnostic tasks (from effect to cause): P(Burglary | JohnCalls=T)
Prediction tasks (from cause to effect): P(JohnCalls | Burglary=T)
Other probabilistic queries (queries on joint distributions): P(Alarm)
Reveals the structure of the JPD:
Dependencies are given by arrows; independencies are specified, too (later)
More compact than the JPD:
Gets rid of redundancy
All kinds of queries can be calculated (later)
Exact Inference in Bayesian Networks
Compute the posterior for a set of query variables given some observed event.
Inference by Enumeration
For example:
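Inference by enumeration can be sketched directly from the factored joint: sum it over the hidden variables, then normalize. Below, P(Burglary | JohnCalls=T, MaryCalls=T) on the burglary network; the CPT numbers are the standard textbook values for this example (an assumption — the slides do not list them):

```python
from itertools import product

P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=true | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=true | A)

def joint(b, e, a, j, m):
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

def p_burglary_given(j, m):
    # Unnormalized P(B, j, m), summing out the hidden variables E and A,
    # then normalize over both values of B.
    scores = {b: sum(joint(b, e, a, j, m)
                     for e, a in product([True, False], repeat=2))
              for b in [True, False]}
    return scores[True] / (scores[True] + scores[False])

print(round(p_burglary_given(True, True), 4))  # ≈ 0.2842
```

Even with both neighbors calling, the burglary posterior stays below 30%, because burglaries are rare and John's false-call rate is non-negligible.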
Variable Elimination Algorithm
Idea: sum out one variable at a time, generating a new distribution over the other variables connected with the eliminated variable.
Complexity of exact inference
The exact inference problem is NP-hard; the cost of computing is determined by the size of the intermediate factors.
Approximate Inference in BN
Sampling
Construct samples according to the probabilities given in a BN.
Alarm example (choose the right sampling sequence):
1) Sampling: P(B) = <0.001, 0.999>; suppose it is false, B0. Same for E0. P(A|B0,E0) = <0.001, 0.999>; suppose it is false…
2) Frequency counting: in the samples at right,
P(J|A0) = P(J,A0)/P(A0) = <1/9, 8/9>
Samples:
J1 M1 A1 B0 E1
J0 M0 A0 B0 E0
J0 M0 A0 B0 E0
J0 M0 A0 B0 E0
J0 M0 A0 B0 E0
J0 M0 A0 B0 E0
J0 M0 A0 B0 E0
J0 M0 A0 B0 E0
J1 M0 A0 B0 E0
J0 M0 A0 B0 E0
Approximate Inference in BN
Direct sampling
We have seen it.
Rejection sampling
Create samples as in direct sampling, but only count samples that are consistent with the given evidence.
Likelihood weighting, …
Sample variables and calculate an evidence weight; only create samples that support the evidence.
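A rejection-sampling sketch for the burglary network: sample each variable in topological order from its CPT, keep only samples consistent with the evidence A=true, and estimate P(J | A=true) by frequency counting. The CPT numbers are the usual textbook values (an assumption):

```python
import random

random.seed(0)

P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)

def sample_once():
    """Sample (B, E, A, J, M) in topological order from the CPTs."""
    b = random.random() < 0.001
    e = random.random() < 0.002
    a = random.random() < P_A[(b, e)]
    j = random.random() < (0.90 if a else 0.05)
    m = random.random() < (0.70 if a else 0.01)
    return b, e, a, j, m

# Keep only samples consistent with the evidence A=true.
kept = [s for s in (sample_once() for _ in range(200_000)) if s[2]]
estimate = sum(1 for s in kept if s[3]) / len(kept)   # P(J | A=true)
print(estimate)  # close to the CPT value 0.90
```

Note how wasteful rejection sampling is here: P(A=true) is tiny, so only a few hundred of the 200,000 samples survive, which is exactly the weakness likelihood weighting addresses.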
Approximate Inference in BN
To calculate P(J | B1, M1):
Choose (B1, E0, A1, M1, J1) as a start.
Evidence: B1, M1; variables: A, E, J.
Choose the next variable as A.
Sample A by P(A | MB(A)) = P(A | B1, E0, M1, J1); suppose it is false:
(B1, E0, A0, M1, J1)
Choose the next random variable as E …
Complexity of Approximate Inference
The approximate inference problem is NP-hard.
It will never reach the exact probability distribution, only come close to the value.
Much better than exact inference when the BN is big enough. In MCMC, only P(X | MB(X)) is considered, not the whole network.
Applications
• You have all used Bayes belief networks, probably a few dozen times, when you use Microsoft Office! (See http://research.microsoft.com/~horvitz/lum.htm)
• Microsoft is also experimenting with a Bayesian anti-spam filter and Bayesian techniques in healthcare devices.
Problems with Bayes nets
• Loops can sometimes occur with belief networks and have to be avoided.
• We have avoided the issue of where the probabilities come from. The probabilities either are given or have to be learned. Similarly, the network structure also has to be learned. (See http://www.bayesware.com/products/discoverer/discoverer.html)
• The number of paths to explore grows exponentially with each node. (The problem of exact probabilistic inference in Bayes networks is NP-hard. Approximation techniques may have to be used.)
Summary
Bayesian networks provide a natural representation for (causally induced) conditional independence
Topology + CPTs = compact representation of joint distribution
Generally easy for domain experts to construct
Variable Elimination Algorithm (2/3)
Eliminate E: f1(A,B) = Σ E P(E) * P(A|E,B)
f1(A1,B1) = P(E0)*P(A1|E0,B1) + P(E1)*P(A1|E1,B1)

          B0        B1
A0   0.998422   0.05998
A1   0.001578   0.94002
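The factor f1 can be recomputed in a few lines. Assuming the usual textbook CPTs for this network (P(E1)=0.002 and the standard P(A|B,E) table — not stated on the slide), the result reproduces the table above:

```python
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)

def f1(a, b):
    """Sum out E: f1(a,b) = sum_E P(E) * P(a | E, b)."""
    total = 0.0
    for e in [True, False]:
        p_a_true = P_A[(b, e)]
        total += P_E[e] * (p_a_true if a else 1 - p_a_true)
    return total

print(round(f1(True, True), 5))    # 0.94002  (matches the slide's table)
print(round(f1(False, True), 5))   # 0.05998
```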
Variable Elimination Algorithm (3/3)
Go on eliminating A using the distribution created by the last step.
Any query P(X|Y), where X = {X1..Xn}, Y = {Y1..Ym}, and Z = {Z1..Zk} = the other nodes except X and Y, can be calculated as below, where P(X,Y,Z) is given by the BN structure and the Σ can be eliminated step by step:
P(X|Y) = P(X,Y) / P(Y) = Σ Z P(X,Y,Z) / Σ X,Z P(X,Y,Z)