A Tutorial on Learning with
Bayesian Networks
David Heckerman
What is a Bayesian Network?
“a graphical model for probabilistic relationships
among a set of variables.”
Why use Bayesian Networks?
• Don't need a complete data set
• Can learn causal relationships
• Combines domain knowledge and data
• Avoids overfitting – no need for test data
Probability
• Two types:
1. Bayesian
2. Classical
Bayesian Probability
• ‘Personal’ probability
• Degree of belief
• A property of the person who assigns it
• The observations are fixed; imagine all possible values of the parameters from which they could have come
“I think the coin will land on heads 50% of the time”
Classical Probability
• A property of the environment
• ‘Physical’ probability
• Imagine all data sets of size N that could be generated by sampling from the distribution determined by the parameters. Each data set occurs with some probability and produces an estimate.
“The probability of getting heads on this particular coin is 50%”
Notation
• Variable: X
• State of X: X = x
• Set of variables: Y
• Assignment of the variables (configuration): y
• Probability that X = x for a person with state of information ξ: p(X = x | ξ)
• Uncertain variable: Θ
• Parameter: θ
• Outcome of the l-th try: X_l
• Observations: D = {X_1 = x_1, ..., X_N = x_N}
Example
• Thumbtack problem: will it land on the point (heads) or the flat bit (tails)?
• Flip it N times
• What will it do on the (N+1)-th flip?
How do we compute p(x_{N+1} | D, ξ) from p(θ | ξ)?
Step 1
• Use Bayes’ rule to get the probability distribution for Θ given D and ξ:

p(θ | D, ξ) = p(θ | ξ) p(D | θ, ξ) / p(D | ξ)

where

p(D | ξ) = ∫ p(D | θ, ξ) p(θ | ξ) dθ
Step 2
• Expand p(D | θ, ξ) – the likelihood function for binomial sampling
• The observations in D are mutually independent – the probability of heads is θ and of tails is 1 − θ
• Substituting into the previous equation:

p(θ | D, ξ) = p(θ | ξ) θ^h (1 − θ)^t / p(D | ξ)

where h and t are the numbers of heads and tails observed in D.
Step 3
• Average over the possible values of Θ to determine the probability of heads on the next toss:

p(X_{N+1} = heads | D, ξ) = ∫ θ p(θ | D, ξ) dθ = E_{p(θ | D, ξ)}(θ)

• E_{p(θ | D, ξ)}(θ) is the expectation of θ with respect to the distribution p(θ | D, ξ)
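To make the three steps concrete, here is a minimal numerical sketch on a discretized grid of θ values; the uniform prior and the counts h and t are illustrative placeholders, not anything prescribed by the paper:

```python
import numpy as np

# Discretize theta on a grid; any prior p(theta | xi) can be tabulated this way.
theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta) / theta.size      # uniform prior as a placeholder

h, t = 7, 3                                   # made-up counts of heads and tails in D

# Step 2: binomial likelihood p(D | theta, xi) = theta^h * (1 - theta)^t
likelihood = theta**h * (1 - theta)**t

# Step 1: Bayes' rule; the denominator p(D | xi) is the sum (integral) over theta
posterior = prior * likelihood
posterior /= posterior.sum()

# Step 3: p(X_{N+1} = heads | D, xi) is the expectation of theta under the posterior
p_next_heads = np.sum(theta * posterior)
print(p_next_heads)                           # ~ (h + 1) / (N + 2) for a uniform prior
```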
Prior Distribution
• The prior is taken from a beta distribution:

p(θ | ξ) = Beta(θ | α_h, α_t)

• α_h and α_t are called hyperparameters to distinguish them from the parameter θ
– the counts h and t are a sufficient statistic for D
• A beta prior means the posterior is beta too:

p(θ | D, ξ) = Beta(θ | α_h + h, α_t + t)
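Because the beta prior is conjugate, the update is just count addition. A small sketch (the hyperparameters and data counts are illustrative):

```python
def beta_update(alpha_h, alpha_t, heads, tails):
    """Posterior hyperparameters after observing `heads` and `tails`."""
    return alpha_h + heads, alpha_t + tails

# Illustrative prior equivalent to having already seen 2 heads and 2 tails
a_h, a_t = beta_update(2, 2, heads=7, tails=3)

# Predictive probability of heads on the next toss: the posterior mean of theta
p_heads = a_h / (a_h + a_t)
print(a_h, a_t, p_heads)   # Beta(9, 5); p_heads = 9/14
```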
Assessing the prior
• Imagined future data:
– Assess the probability of heads on the first toss of the thumbtack
– Imagine you have seen the outcomes of k flips
– Reassess the probability
• Equivalent samples:
– Start with a Beta(0, 0) prior and observe α_h heads and α_t tails
– The posterior will be Beta(α_h, α_t)
– Beta(0, 0) is the state of minimum information
– Assess α_h and α_t by determining the number of observed heads and tails that would be equivalent to our current knowledge
Can’t always use Beta prior
• What if you bought the thumbtack in a magic shop? It could be biased.
• Need a mixture of betas – this introduces a hidden variable H
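A sketch of how the mixture prior works: the posterior is again a mixture of betas, with each component's weight rescaled by its marginal likelihood. The components, weights, and counts below are invented for illustration:

```python
from math import lgamma, log, exp

def log_beta_fn(a, b):
    # log of the Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(a_h, a_t, h, t):
    # log p(D | H): marginal likelihood of h heads and t tails
    # under a Beta(a_h, a_t) prior on theta
    return log_beta_fn(a_h + h, a_t + t) - log_beta_fn(a_h, a_t)

# Hidden variable H indexes the component: roughly fair, biased toward
# heads, or biased toward tails (all hyperparameters illustrative).
components = [(20.0, 20.0), (8.0, 2.0), (2.0, 8.0)]
weights = [0.8, 0.1, 0.1]                  # illustrative prior p(H)

h, t = 9, 1                                # illustrative data: mostly heads

# Posterior mixture weights p(H | D) are proportional to p(H) * p(D | H)
log_w = [log(w) + log_marginal(a_h, a_t, h, t)
         for (a_h, a_t), w in zip(components, weights)]
m = max(log_w)
post = [exp(lw - m) for lw in log_w]
total = sum(post)
post = [p / total for p in post]
print(post)                                # "biased toward heads" gains weight
```

The posterior over θ is then the mixture of the updated Beta(α_h + h, α_t + t) components under these reweighted p(H | D).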
Distributions
• We’ve only been talking about binomials so far
• Observations could come from any physical probability distribution
• We can still use Bayesian methods, same as before:
– Define variables for the unknown parameters
– Assign priors to those variables
– Use Bayes’ rule to update beliefs
– Average over the possible values of Θ to make predictions
Exponential Family
• For distributions in the exponential family:
– calculations can be done efficiently and in closed form
– e.g. binomial, multinomial, normal, Gamma, Poisson...
• Bernardo and Smith (1994) compiled the important quantities and Bayesian computations for commonly used members of the family
• The paper focuses on multinomial sampling
Multinomial sampling
• X is discrete, with r possible states x^1, ..., x^r
• Likelihood function: p(X = x^k | θ, ξ) = θ_k
• Same number of parameters as states
• Parameters = physical probabilities
• Sufficient statistics for D = {X_1 = x_1, ..., X_N = x_N}:
– {N_1, ..., N_r}, where N_i is the number of times X = x^i in D
Multinomial Sampling
• The prior used is Dirichlet:
– p(θ | ξ) = Dir(θ | α_1, ..., α_r)
• The posterior is Dirichlet too:
– p(θ | D, ξ) = Dir(θ | α_1 + N_1, ..., α_r + N_r)
• It can be assessed the same way as a beta distribution
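A minimal sketch of the Dirichlet update: tally the sufficient statistics N_i, add them to the hyperparameters, and normalize the posterior counts for the predictive distribution (states, hyperparameters, and observations are illustrative):

```python
from collections import Counter

states = ["x1", "x2", "x3"]                  # r = 3 illustrative states
alpha = {"x1": 1.0, "x2": 1.0, "x3": 1.0}    # illustrative Dirichlet hyperparameters

data = ["x1", "x3", "x1", "x2", "x1"]        # illustrative observations D
counts = Counter(data)                       # sufficient statistics {N_1, ..., N_r}

# Posterior is Dir(alpha_1 + N_1, ..., alpha_r + N_r)
posterior = {s: alpha[s] + counts[s] for s in states}

# Predictive probability of each state on the next observation
total = sum(posterior.values())
predictive = {s: posterior[s] / total for s in states}
print(predictive)
```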
Bayesian Network
Network structure of a BN:
– Directed acyclic graph (DAG)
– Each node of the graph represents a variable
– Each arc asserts a dependence relationship between a pair of variables
– A probability table associates each node with its immediate parent nodes
Bayesian Network (cont’d)
A Bayesian network for detecting credit-card fraud.
Direction of arcs: from parent to descendant node.
Parents of node X_i: Pa_i
Pa(Jewelry) = {Fraud, Age, Sex}
Bayesian Network (cont’d)
• Network structure: S
• Set of variables: X = {X_1, X_2, ..., X_N}
• Parents of X_i: Pa_i
• Joint distribution of X:

p(x) = ∏_{i=1}^N p(x_i | pa_i)

• Markov condition: with nd(X_i) the nondescendant nodes of X_i,

p(x_i | nd(x_i), pa_i) = p(x_i | pa_i)
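A minimal sketch of the factored joint on a toy cut of the fraud example; the three-node structure and all CPT numbers are invented for illustration, not taken from the paper:

```python
# Each node maps to (parents, CPT); the CPT gives p(node = True | parent values).
network = {
    "Fraud":   ((), {(): 0.01}),
    "Gas":     (("Fraud",), {(True,): 0.20, (False,): 0.01}),
    "Jewelry": (("Fraud",), {(True,): 0.05, (False,): 0.001}),
}

def joint(assignment):
    """p(x) = product over i of p(x_i | pa_i), for a full True/False assignment."""
    p = 1.0
    for var, (parents, cpt) in network.items():
        p_true = cpt[tuple(assignment[q] for q in parents)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

print(joint({"Fraud": True, "Gas": True, "Jewelry": False}))
# = 0.01 * 0.20 * (1 - 0.05)
```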
Constructing BN
Given the set X = {X_1, X_2, ..., X_N}:

p(x) = ∏_{i=1}^N p(x_i | x_1, ..., x_{i-1})    (chain rule of probability)

Now, for every X_i, find a subset Pa_i ⊆ {X_1, ..., X_{i-1}} such that X_i and {X_1, ..., X_{i-1}} \ Pa_i are conditionally independent given Pa_i. Then:

p(x) = ∏_{i=1}^N p(x_i | pa_i)
Constructing BN (cont’d)
Using the ordering (F, A, S, G, J) we recover the fraud network above.
But by using the ordering (J, G, S, A, F) we obtain a fully connected structure.
So use prior assumptions about the causal relationships among the variables to pick a sensible ordering.
Inference in BN
The goal is to compute any probability of interest (probabilistic inference).
Inference in an arbitrary BN over discrete variables, even approximate inference, is NP-hard (Cooper, 1990; Dagum and Luby, 1993).
Most commonly used algorithms: Lauritzen & Spiegelhalter (1988), Jensen et al. (1990), and Dawid (1992).
• Basic idea: transform the BN into a tree
– exploit mathematical properties of that tree
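Brute-force enumeration is the simplest (exponential-time) way to do exact inference, and is enough for toy networks; the tree-based algorithms above scale far better. The toy network from the previous sketch is repeated here so this block runs on its own:

```python
from itertools import product

# Toy network with invented CPT values: Fraud -> Gas, Fraud -> Jewelry
network = {
    "Fraud":   ((), {(): 0.01}),
    "Gas":     (("Fraud",), {(True,): 0.20, (False,): 0.01}),
    "Jewelry": (("Fraud",), {(True,): 0.05, (False,): 0.001}),
}

def joint(assignment):
    # p(x) = product over i of p(x_i | pa_i)
    p = 1.0
    for var, (parents, cpt) in network.items():
        p_true = cpt[tuple(assignment[q] for q in parents)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

def infer(query, evidence):
    # p(query = True | evidence) by summing the joint over all completions
    num = den = 0.0
    for values in product([True, False], repeat=len(network)):
        a = dict(zip(network, values))
        if any(a[v] != val for v, val in evidence.items()):
            continue
        p = joint(a)
        den += p
        if a[query]:
            num += p
    return num / den

print(infer("Fraud", {"Jewelry": True, "Gas": True}))   # posterior probability of fraud
```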
Learning in BN
• Learning the parameters from data
• Learning the structure from data
Learning the parameters: known structure, fully observable data.
Learning parameters in BN
• Recall the thumbtack problem:
Step 1:

p(θ | D, ξ) = p(θ | ξ) p(D | θ, ξ) / p(D | ξ)

Step 2: expand p(D | θ, ξ)
Step 3: average over the possible values of Θ to determine the probability
• Joint probability distribution:
Learning parameters in BN (cont’d)
p(x | θ_s, S^h) = ∏_{i=1}^N p(x_i | pa_i, θ_i, S^h)

S^h: hypothesis that the network structure is S
θ_i: vector of parameters for the local distribution of X_i
θ_s: the vector {θ_1, θ_2, ..., θ_N}
D = {X_1, X_2, ..., X_N}: a random sample

The goal is to calculate the posterior distribution p(θ_s | D, S^h).

Illustration with the multinomial distribution:
• Each X_i is discrete, taking values from {x_i^1, ..., x_i^{r_i}}
• The local distribution is a collection of multinomial distributions, one for each configuration of Pa_i
Learning parameters in BN (cont’d)
• Configurations of Pa_i: pa_i^1, pa_i^2, ..., pa_i^{q_i}
• θ_ij: the vector of multinomial parameters for X_i given Pa_i = pa_i^j
• The vectors θ_ij are assumed mutually independent (parameter independence):

p(θ_s | S^h) = ∏_{i=1}^N ∏_{j=1}^{q_i} p(θ_ij | S^h)
Learning parameters in BN (cont’d)
Therefore we can update each vector θ_ij independently.
Assume the prior distribution of θ_ij is Dirichlet:

p(θ_ij | S^h) = Dir(θ_ij | α_ij1, ..., α_ijr_i)

Thus the posterior distribution of θ_ij is:

p(θ_ij | D, S^h) = Dir(θ_ij | α_ij1 + N_ij1, ..., α_ijr_i + N_ijr_i)

where N_ijk is the number of cases in D in which X_i = x_i^k and Pa_i = pa_i^j.

To compute p(x_{N+1} | D, S^h), we have to average over the possible configurations of θ_s:
Learning parameters in BN (cont’d)
p(x_{N+1} | D, S^h) = ∫ p(x_{N+1} | θ_s, S^h) p(θ_s | D, S^h) dθ_s

Using parameter independence, we obtain:

p(x_{N+1} | D, S^h) = ∏_{i=1}^N (α_ijk + N_ijk) / (α_ij + N_ij)

where α_ij = Σ_k α_ijk, N_ij = Σ_k N_ijk, and j is the configuration of Pa_i consistent with x_{N+1}.
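A sketch of this closed-form predictive for complete data: tally the counts N_ijk, then combine them with the hyperparameters. The two-node structure, uniform α, and data below are illustrative:

```python
from collections import defaultdict

# Illustrative complete data for a toy Fraud -> Jewelry structure:
# each case assigns every variable.
data = [
    {"Fraud": False, "Jewelry": False},
    {"Fraud": False, "Jewelry": False},
    {"Fraud": True,  "Jewelry": True},
    {"Fraud": False, "Jewelry": True},
]

parents = {"Fraud": (), "Jewelry": ("Fraud",)}
alpha = 1.0        # illustrative uniform hyperparameter alpha_ijk

# N_ijk: number of cases with X_i = x_i^k and Pa_i in configuration pa_i^j
counts = defaultdict(lambda: defaultdict(float))
for case in data:
    for var, pa_vars in parents.items():
        j = tuple(case[v] for v in pa_vars)      # parent configuration pa_i^j
        counts[(var, j)][case[var]] += 1

def predictive(var, value, pa_config):
    """(alpha_ijk + N_ijk) / (alpha_ij + N_ij) for binary variables."""
    n_ijk = counts[(var, pa_config)][value]
    n_ij = sum(counts[(var, pa_config)].values())
    r_i = 2                                      # binary variables here
    return (alpha + n_ijk) / (r_i * alpha + n_ij)

print(predictive("Jewelry", True, (False,)))     # (1 + 1) / (2 + 3)
```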