# A Tutorial on Learning with Bayesian Networks

7 Nov 2013

David Heckerman

What is a Bayesian Network?

“a graphical model for probabilistic relationships
among a set of variables.”

Why use Bayesian Networks?

Don’t need complete data set

Can learn causal relationships

Combines domain knowledge and data

Avoids overfitting, so no data needs to be held out for testing

Probability

Two types:

1. Bayesian

2. Classical

Bayesian Probability

‘Personal’ probability

Degree of belief

Property of person who assigns it

Observations are fixed; imagine all possible values of the parameters from which they could have come

“I think the coin will land on heads 50% of the time”

Classical Probability

Property of environment

‘Physical’ probability

Imagine all data sets of size N that could be
generated by sampling from the distribution
determined by parameters. Each data set occurs
with some probability and produces an estimate

“The probability of getting heads on this particular
coin is 50%”

Notation

Variable: $X$

State of $X$: $x$

Set of variables: $\mathbf{Y}$

Assignment of variables (configuration): $\mathbf{y}$

Probability that $X = x$ for a person with state of information $\xi$: $p(X = x \mid \xi)$

Uncertain variable: $\Theta$

Parameter: $\theta$

Outcome of the $l$th try: $X_l$

Observations: $D = \{X_1 = x_1, \ldots, X_N = x_N\}$

Example

Thumbtack problem: will it land on the point
(heads) or the flat bit (tails)?

Flip it N times

What will it do on the N+1th time?

How to compute $p(x_{N+1} \mid D, \xi)$ from $p(\theta \mid \xi)$?

Step 1

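Use Bayes’ rule to get the probability distribution for $\Theta$ given $D$ and $\xi$:

$$p(\theta \mid D, \xi) = \frac{p(\theta \mid \xi)\, p(D \mid \theta, \xi)}{p(D \mid \xi)}$$

where

$$p(D \mid \xi) = \int p(D \mid \theta, \xi)\, p(\theta \mid \xi)\, d\theta$$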

Step 2

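Expand $p(D \mid \theta, \xi)$: the likelihood function for binomial sampling

Observations in $D$ are mutually independent given $\theta$: the probability of heads is $\theta$ and of tails is $1 - \theta$

Substitute into the previous equation:

$$p(\theta \mid D, \xi) = \frac{p(\theta \mid \xi)\, \theta^{h} (1 - \theta)^{t}}{p(D \mid \xi)}$$

where $h$ and $t$ are the numbers of heads and tails observed in $D$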

Step 3

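Average over possible values of $\Theta$ to determine the probability that the next toss comes up heads:

$$p(X_{N+1} = \text{heads} \mid D, \xi) = \int \theta\, p(\theta \mid D, \xi)\, d\theta = E_{p(\theta \mid D, \xi)}(\theta)$$

$E_{p(\theta \mid D, \xi)}(\theta)$ is the expectation of $\theta$ w.r.t. the distribution $p(\theta \mid D, \xi)$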
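The three steps can be traced numerically. The sketch below is only an illustration under assumptions that are not in the slides: a uniform prior over $\theta$, a grid of 1001 points standing in for the integral, and hypothetical counts of 3 heads and 7 tails.

```python
# Numerical sketch of Steps 1-3 for the thumbtack problem.
# Assumptions (not from the slides): uniform prior, 1001-point grid,
# hypothetical data with 3 heads and 7 tails.

def uniform_prior(theta):
    """Hypothetical prior density p(theta): flat on [0, 1]."""
    return 1.0

def grid_posterior(heads, tails, prior, grid):
    """Return p(theta | D) evaluated on the grid via Bayes' rule."""
    # Step 2: binomial likelihood theta^h * (1 - theta)^t at each grid point.
    unnorm = [prior(th) * th**heads * (1.0 - th)**tails for th in grid]
    # Step 1: normalise by p(D), approximated here by the sum over the grid.
    total = sum(unnorm)
    return [u / total for u in unnorm]

def predictive_heads(posterior, grid):
    """Step 3: p(next toss = heads | D) is the posterior expectation of theta."""
    return sum(th * w for th, w in zip(grid, posterior))

grid = [i / 1000 for i in range(1001)]
post = grid_posterior(3, 7, uniform_prior, grid)
p_heads = predictive_heads(post, grid)
print(round(p_heads, 3))
```

With a uniform prior the exact posterior is Beta(4, 8), so the grid estimate should land close to 4/12.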

Prior Distribution

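The prior is taken from a beta distribution:

$$p(\theta \mid \xi) = \mathrm{Beta}(\theta \mid \alpha_h, \alpha_t)$$

$\alpha_h, \alpha_t$ are called hyperparameters, to distinguish them from the parameter $\theta$

$h$ and $t$ (the numbers of heads and tails in $D$) are the sufficient statistics

Beta prior means the posterior is beta too: $p(\theta \mid D, \xi) = \mathrm{Beta}(\theta \mid \alpha_h + h, \alpha_t + t)$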
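Because the Beta prior is conjugate to binomial sampling, the update reduces to adding counts to the hyperparameters. A minimal sketch, assuming a hypothetical Beta(2, 2) prior and hypothetical data of 3 heads and 7 tails (the function names are illustrative):

```python
# Closed-form Beta update for the thumbtack problem.
# Hypothetical numbers: prior Beta(2, 2), then 3 heads and 7 tails observed.

def beta_update(alpha_h, alpha_t, heads, tails):
    """Posterior hyperparameters: Beta(alpha_h + heads, alpha_t + tails)."""
    return alpha_h + heads, alpha_t + tails

def beta_mean(alpha_h, alpha_t):
    """Mean of Beta(alpha_h, alpha_t): the probability of heads on the next toss."""
    return alpha_h / (alpha_h + alpha_t)

post_h, post_t = beta_update(2, 2, heads=3, tails=7)
p_next_heads = beta_mean(post_h, post_t)
print(post_h, post_t, p_next_heads)
```

Here the posterior is Beta(5, 9), whose mean 5/14 is the predictive probability of heads.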

Assessing the prior

Imagined future data:

Assess probability in first toss of thumbtack

Imagine you’ve seen outcomes of k flips

Reassess probability

Equivalent samples

Starting from a Beta(0,0) prior, after observing $\alpha_h$ heads and $\alpha_t$ tails the posterior will be $\mathrm{Beta}(\alpha_h, \alpha_t)$

Beta (0,0) is state of minimum information

Assess $\alpha_h, \alpha_t$ by determining the number of observations of heads and tails equivalent to our current knowledge
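One way to turn the equivalent-samples idea into numbers is to assess a probability of heads together with an equivalent sample size and split that size between the two hyperparameters. The helper below is a sketch with a hypothetical name and hypothetical inputs; it relies only on the fact that the mean of $\mathrm{Beta}(\alpha_h, \alpha_t)$ is $\alpha_h / (\alpha_h + \alpha_t)$.

```python
# Sketch: derive Beta hyperparameters from an assessed probability of heads
# and an equivalent sample size (both values below are hypothetical).

def beta_from_equivalent_sample(p_heads, sample_size):
    """Return (alpha_h, alpha_t) with mean p_heads and alpha_h + alpha_t = sample_size."""
    if not 0.0 < p_heads < 1.0:
        raise ValueError("p_heads must lie strictly between 0 and 1")
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    alpha_h = p_heads * sample_size          # imagined number of heads
    alpha_t = (1.0 - p_heads) * sample_size  # imagined number of tails
    return alpha_h, alpha_t

# "My belief is worth about 10 tosses, and I expect heads 30% of the time."
alpha_h, alpha_t = beta_from_equivalent_sample(0.3, 10)
print(alpha_h, alpha_t)
```

A larger equivalent sample size expresses a firmer prior: more real data is then needed to move the posterior away from it.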

Can’t always use Beta prior

What if you bought the thumbtack in a magic
shop? It could be biased.

Need a mixture of Betas, which introduces a hidden variable $H$
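A mixture-of-Betas prior can still be updated in closed form: each component is updated like an ordinary Beta, and its weight is rescaled by how well that component predicted the data. The sketch below assumes illustrative weights and hyperparameters (a heads-biased, a tails-biased and a roughly fair component) and hypothetical data; the hidden variable $H$ corresponds to the index of the component.

```python
import math

# Sketch: posterior update for a mixture-of-Betas prior.
# All weights, hyperparameters and counts are illustrative assumptions.

def log_beta_fn(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def update_mixture(components, heads, tails):
    """components: list of (weight, alpha_h, alpha_t); returns the posterior mixture."""
    log_terms = []
    for weight, a_h, a_t in components:
        # Marginal likelihood of the observed sequence under this component:
        # B(a_h + heads, a_t + tails) / B(a_h, a_t).
        log_ml = log_beta_fn(a_h + heads, a_t + tails) - log_beta_fn(a_h, a_t)
        log_terms.append(math.log(weight) + log_ml)
    shift = max(log_terms)  # subtract the maximum for numerical stability
    unnorm = [math.exp(t - shift) for t in log_terms]
    total = sum(unnorm)
    return [(u / total, a_h + heads, a_t + tails)
            for u, (_, a_h, a_t) in zip(unnorm, components)]

def predictive_heads(components):
    """Probability of heads on the next toss under a Beta mixture."""
    return sum(w * a_h / (a_h + a_t) for w, a_h, a_t in components)

prior = [(0.4, 20, 1), (0.4, 1, 20), (0.2, 2, 2)]
posterior = update_mixture(prior, heads=8, tails=2)
print(predictive_heads(prior), predictive_heads(posterior))
```

After mostly heads, the weight shifts toward the heads-biased component, which a single Beta prior could not express.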

Distributions

We’ve only been talking about binomials so far

Observations could come from any physical
probability distribution

We can still use Bayesian methods. Same as
before:

Define variables for unknown parameters

Assign priors to variables

Use Bayes’ rule to update beliefs

Average over possible values of $\Theta$ to predict things

Exponential Family

For distributions in the exponential family

Calculation can be done efficiently and in closed
form

E.g. Binomial, multinomial, normal, Gamma,
Poisson...

Bernardo and Smith (1994) compiled
important quantities and Bayesian
computations for commonly used members of
the family

Paper focuses on multinomial sampling

Exponential Family

Multinomial sampling

X is discrete

$r$ possible states $x^1, \ldots, x^r$

Likelihood function: $p(X = x^k \mid \theta, \xi) = \theta_k$, for $k = 1, \ldots, r$

Same number of parameters as states

Parameters = physical probabilities

Sufficient statistics for $D = \{X_1 = x_1, \ldots, X_N = x_N\}$:

$\{N_1, \ldots, N_r\}$, where $N_i$ is the number of times $X = x^i$ in $D$

Multinomial Sampling

Prior used is Dirichlet:

$$p(\theta \mid \xi) = \mathrm{Dir}(\theta \mid \alpha_1, \ldots, \alpha_r)$$

Posterior is Dirichlet too:

$$p(\theta \mid D, \xi) = \mathrm{Dir}(\theta \mid \alpha_1 + N_1, \ldots, \alpha_r + N_r)$$

Can be assessed in the same way as the Beta distribution
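The Dirichlet update mirrors the Beta case: add the count $N_k$ of each state to its hyperparameter $\alpha_k$. A minimal sketch with hypothetical states, a hypothetical Dir(1, 1, 1) prior and made-up observations:

```python
from collections import Counter

# Sketch: Dirichlet-multinomial update. States, prior and data are hypothetical.

def dirichlet_update(alphas, observations, states):
    """Return posterior hyperparameters alpha_k + N_k, in the order of `states`."""
    counts = Counter(observations)  # sufficient statistics N_1, ..., N_r
    return [alpha + counts[state] for alpha, state in zip(alphas, states)]

def predictive(alphas):
    """Probability of each state on the next observation: alpha_k / sum(alpha)."""
    total = sum(alphas)
    return [alpha / total for alpha in alphas]

states = ["red", "green", "blue"]
prior = [1.0, 1.0, 1.0]
data = ["red", "red", "blue", "green", "red"]

posterior = dirichlet_update(prior, data, states)
print(posterior, predictive(posterior))
```

Only the counts matter here, not the order of the observations, which is what makes $\{N_1, \ldots, N_r\}$ sufficient statistics.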

Bayesian Network

Network structure of BN:

Directed acyclic graph (DAG)

Each node of the graph represents a variable

Each arc asserts a dependence relationship between the pair of variables it connects

A probability table associates each node with its immediate parent nodes

Bayesian Network (cont’d)

A Bayesian network for detecting credit-card fraud

Direction of arcs: from parent to descendant node

Parents of node $X_i$: $\mathbf{Pa}_i$

Pa(Jewelry) = {Fraud, Age, Sex}

Bayesian Network (cont’d)

Network structure: S

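Set of variables: $\mathbf{X} = \{X_1, X_2, \ldots, X_N\}$

Parents of $X_i$: $\mathbf{Pa}_i$

Joint distribution of $\mathbf{X}$:

$$p(\mathbf{x}) = \prod_{i=1}^{N} p(x_i \mid \mathbf{pa}_i)$$

Markov condition:

$$p(x_i \mid \mathbf{nd}(x_i), \mathbf{pa}_i) = p(x_i \mid \mathbf{pa}_i)$$

where $ND(X_i)$ denotes the nondescendant nodes of $X_i$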
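The factorisation of the joint distribution can be made concrete with a toy version of the fraud network. In the sketch below only Pa(Jewelry) = {Fraud, Age, Sex} comes from the slides; treating every variable as Boolean, taking Pa(Gas) = {Fraud}, and every probability in the tables are assumptions made purely for illustration.

```python
from itertools import product

# Structure: parents of each node. Pa(Gas) = (Fraud,) is an assumption.
parents = {
    "Fraud": (),
    "Age": (),       # True stands for "young" (hypothetical encoding)
    "Sex": (),       # True stands for "male" (hypothetical encoding)
    "Gas": ("Fraud",),
    "Jewelry": ("Fraud", "Age", "Sex"),
}

# p(variable = True | parent configuration); all numbers are made up.
p_true = {
    "Fraud": {(): 0.01},
    "Age": {(): 0.3},
    "Sex": {(): 0.5},
    "Gas": {(True,): 0.2, (False,): 0.01},
    "Jewelry": {
        (True, True, True): 0.05, (True, True, False): 0.05,
        (True, False, True): 0.05, (True, False, False): 0.05,
        (False, True, True): 0.0001, (False, True, False): 0.0005,
        (False, False, True): 0.0004, (False, False, False): 0.002,
    },
}

def joint(assignment):
    """p(x) as the product over nodes of p(x_i | pa_i)."""
    prob = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        p = p_true[var][pa_values]
        prob *= p if assignment[var] else 1.0 - p
    return prob

names = list(parents)
total = sum(joint(dict(zip(names, values)))
            for values in product([True, False], repeat=len(names)))
print(total)
```

The network needs 1 + 1 + 1 + 2 + 8 = 13 numbers here, against 31 for an unrestricted joint table over five Boolean variables.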

Constructing BN

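Given the set $\mathbf{X} = \{X_1, X_2, \ldots, X_N\}$, the chain rule of probability gives:

$$p(\mathbf{x}) = \prod_{i=1}^{N} p(x_i \mid x_1, \ldots, x_{i-1})$$

Now, for every $X_i$ there is a subset $\Pi_i \subseteq \{X_1, X_2, \ldots, X_{i-1}\}$ such that $X_i$ and $\{X_1, \ldots, X_{i-1}\} \setminus \Pi_i$ are conditionally independent given $\Pi_i$, so:

$$p(\mathbf{x}) = \prod_{i=1}^{N} p(x_i \mid \pi_i)$$

The parents of $X_i$ are then $\mathbf{Pa}_i = \Pi_i$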

Constructing BN (cont’d)

Using the ordering (F,A,S,G,J)

But by using the ordering (J, G, S, A, F) we obtain a fully connected structure

Use prior assumptions about the causal relationships among variables

Inference in BN

The goal is to compute any probability of interest
(probabilistic inference)

Inference (even approximate) in an arbitrary BN for discrete variables is NP-hard (Cooper, 1990; Dagum and Luby, 1993)

Most commonly used algorithms: Lauritzen & Spiegelhalter (1988), Jensen et al. (1990) and Dawid (1992)

Basic idea: transform the BN to a tree and exploit the mathematical properties of that tree
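For a very small network, a probability of interest can be computed by brute-force enumeration of the joint distribution. The sketch below is not one of the tree-based algorithms cited above; it sums over every configuration, which grows exponentially with the number of variables. It uses a simplified, hypothetical three-variable network (Fraud as the only parent of Gas and of Jewelry) with made-up probabilities.

```python
from itertools import product

# Hypothetical simplified network: Fraud -> Gas, Fraud -> Jewelry.
P_FRAUD = 0.01                           # p(Fraud = True)
P_GAS = {True: 0.2, False: 0.01}         # p(Gas = True | Fraud)
P_JEWELRY = {True: 0.05, False: 0.001}   # p(Jewelry = True | Fraud)

def joint(fraud, gas, jewelry):
    """p(fraud, gas, jewelry) from the factorisation of the network."""
    p = P_FRAUD if fraud else 1.0 - P_FRAUD
    p *= P_GAS[fraud] if gas else 1.0 - P_GAS[fraud]
    p *= P_JEWELRY[fraud] if jewelry else 1.0 - P_JEWELRY[fraud]
    return p

def posterior_fraud(evidence):
    """p(Fraud = True | evidence); evidence may set 'gas' and/or 'jewelry'."""
    score = {True: 0.0, False: 0.0}
    for fraud, gas, jewelry in product([True, False], repeat=3):
        world = {"gas": gas, "jewelry": jewelry}
        # Keep only the configurations consistent with the evidence.
        if all(world[name] == value for name, value in evidence.items()):
            score[fraud] += joint(fraud, gas, jewelry)
    return score[True] / (score[True] + score[False])

print(posterior_fraud({}))
print(posterior_fraud({"gas": True, "jewelry": True}))
```

With no evidence the result is just the prior probability of fraud; observing both purchases raises it sharply under these made-up numbers.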

Inference in BN (cont’d)

Learning in BN

Learning the parameters from data

Learning the structure from data

Learning the parameters: known structure, data is fully observable

Learning parameters in BN

Recall thumbtack problem:

Step 1: use Bayes’ rule

$$p(\theta \mid D, \xi) = \frac{p(\theta \mid \xi)\, p(D \mid \theta, \xi)}{p(D \mid \xi)}$$

Step 2: expand $p(D \mid \theta, \xi)$

Step 3: average over possible values of $\Theta$ to determine probability

Learning parameters in BN (cont’d)

Joint probability distribution:

$$p(\mathbf{x} \mid \theta_s, S^h) = \prod_{i=1}^{N} p(x_i \mid \mathbf{pa}_i, \theta_i, S^h)$$

$S^h$: hypothesis of structure $S$

$\theta_i$: vector of parameters for the local distribution $p(x_i \mid \mathbf{pa}_i, \theta_i, S^h)$

$\theta_s$: vector of $\{\theta_1, \theta_2, \ldots, \theta_N\}$

$D = \{X_1, X_2, \ldots, X_N\}$: random sample

Goal is to calculate the posterior distribution: $p(\theta_s \mid D, S^h)$

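Illustration with multinomial distributions:

Each $X_i$ is discrete, with values from $\{x_i^1, x_i^2, \ldots, x_i^{r_i}\}$

The local distribution is a collection of multinomial distributions, one for each configuration of $\mathbf{Pa}_i$

Learning parameters in BN (cont’d)

$$p(x_i^k \mid \mathbf{pa}_i^j, \theta_i, S^h) = \theta_{ijk} > 0$$

$\mathbf{pa}_i^1, \mathbf{pa}_i^2, \ldots, \mathbf{pa}_i^{q_i}$: configurations of $\mathbf{Pa}_i$

$\theta_{ij}$: vector of the parameters $\theta_{ijk}$ for configuration $\mathbf{pa}_i^j$; the vectors $\theta_{ij}$ are assumed mutually independent

Parameter independence:

$$p(\theta_s \mid S^h) = \prod_{i=1}^{N} \prod_{j=1}^{q_i} p(\theta_{ij} \mid S^h)$$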

Learning parameters in BN (cont’d)

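Therefore:

$$p(\theta_s \mid D, S^h) = \prod_{i=1}^{N} \prod_{j=1}^{q_i} p(\theta_{ij} \mid D, S^h)$$

We can update each vector $\theta_{ij}$ independently

Assume that the prior distribution of $\theta_{ij}$ is $\mathrm{Dir}(\theta_{ij} \mid \alpha_{ij1}, \ldots, \alpha_{ijr_i})$

Thus, the posterior distribution of $\theta_{ij}$ is:

$$p(\theta_{ij} \mid D, S^h) = \mathrm{Dir}(\theta_{ij} \mid \alpha_{ij1} + N_{ij1}, \ldots, \alpha_{ijr_i} + N_{ijr_i})$$

where $N_{ijk}$ is the number of cases in $D$ in which $X_i = x_i^k$ and $\mathbf{Pa}_i = \mathbf{pa}_i^j$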
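With complete data and a known structure, learning the parameters amounts to counting $N_{ijk}$ for every node and parent configuration and adding the counts to the Dirichlet hyperparameters. The sketch below assumes a hypothetical two-node network (Fraud as the parent of Gas), a handful of made-up cases, and the same prior hyperparameter for every $i, j, k$; none of these come from the slides.

```python
from itertools import product

# Sketch: Bayesian parameter learning for a network with known structure and
# complete data. Network, data and prior value are hypothetical.

def learn_parameters(data, parents, states, prior_alpha=1.0):
    """Return posterior Dirichlet hyperparameters alpha_ijk + N_ijk.

    data:    list of dicts mapping variable name -> observed state
    parents: dict mapping variable name -> tuple of parent names
    states:  dict mapping variable name -> list of possible states
    """
    posterior = {}
    for var, pa in parents.items():
        # One Dirichlet per parent configuration j, initialised at the prior.
        table = {
            config: {k: prior_alpha for k in states[var]}
            for config in product(*(states[p] for p in pa))
        }
        for case in data:
            config = tuple(case[p] for p in pa)
            table[config][case[var]] += 1  # adds one to N_ijk
        posterior[var] = table
    return posterior

def predictive(posterior, var, config):
    """p(X_i = k | Pa_i = config, D): normalised posterior hyperparameters."""
    alphas = posterior[var][config]
    total = sum(alphas.values())
    return {k: a / total for k, a in alphas.items()}

parents = {"Fraud": (), "Gas": ("Fraud",)}
states = {"Fraud": ["no", "yes"], "Gas": ["no", "yes"]}
data = [
    {"Fraud": "no", "Gas": "no"},
    {"Fraud": "no", "Gas": "no"},
    {"Fraud": "no", "Gas": "yes"},
    {"Fraud": "yes", "Gas": "yes"},
]

posterior = learn_parameters(data, parents, states)
print(posterior["Gas"])
print(predictive(posterior, "Gas", ("yes",)))
```

Each table row is updated on its own, which is exactly what parameter independence permits.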

To compute $p(X_{N+1} \mid D, S^h)$, we have to average over possible configurations of $\theta_s$:

Learning parameters in BN (cont’d)

Using parameter independence:

we obtain:

where