# Learning with Bayesian Networks

Artificial Intelligence and Robotics

7 Nov 2013


Author: David Heckerman

Presented by Yan Zhang

April 24, 2006

## Outline

- Bayesian Approach
  - Bayes Theorem
  - Bayesian vs. classical probability methods
  - Coin toss: an example
- Bayesian Network
  - Structure
  - Inference
  - Learning Probabilities
  - Learning the Network Structure
  - Two coin toss: an example
- Conclusions
- Exam Questions

## Bayes Theorem

$$p(A\mid B)=\frac{p(B\mid A)\,p(A)}{p(B)}$$

where

$$p(B)=\sum_{A} p(B\mid A)\,p(A) \quad\text{or}\quad p(B)=\sum_{i} p(B\mid A_i)\,p(A_i)$$

Applied to a parameter $\theta$ or to a structure hypothesis $S^h$:

$$p(\theta\mid D)=\frac{p(D\mid\theta)\,p(\theta)}{p(D)},\qquad p(S^h\mid D)=\frac{p(D\mid S^h)\,p(S^h)}{p(D)}$$

## Bayesian vs. the Classical Approach

The Bayesian probability of an event x represents a person's degree of belief or confidence in that event's occurrence, based on prior and observed facts.

Classical probability refers to the true or actual probability of the event and is not concerned with observed behavior.


The Bayesian approach restricts its prediction to the next, (N+1)st, occurrence of an event, given the N previously observed events.

The classical approach predicts the likelihood of any given event regardless of the number of occurrences.

## Example

Toss a coin 100 times; let the random variable X denote the outcome of one flip, with p(X = heads) = θ and p(X = tails) = 1 − θ.

Before doing this experiment, we have some belief in our mind:

Prior probability: $p(\theta\mid\xi)=\mathrm{Beta}(\theta\mid a{=}5,\,b{=}5)$, with

$$E[\theta]=\frac{a}{a+b}=0.5,\qquad \mathrm{Var}(\theta)=\frac{ab}{(a+b)^2(a+b+1)}$$

where the beta density is $\mathrm{Beta}(\theta\mid a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1}$.

Experiment finished: h = 65 heads, t = 35 tails. What is p(θ|D,ξ)?

$$p(\theta\mid D,\xi)=\frac{p(D\mid\theta,\xi)\,p(\theta\mid\xi)}{p(D\mid\xi)}
=\frac{\left[k_1\,\theta^{h}(1-\theta)^{t}\right]\left[k_2\,\theta^{a-1}(1-\theta)^{b-1}\right]}{k_3}
=\mathrm{Beta}(\theta\mid a{=}5{+}h,\ b{=}5{+}t)$$

$$E[\theta\mid D]=\frac{a+h}{a+h+b+t}=\frac{5+65}{5+65+5+35}\approx 0.64$$

## Integration

To find the probability that $X_{N+1}=\text{heads}$, we must integrate over all possible values of θ to find the average value of θ, which yields:

$$p(X_{N+1}=\text{heads}\mid D,\xi)=\int p(X_{N+1}=\text{heads}\mid\theta,\xi)\,p(\theta\mid D,\xi)\,d\theta=\int \theta\,p(\theta\mid D,\xi)\,d\theta=E[\theta\mid D]\approx 0.64$$
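The integral can also be checked numerically; a midpoint-rule sketch (our own, not from the slides):

```python
import math

a, b = 70, 40  # posterior Beta parameters from the coin-toss example
beta_norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def posterior_pdf(x):
    # Beta(70, 40) density
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_norm

# p(X_{N+1} = heads | D) = integral of theta * p(theta | D) d theta
n = 200_000
pred = sum(((i + 0.5) / n) * posterior_pdf((i + 0.5) / n) for i in range(n)) / n
print(round(pred, 3))  # ≈ 0.636, i.e. E[theta|D]
```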
## Bayesian Probabilities

- Posterior probability, p(θ|D,ξ): probability of a particular value of θ given that D has been observed (our final belief about θ).
- Prior probability, p(θ|ξ): prior probability of a particular value of θ given no observed data (our previous "belief").
- Observed probability or "likelihood", p(D|θ,ξ): likelihood of the sequence of coin tosses D being observed given that θ takes a particular value.
- p(D|ξ): raw probability of D.

## Priors

In the previous example we used a beta prior because the variable X has only two states/outcomes.

In general, if the observed variable X is discrete with r possible states {1,…,r}, the likelihood function is given by

$$p(X=x^k\mid\theta,\xi)=\theta_k,\qquad k=1,\dots,r$$

where $\theta=\{\theta_1,\dots,\theta_r\}$ and $\sum_k \theta_k = 1$.
We use the Dirichlet distribution as the prior:

$$p(\theta\mid\xi)=\mathrm{Dir}(\theta\mid\alpha_1,\dots,\alpha_r)=\frac{\Gamma(\alpha)}{\prod_{k=1}^{r}\Gamma(\alpha_k)}\prod_{k=1}^{r}\theta_k^{\alpha_k-1},\qquad \alpha=\sum_{k=1}^{r}\alpha_k$$

and we can derive the posterior distribution, where $N_k$ counts the observations of state k:

$$p(\theta\mid D,\xi)=\mathrm{Dir}(\theta\mid \alpha_1+N_1,\dots,\alpha_r+N_r)$$
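The Dirichlet update is just counting; a minimal sketch (the 3-state variable, its data, and the Dir(1,1,1) prior below are our own hypothetical choices):

```python
from collections import Counter

states = ["a", "b", "c"]          # hypothetical 3-state variable
alpha = {"a": 1, "b": 1, "c": 1}  # Dir(1,1,1) prior
data = ["a", "b", "a", "c", "a"]  # hypothetical observations

# Posterior: Dir(alpha_k + N_k), where N_k counts observations of state k
counts = Counter(data)
alpha_post = [alpha[s] + counts[s] for s in states]

# Posterior mean of each theta_k
total = sum(alpha_post)
theta_mean = [a_k / total for a_k in alpha_post]
print(alpha_post)   # [4, 2, 2]
print(theta_mean)   # [0.5, 0.25, 0.25]
```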


## Introduction to Bayesian Networks

Bayesian networks represent an advanced
form of general Bayesian probability

A Bayesian network is a graphical model
that encodes probabilistic relationships
among variables of interest

The model has several advantages for data analysis over rule-based decision trees.

## Advantages of Bayesian Techniques (1)

How do Bayesian techniques compare to
other learning models?

Bayesian networks can readily
handle incomplete data sets.

## Advantages of Bayesian Techniques (2)

Bayesian networks allow one to
learn about causal relationships

We can use observed knowledge to
determine the validity of the acyclic
graph that represents the Bayesian
network.

Observed knowledge may strengthen or
weaken this argument.

## Advantages of Bayesian Techniques (3)

Bayesian networks readily facilitate the use of prior knowledge.

Encoding prior knowledge is relatively straightforward: construct "causal" edges between any two factors that are believed to be correlated.

Causal networks represent prior knowledge, whereas the weights of the directed edges can be updated in a posterior manner based on new data.

## Advantages of Bayesian Techniques (4)

Bayesian methods provide an efficient approach for preventing the overfitting of data (there is no need for pre-processing).

Contradictions do not need to be removed from the data, and data can be "smoothed" so that all available data can be used.

## Example Network

Consider a credit fraud network designed to determine the probability of credit fraud based on certain events. Variables include:

- Fraud (f): whether fraud occurred or not
- Gas (g): whether gas was purchased within the last 24 hours
- Jewelry (j): whether jewelry was purchased in the last 24 hours
- Age (a): age of the card holder
- Sex (s): sex of the card holder

The task of determining which variables to include is not trivial and involves decision analysis.

## Example Network

A Bayesian network consists of:

- A set of variables X = {X₁,…,Xₙ}
- A network structure
- A conditional probability table (CPT) for each variable

The nodes of the network (figure) are X₁ = Fraud, X₂ = Age, X₃ = Sex, X₄ = Gas, X₅ = Jewelry. For example, the CPT for X₅ (Jewelry) lists a parameter θ_ijk for each state k of X_i and each configuration j of its parents:

| X₁ (Fraud) | X₂ (Age) | X₃ (Sex) | θ_5j1 (X₅ = yes) |
|---|---|---|---|
| yes | <30 | m | θ_511 |
| yes | <30 | f | θ_521 |
| yes | 30–50 | m | θ_531 |
| yes | 30–50 | f | θ_541 |
| yes | >50 | m | θ_551 |
| yes | >50 | f | θ_561 |
| no | … | … | … |
| no | >50 | f | θ_5,12,1 |

These entries give p(X₅ = yes | parents); the X₅ = no column is analogous (θ_512, …, θ_5,12,2).

## Example Network

Using the graph of expected causes (X₁ = Fraud, X₂ = Age, X₃ = Sex, X₄ = Gas, X₅ = Jewelry), we can check for conditional independence of the following probabilities given initial sample data:

- p(a|f) = p(a)
- p(s|f,a) = p(s)
- p(g|f,a,s) = p(g|f)
- p(j|f,a,s,g) = p(j|f,a,s)
## Inference in a Bayesian Network

Probabilistic inference is the computation of a probability of interest given a model. We use it to determine various probabilities of interest from the model, e.g. the probability of fraud given the other variables:

$$p(f{=}\text{yes}\mid a,s,g,j)=\frac{p(f{=}\text{yes},a,s,g,j)}{p(a,s,g,j)}
=\frac{p(f{=}\text{yes})\,p(a)\,p(s)\,p(g\mid f{=}\text{yes})\,p(j\mid f{=}\text{yes},a,s)}{\sum_{i=1}^{2}p(f_i)\,p(a)\,p(s)\,p(g\mid f_i)\,p(j\mid f_i,a,s)}$$
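The computation above can be sketched as enumeration over the two states of f; all numeric CPT values below are hypothetical, since the slides give none:

```python
# Inference p(f=yes | a, s, g=yes, j=yes) by summing over the states of f.
p_f = {"yes": 0.001, "no": 0.999}       # p(f), hypothetical
p_g_yes = {"yes": 0.2, "no": 0.01}      # p(g=yes | f), hypothetical
p_j_yes = {                             # p(j=yes | f, a, s), hypothetical
    ("yes", "<30", "m"): 0.05,
    ("no", "<30", "m"): 0.0001,
}

def posterior_fraud(a, s):
    # p(a) and p(s) appear in numerator and denominator, so they cancel.
    def joint(f):
        return p_f[f] * p_g_yes[f] * p_j_yes[(f, a, s)]
    return joint("yes") / (joint("yes") + joint("no"))

print(posterior_fraud("<30", "m"))
```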

## Learning Probabilities in a Bayesian Network

The physical joint probability distribution for X = (X₁,…,X₅) (network and CPT as in the example above) can be encoded by the following expression:

$$p(\mathbf{x}\mid\theta_s,S^h)=\prod_{i=1}^{n}p(x_i\mid \mathrm{pa}_i,\theta_i,S^h)$$

where $\theta_s=(\theta_1,\dots,\theta_n)$.


As new data arrive, the probabilities in the CPTs need to be updated. We can then update each vector of parameters θ_ij independently, just as in the one-variable case. Assuming each vector θ_ij has the prior distribution $\mathrm{Dir}(\theta_{ij}\mid\alpha_{ij1},\dots,\alpha_{ijr_i})$, the posterior distribution is

$$p(\theta_{ij}\mid D,S^h)=\mathrm{Dir}(\theta_{ij}\mid \alpha_{ij1}+N_{ij1},\dots,\alpha_{ijr_i}+N_{ijr_i})$$

where $N_{ijk}$ is the number of cases in D in which $X_i=x_i^k$ and $\mathrm{Pa}_i=\mathrm{pa}_i^j$. The full parameter posterior factors as

$$p(\theta_s\mid D,S^h)=\prod_{i=1}^{n}\prod_{j=1}^{q_i}p(\theta_{ij}\mid D,S^h)$$
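The per-(i, j) update is again just counting; a minimal sketch (the cases and the symmetric Dir(1,1) prior are hypothetical):

```python
from collections import Counter

states = ["yes", "no"]  # states of X_i
alpha = 1               # symmetric Dirichlet hyperparameter alpha_ijk = 1

# Hypothetical cases: (value of X_i, index j of its parent configuration)
cases = [("yes", 0), ("no", 0), ("yes", 0), ("no", 1)]

def posterior_theta_ij(j):
    """Posterior mean of theta_ij from the cases with Pa_i = pa_ij."""
    n = Counter(x for x, pj in cases if pj == j)   # N_ijk
    post = [alpha + n[s] for s in states]          # alpha_ijk + N_ijk
    total = sum(post)
    return [p / total for p in post]

print(posterior_theta_ij(0))  # [0.6, 0.4]: Dir(1+2, 1+1) has mean (3/5, 2/5)
```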
## Learning the Network Structure

Sometimes the causal relations are not obvious, so we are uncertain about the network structure. Theoretically, we can use the Bayesian approach to obtain the posterior distribution of the network structure:

$$p(S^h\mid D)=\frac{p(D\mid S^h)\,p(S^h)}{\sum_{i=1}^{m}p(D\mid S_i^h)\,p(S_i^h)}$$

Unfortunately, the number of possible network structures increases exponentially with n, the number of nodes.

Model selection: select a "good" model (i.e., network structure) from all possible models, and use it as if it were the correct model.

Selective model averaging: select a manageable number of good models from among all possible models and pretend that these models are exhaustive.

Questions:

- How do we search for good models?
- How do we decide whether or not a model is "good"?

## Two Coin Toss Example

Experiment: flip two coins and observe the outcomes. We have two network structures in mind, $S_1^h$ or $S_2^h$:

- $S_1^h$: X₁ and X₂ are independent, with p(H) = p(T) = 0.5 for each coin.
- $S_2^h$: X₁ → X₂, with p(H) = p(T) = 0.5 for X₁ and p(H|H) = 0.1, p(T|H) = 0.9, p(H|T) = 0.9, p(T|T) = 0.1 for X₂.

If $p(S_1^h)=p(S_2^h)=0.5$, then after observing some data, which model is more probable for this collection of data?


| # | X₁ | X₂ |
|---|---|---|
| 1 | T | T |
| 2 | T | H |
| 3 | H | T |
| 4 | H | T |
| 5 | H | H |
| 6 | H | T |
| 7 | T | H |
| 8 | T | H |
| 9 | H | T |
| 10 | H | T |

$$p(S^h\mid D)=\frac{p(D\mid S^h)\,p(S^h)}{\sum_{i=1}^{m}p(D\mid S_i^h)\,p(S_i^h)}=\frac{p(D\mid S^h)\,p(S^h)}{\sum_{i=1}^{2}p(D\mid S_i^h)\,p(S_i^h)}$$

$$p(D\mid S^h)=\prod_{d=1}^{10}\prod_{i=1}^{2}p(x_{di}\mid \mathrm{pa}_i,S^h)$$

$$p(S_1^h\mid D)\approx 0.1,\qquad p(S_2^h\mid D)\approx 0.9$$
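The comparison can be sketched with the fixed CPTs above (a point-parameter score rather than an average over parameters, so the exact posterior value is only approximate):

```python
# Score the two structures on the ten observed pairs of flips.
data = [("T", "T"), ("T", "H"), ("H", "T"), ("H", "T"), ("H", "H"),
        ("H", "T"), ("T", "H"), ("T", "H"), ("H", "T"), ("H", "T")]

# S1: two independent fair coins.
def lik_s1(pair):
    return 0.5 * 0.5

# S2: X1 fair, X2 depends on X1 via p(x2 | x1) from the slide.
p_x2_given_x1 = {("H", "H"): 0.1, ("H", "T"): 0.9,
                 ("T", "H"): 0.9, ("T", "T"): 0.1}

def lik_s2(pair):
    x1, x2 = pair
    return 0.5 * p_x2_given_x1[(x1, x2)]

p_d_s1 = 1.0
p_d_s2 = 1.0
for pair in data:
    p_d_s1 *= lik_s1(pair)
    p_d_s2 *= lik_s2(pair)

# Equal structure priors p(S1) = p(S2) = 0.5 cancel in the posterior.
post_s2 = p_d_s2 / (p_d_s1 + p_d_s2)
print(post_s2)  # well above 0.5: S2 explains the data far better
```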

## Conclusions

- Bayesian method
- Bayesian network
  - Structure
  - Inference
  - Learning parameters and structure

## Question 1: What is Bayesian Probability?

A person's degree of belief in a certain event, e.g., your own degree of certainty that a tossed coin will land "heads".

## Question 2: What are the advantages and disadvantages of the Bayesian and classical approaches to probability?

Bayesian approach:

- \+ Reflects an expert's knowledge
- \+ The belief is updated as new data arrive
- − Arbitrary (more subjective)

Classical probability:

- \+ Objective and unbiased
- − Generally not available: it takes a long time to measure the object's physical characteristics

## Question 3: Mention at least 3 advantages of Bayesian analysis

- Handle incomplete data sets
- Learn about causal relationships
- Combine domain knowledge and data
- Avoid overfitting

## The End

Any Questions?