Bayesian Networks

Introduction

A problem domain is modeled by a list of variables X_1, …, X_n

Knowledge about the problem domain is represented by a joint probability P(X_1, …, X_n)

Introduction

Example: Alarm

The story: In LA, burglaries and earthquakes are not uncommon, and either one can set off the alarm. When the alarm goes off, two neighbors, John and Mary, may call.

Problem: Estimate the probability of a burglary based on who has or has not called

Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M)

Knowledge required to solve the problem: P(B, E, A, J, M)

Introduction

What is the probability of burglary given that Mary called, P(B = y | M = y)?

Compute the marginal probability:

P(B, M) = Σ_{E, A, J} P(B, E, A, J, M)

Use the definition of conditional probability

Answer: P(B = y | M = y) = P(B = y, M = y) / P(M = y)
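As a quick illustration of the computation above, here is a minimal Python sketch that marginalizes a full joint distribution to obtain P(B = y | M = y). The function joint(b, e, a, j, m) is a hypothetical placeholder for the full joint table, which is not given in these slides.

from itertools import product

def query_burglary_given_mary(joint):
    """Compute P(B=y | M=y) by summing over the full joint P(B, E, A, J, M).

    `joint` is assumed to be a function joint(b, e, a, j, m) -> probability,
    with each argument in {"y", "n"}; it stands in for the 31 numbers that
    would otherwise have to be listed explicitly.
    """
    vals = ("y", "n")
    # P(B=y, M=y): sum out the unobserved variables E, A, J
    num = sum(joint("y", e, a, j, "y") for e, a, j in product(vals, repeat=3))
    # P(M=y): sum out B, E, A, J
    den = sum(joint(b, e, a, j, "y") for b, e, a, j in product(vals, repeat=4))
    return num / den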

Introduction

Difficulty: complexity in model construction and inference

In the Alarm example:

31 numbers are needed

Computing P(B = y | M = y) takes 29 additions

In general

P(X_1, …, X_n) needs at least 2^n − 1 numbers to specify the joint probability

Exponential storage and inference

Conditional Independence

Overcome the problem of exponential size by exploiting conditional independence

The chain rule of probabilities:

P(X_1, …, X_n) = P(X_1) P(X_2 | X_1) ⋯ P(X_n | X_1, …, X_{n−1})

Conditional Independence

Conditional independence in the problem domain:

The domain usually allows us to identify a subset pa(X_i) ⊆ {X_1, …, X_{i−1}} such that, given pa(X_i), X_i is independent of all variables in {X_1, …, X_{i−1}} \ pa(X_i), i.e.

P(X_i | X_1, …, X_{i−1}) = P(X_i | pa(X_i))

Then

P(X_1, …, X_n) = ∏_i P(X_i | pa(X_i))
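To make the step from the chain rule to the factored form explicit, here is the derivation written out in LaTeX; it simply combines the chain rule with the conditional independence assumption above.

\begin{align*}
P(X_1,\dots,X_n) &= \prod_{i=1}^{n} P(X_i \mid X_1,\dots,X_{i-1}) && \text{(chain rule)} \\
                 &= \prod_{i=1}^{n} P(X_i \mid pa(X_i)) && \text{(conditional independence)}
\end{align*}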



Conditional Independence

As a result, the joint probability P(X_1, …, X_n) can be represented by the product of the conditional probabilities P(X_i | pa(X_i))

Example continued:

P(B, E, A, J, M)

= P(B) P(E | B) P(A | B, E) P(J | A, B, E) P(M | B, E, A, J)

= P(B) P(E) P(A | B, E) P(J | A) P(M | A)

pa(B) = {}, pa(E) = {}, pa(A) = {B, E}, pa(J) = {A}, pa(M) = {A}

The conditional probability tables specify: P(B), P(E), P(A | B, E), P(J | A), P(M | A)
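A small sketch of what this buys us: with the parent sets above, only 1 + 1 + 4 + 2 + 2 = 10 conditional probabilities need to be specified instead of 31. The Python snippet below evaluates the factored joint; the numeric CPT values are illustrative placeholders (commonly used textbook values, not given in these slides).

# Illustrative CPTs (placeholder values, not from these slides).
P_B = {"y": 0.001, "n": 0.999}                 # P(B)
P_E = {"y": 0.002, "n": 0.998}                 # P(E)
P_A = {("y", "y"): 0.95, ("y", "n"): 0.94,     # P(A=y | B, E)
       ("n", "y"): 0.29, ("n", "n"): 0.001}
P_J = {"y": 0.90, "n": 0.05}                   # P(J=y | A)
P_M = {"y": 0.70, "n": 0.01}                   # P(M=y | A)

def joint(b, e, a, j, m):
    """P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)."""
    return (P_B[b] * P_E[e]
            * (P_A[(b, e)] if a == "y" else 1.0 - P_A[(b, e)])
            * (P_J[a] if j == "y" else 1.0 - P_J[a])
            * (P_M[a] if m == "y" else 1.0 - P_M[a]))

This joint function can be plugged directly into the marginalization sketch shown earlier to compute P(B = y | M = y).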

Conditional Independence

As a result:

Model size reduced

Model construction easier

Inference easier

Graphical Representation

To graphically represent the conditional independence relationships, construct a directed graph by drawing an arc from X_j to X_i iff X_j ∈ pa(X_i)

pa(B) = {}, pa(E) = {}, pa(A) = {B, E}, pa(J) = {A}, pa(M) = {A}

[Figure: the resulting DAG over B, E, A, J, M, with arcs B → A, E → A, A → J, A → M]


Graphical Representation

We also attach the conditional probability table P(X_i | pa(X_i)) to node X_i

The result: a Bayesian network

[Figure: the same DAG with CPTs attached: P(B) at B, P(E) at E, P(A | B, E) at A, P(J | A) at J, P(M | A) at M]

Formal Definition

A Bayesian network is:

A directed acyclic graph (DAG), where

each node represents a random variable

and is associated with the conditional probability of the node given its parents

Intuition

A BN can be understood as a DAG where arcs represent direct probabilistic dependence

Absence of an arc indicates probabilistic independence: a variable is conditionally independent of all its nondescendants given its parents

From the graph: B ⊥ E,  J ⊥ B | A,  J ⊥ E | A

[Figure: the Alarm network DAG over B, E, A, J, M]

Construction

Procedure for constructing a BN:

Choose a set of variables describing the application domain

Choose an ordering of the variables

Start with an empty network and add variables to the network one by one according to the ordering

Construction

To add the i-th variable X_i:

Determine the parent set pa(X_i) from among the variables already in the network (X_1, …, X_{i−1}) such that

P(X_i | X_1, …, X_{i−1}) = P(X_i | pa(X_i))

(domain knowledge is needed here)

Draw an arc from each variable in pa(X_i) to X_i

Example

Order: B, E, A, J, M

pa(B) = pa(E) = {}, pa(A) = {B, E}, pa(J) = {A}, pa(M) = {A}

Order: M, J, A, B, E

pa(M) = {}, pa(J) = {M}, pa(A) = {M, J}, pa(B) = {A}, pa(E) = {A, B}

Order: M, J, E, B, A

Fully connected graph

[Figure: the three networks over B, E, A, J, M produced by these orderings]

Construction

Which variable order?

Naturalness of probability assessment:

M, J, E, B, A is bad because P(B | J, M, E) is not natural to assess

Minimize the number of arcs:

M, J, E, B, A is bad (too many arcs); the first ordering is good

Use causal relationships: causes come before their effects:

M, J, E, B, A is bad because M and J are effects of A but come before A

[Figure: the network from ordering B, E, A, J, M vs. the network from ordering M, J, E, B, A]

Causal Bayesian Networks

A causal Bayesian network, or simply a causal network, is a Bayesian network whose arcs are interpreted as indicating cause-effect relationships

To build a causal network:

Choose a set of variables that describes the domain

Draw an arc to each variable from each of its direct causes (domain knowledge required)


Example

[Figure: a causal network with nodes Visit Africa, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Tuberculosis or Lung Cancer, X-Ray, Dyspnea]

Causal BN

Causality is not a well understood concept:

No widely accepted definition

No consensus on whether it is a property of the world or a concept in our minds

Sometimes causal relations are obvious:

An alarm causes people to leave the building

Lung cancer causes a mass on a chest X-ray

At other times they are not that clear:

Doctors believe smoking (S) causes lung cancer (C), but the tobacco industry has a different story:

[Figure: Surgeon General (1964): S → C; Tobacco Industry: S and C both caused by a hidden factor *]

Inference

Posterior queries to a BN:

We have observed the values of some variables

What are the posterior probability distributions of the other variables?

Example: Both John and Mary reported the alarm

What is the probability of a burglary, P(B | J = y, M = y)?

Inference

General form of a query: P(Q | E = e) = ?

Q is a list of query variables

E is a list of evidence variables

e denotes the observed values

Inference Types

Diagnostic inference: P(B | M = y)

Predictive/causal inference: P(M | B = y)

Intercausal inference (between causes of a common effect): P(B | A = y, E = y)

Mixed inference (combining two or more of the above): P(A | J = y, E = y) (diagnostic and causal)

All these types are handled in the same way

Naïve Inference

Naïve algorithm for solving P(Q | E = e) in a BN:

Get the joint probability distribution over all variables by multiplying the conditional probabilities,

P(X_1, …, X_n) = ∏_i P(X_i | pa(X_i)),

then sum out the variables that are neither query nor evidence variables

The BN structure is not used; for many variables the algorithm is not practical

In general, exact inference is NP-hard
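A minimal Python sketch of this naïve enumeration, assuming the network is given as a dict mapping each variable to its parent list together with a CPT lookup function; these names (parents, cpt, and so on) are illustrative, not from the slides.

from itertools import product

def naive_query(variables, parents, cpt, query, evidence, domain=("y", "n")):
    """Naïve inference: build the full joint by multiplying CPTs, then marginalize.

    variables: list of variable names
    parents:   dict  var -> list of parent variable names
    cpt:       function cpt(var, value, parent_values_dict) -> probability
    query:     dict  var -> value  (Q = q)
    evidence:  dict  var -> value  (E = e)
    Returns P(Q = q | E = e).
    """
    def joint(assign):
        prob = 1.0
        for v in variables:
            pa = {p: assign[p] for p in parents[v]}
            prob *= cpt(v, assign[v], pa)
        return prob

    num = den = 0.0
    for values in product(domain, repeat=len(variables)):
        assign = dict(zip(variables, values))
        if all(assign[v] == val for v, val in evidence.items()):
            p = joint(assign)
            den += p
            if all(assign[v] == val for v, val in query.items()):
                num += p
    return num / den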


Inference

Though exact inference is NP-hard in general, in some cases the problem is tractable; e.g., if the BN has a (poly)tree structure, an efficient algorithm exists

(a polytree is a directed acyclic graph in which no two nodes have more than one undirected path between them)

Another practical approach: stochastic simulation

A general sampling algorithm

Process the variables in an order in which parents come before their children. For i = 1 to n:

1. Find the parents of X_i, say X_{p(i,1)}, …, X_{p(i,n_i)} (n_i is the number of parents)

2. Recall the values that those parents were randomly given

3. Look up the table for P(X_i | X_{p(i,1)} = x_{p(i,1)}, …, X_{p(i,n_i)} = x_{p(i,n_i)})

4. Randomly set x_i according to this probability
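A minimal Python sketch of this sampling loop, assuming the same hypothetical parents/cpt representation used in the enumeration sketch above, and that the variable list is already ordered so that parents precede their children.

import random

def sample_once(variables, parents, cpt, domain=("y", "n")):
    """Draw one complete sample from the BN by forward (prior) sampling.

    `variables` must list parents before their children; `cpt(var, value, pa)`
    returns P(var = value | pa) for a dict `pa` of parent values.
    """
    assign = {}
    for v in variables:
        pa = {p: assign[p] for p in parents[v]}   # step 2: parents already sampled
        r, acc = random.random(), 0.0
        assign[v] = domain[-1]                    # fallback against rounding error
        for value in domain:
            acc += cpt(v, value, pa)              # step 3: look up P(X_i = value | pa)
            if r <= acc:
                assign[v] = value                 # step 4: set x_i according to this probability
                break
    return assign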

Stochastic Simulation

We want to know P(Q = q | E = e)

Do a lot of random samplings and count:

N_c: number of samples in which E = e

N_s: number of samples in which Q = q and E = e

N: total number of random samples

If N is big enough:

N_c / N is a good estimate of P(E = e)

N_s / N is a good estimate of P(Q = q, E = e)

N_s / N_c is then a good estimate of P(Q = q | E = e)
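Continuing the sketch above, the counting step might look like this (again using the hypothetical sample_once helper and dict-based query/evidence):

def estimate_query(variables, parents, cpt, query, evidence, n_samples=100_000):
    """Estimate P(Q = q | E = e) as N_s / N_c over forward samples."""
    n_c = n_s = 0
    for _ in range(n_samples):
        sample = sample_once(variables, parents, cpt)
        if all(sample[v] == val for v, val in evidence.items()):
            n_c += 1                                   # sample consistent with E = e
            if all(sample[v] == val for v, val in query.items()):
                n_s += 1                               # also consistent with Q = q
    return n_s / n_c if n_c else float("nan")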

Parameter Learning

Example:

Given a BN structure

and a dataset,

estimate the conditional probabilities P(X_i | pa(X_i))

[Figure: a Bayesian network over X_1, …, X_5]

X_1  X_2  X_3  X_4  X_5
 0    0    1    1    0
 1    0    0    1    0
 0    ?    0    0    ?

? means a missing value

Parameter Learning

We consider the case of complete data

Use the maximum likelihood (ML) method and Bayesian estimation

Modes of learning:

Sequential learning

Batch learning

Bayesian estimation is suitable for both sequential and batch learning

ML is suitable only for batch learning

ML in BN with Complete Data

n variables X_1, …, X_n

Number of states of X_i: r_i = |X_i|

Number of configurations of the parents of X_i: q_i = |pa(X_i)|

Parameters to be estimated:

θ_ijk = P(X_i = j | pa(X_i) = k),  i = 1, …, n;  j = 1, …, r_i;  k = 1, …, q_i

ML in BN with Complete Data

Example: consider the BN below. Assume all variables are binary, taking values 1 and 2.

θ_ijk = P(X_i = j | pa(X_i) = k)

[Figure: a network with arcs X_1 → X_3 and X_2 → X_3; the number of parent configurations of X_3 is 4]

ML in BN with Complete Data

A complete case: D_l is a vector of values, one for each variable (all values are known)

Example: D_l = (X_1 = 1, X_2 = 2, X_3 = 2)

Given: a set of complete cases D = {D_1, …, D_m}

Find: the ML estimate of the parameters θ


ML in BN with Complete Data

Log-likelihood:

l(θ | D) = log L(θ | D) = log P(D | θ) = log ∏_l P(D_l | θ) = Σ_l log P(D_l | θ)

The term log P(D_l | θ), e.g. for D_4 = (1, 2, 2):

log P(D_4 | θ) = log P(X_1 = 1, X_2 = 2, X_3 = 2 | θ)

= log P(X_1 = 1 | θ) P(X_2 = 2 | θ) P(X_3 = 2 | X_1 = 1, X_2 = 2, θ)

= log θ_111 + log θ_221 + log θ_322

Recall: θ = {θ_111, θ_121, θ_211, θ_221, θ_311, θ_312, θ_313, θ_314, θ_321, θ_322, θ_323, θ_324}

[Figure: the network X_1 → X_3 ← X_2]

ML in BN with Complete Data

Define the characteristic function of D_l:

χ(i, j, k : D_l) = 1 if X_i = j and pa(X_i) = k in D_l, and 0 otherwise

When l = 4, D_4 = (1, 2, 2):

χ(1, 1, 1 : D_4) = χ(2, 2, 1 : D_4) = χ(3, 2, 2 : D_4) = 1,

χ(i, j, k : D_4) = 0 for all other i, j, k

So, log P(D_4 | θ) = Σ_{ijk} χ(i, j, k : D_4) log θ_ijk

In general,

log P(D_l | θ) = Σ_{ijk} χ(i, j, k : D_l) log θ_ijk

[Figure: the network X_1 → X_3 ← X_2]

ML in BN with Complete Data

Define: m_ijk = Σ_l χ(i, j, k : D_l),

the number of data cases in which X_i = j and pa(X_i) = k

Then l(θ | D) = Σ_l log P(D_l | θ)

= Σ_l Σ_{i,j,k} χ(i, j, k : D_l) log θ_ijk

= Σ_{i,j,k} Σ_l χ(i, j, k : D_l) log θ_ijk

= Σ_{i,j,k} m_ijk log θ_ijk = Σ_{i,k} Σ_j m_ijk log θ_ijk

ML in BN with Complete Data

We want to find:

argmax_θ l(θ | D) = argmax_θ Σ_{i,k} Σ_j m_ijk log θ_ijk

Assume that θ_ijk = P(X_i = j | pa(X_i) = k) is not related to θ_i'j'k' provided that i ≠ i' or k ≠ k'

Consequently we can maximize each term in the summation over i, k separately:

argmax_{θ_ijk} Σ_j m_ijk log θ_ijk
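The per-term maximization is a standard constrained optimization (each row of a CPT must sum to one). A short derivation with a Lagrange multiplier, written here in LaTeX, shows where the counting formula on the next slide comes from.

% Maximize  sum_j m_ijk log(theta_ijk)  subject to  sum_j theta_ijk = 1
\begin{align*}
\mathcal{L} &= \sum_{j} m_{ijk} \log \theta_{ijk}
              + \lambda \Bigl(1 - \sum_{j} \theta_{ijk}\Bigr) \\
\frac{\partial \mathcal{L}}{\partial \theta_{ijk}} &=
  \frac{m_{ijk}}{\theta_{ijk}} - \lambda = 0
  \;\Longrightarrow\; \theta_{ijk} = \frac{m_{ijk}}{\lambda} \\
\sum_{j} \theta_{ijk} = 1
  &\;\Longrightarrow\; \lambda = \sum_{j'} m_{ij'k}
  \;\Longrightarrow\;
  \theta^{*}_{ijk} = \frac{m_{ijk}}{\sum_{j'} m_{ij'k}}
\end{align*}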

ML in BN with Complete Data

As a result we have:

θ*_ijk = m_ijk / Σ_j' m_ij'k

In words, the ML estimate for θ_ijk = P(X_i = j | pa(X_i) = k) is

(number of cases where X_i = j and pa(X_i) = k) / (number of cases where pa(X_i) = k)
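A compact Python sketch of this counting estimate, assuming the data is a list of dicts mapping variable names to values and the graph is given by a hypothetical parents dict (these names are illustrative, not from the slides):

from collections import defaultdict

def ml_estimate(data, parents):
    """ML estimate of P(X_i = j | pa(X_i) = k) by counting complete cases.

    data:    list of dicts, each mapping variable name -> observed value
    parents: dict mapping variable name -> list of parent variable names
    Returns a dict: (variable, value, parent_configuration) -> estimate.
    """
    m = defaultdict(int)        # m[(i, j, k)]: cases with X_i = j and pa(X_i) = k
    n = defaultdict(int)        # n[(i, k)]:    cases with pa(X_i) = k
    for case in data:
        for var, pa_list in parents.items():
            k = tuple(case[p] for p in pa_list)    # parent configuration
            m[(var, case[var], k)] += 1
            n[(var, k)] += 1
    return {(var, val, k): count / n[(var, k)]
            for (var, val, k), count in m.items()}

For example, with parents = {"X1": [], "X2": [], "X3": ["X1", "X2"]} and a list of complete cases, ml_estimate returns estimates for the θ_ijk entries that actually occur in the data.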

More to do with BN

Learning parameters when some values are missing

Learning the structure of a BN from training data

Many more…

References

Pearl, Judea. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.

Heckerman, David. "A Tutorial on Learning with Bayesian Networks." Technical Report MSR-TR-95-06, Microsoft Research, 1995.

www.ai.mit.edu/~murphyk/Software

http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

R. G. Cowell, A. P. Dawid, S. L. Lauritzen and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer-Verlag, 1999.

http://www.ets.org/research/conferences/almond2004.html#software