Intelligent Systems (2II40)


C7

Alexandra I. Cristea


October 2005

Computing posterior probability from the full joint distribution

P(Cavity | Toothache) = ?

Basic rules

Conditional probability:
   P(A | B) = P(A ∧ B) / P(B)   if P(B) ≠ 0

Product rule:
   P(A ∧ B) = P(A | B) P(B)

Bayes' Rule:
   P(A | B) = P(B | A) P(A) / P(B)



Basic rules

P(A | B) = P(A ∧ B) / P(B)

P(Cavity | Toothache) = P(Cavity ∧ Toothache) / P(Toothache)

P(Cavity ∧ Toothache) = ? : read from the table:
   P(Cavity ∧ Toothache) = 0.04

P(Toothache) = ?



Full joint distribution (cont.)

P(Toothache) = Σc P(c, t)      where c ranges over Cavity and t stands for "Toothache = true"

P(Toothache) = 0.04 + 0.01 = 0.05

P(Cavity | Toothache) = P(Cavity ∧ Toothache) / P(Toothache) = 0.04 / 0.05 = 0.8
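
A minimal sketch of this computation in Python. Only the entries 0.04 and 0.01 appear on the slide; the other two joint entries (0.06 and 0.89) are assumed here so that the table sums to 1:

   # Posterior from a full joint distribution over (Cavity, Toothache).
   # Two of the four entries are assumed so the table sums to 1.
   joint = {
       (True,  True):  0.04,   # Cavity, Toothache
       (True,  False): 0.06,
       (False, True):  0.01,
       (False, False): 0.89,
   }

   p_toothache = sum(p for (cavity, toothache), p in joint.items() if toothache)
   p_cavity_and_toothache = joint[(True, True)]

   print(p_toothache)                            # ≈ 0.05
   print(p_cavity_and_toothache / p_toothache)   # ≈ 0.8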


Inference from joint distribution

VI.2. Probabilistic reasoning

A. Conditional independence
B. Bayesian networks: syntax and semantics
C. Exact inference
D. Approximate inference

Independence

Two random variables A, B are (absolutely) independent iff
   P(A,B) = P(A) P(B)

If n Boolean variables are independent, the full joint is
   P(X1,…,Xn) = ∏i P(Xi)

Two random variables A, B given C are conditionally independent iff
   P(A,B | C) = P(A | C) P(B | C)

Conditional independence example

VI.2.B. Belief networks

(Bayesian networks)

Belief network example

Neighbors John and Mary promised to call if the alarm goes off; sometimes it starts because of earthquakes. Is there a burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls (n = 5 variables)

Network topology reflects "causal" knowledge

Belief network example (cont.)

For k parents: O(d^k · n) numbers vs. O(d^n)

Semantics in belief networks

"Global" semantics defines the full joint distribution as the product of the local conditional distributions:

   P(X1,…,Xn) = ∏(i=1..n) P(Xi | Parents(Xi))

e.g., P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) is given by??

   = P(¬B) P(¬E) P(A | ¬B ∧ ¬E) P(J | A) P(M | A)
   = 0.999 × 0.998 × 0.001 × 0.90 × 0.70
   ≈ 0.000628

"Local" semantics: each node is conditionally independent of its nondescendants given its parents

Theorem: Local semantics ⇔ global semantics
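
The product above is easy to check numerically; a small sketch using only the numbers quoted on the slide:

   # Global semantics: P(j, m, a, ~b, ~e) as a product of local conditionals.
   p_not_b         = 0.999
   p_not_e         = 0.998
   p_a_given_nb_ne = 0.001   # P(Alarm | ~Burglary, ~Earthquake)
   p_j_given_a     = 0.90
   p_m_given_a     = 0.70

   p = p_not_b * p_not_e * p_a_given_nb_ne * p_j_given_a * p_m_given_a
   print(round(p, 6))   # ≈ 0.000628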

Markov blanket

Each node is conditionally independent of all others given its Markov blanket:
   parents + children + children's parents

Homework 7

1. In the Burglary Example *) compute the following:
   a) The Markov blanket for node 'Alarm'
   b) The probability that Mary calls, given there is a Burglary: P(M=true | B=true).

2. Continue with Steps 14, 15 of your project. DEADLINE: before Final Project Presentation.

3. Perform Step 16 of your project (Final Presentation).

*) Point 1) of this homework is not obligatory; deadline before project presentation; you can use it to upgrade your homework mark (it cannot lower your average, only increase it).

Constructing belief networks

Locally testable assertions of conditional independence → global semantics

1. Choose an ordering of variables X1,…,Xn
2. For i = 1 to n
      add Xi to the network
      select parents from X1,…,Xi-1 such that
         P(Xi | Parents(Xi)) = P(Xi | X1,…,Xi-1)

This choice of parents guarantees the global semantics:

   P(X1,…,Xn) = ∏(i=1..n) P(Xi | X1,…,Xi-1)      # chain rule
              = ∏(i=1..n) P(Xi | Parents(Xi))    # by construction

Constructing belief networks: example

Example: car diagnosis
(initial evidence, testable vars, diagnosis vars, hidden vars)

Example: car insurance
(predict claim from application form data)

Efficient conditional distributions

The CPT grows exponentially with the number of parents

The CPT becomes infinite with continuous variables

Other, more compact methods are needed

Compact conditional distributions (cont.)

Noisy-OR distributions model multiple non-interacting causes

1. Parents U1 … Uk include all causes (can add a leak node)
2. Independent failure probability qi for each cause alone

   P(X | U1,…,Uj, ¬Uj+1,…,¬Uk) = 1 − ∏(i=1..j) qi

Number of parameters: linear in the number of parents
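
A sketch of the noisy-OR rule; the three causes and their failure probabilities below are hypothetical illustration values, not slide data:

   # Noisy-OR: P(X=true | true causes U1..Uj, the rest false) = 1 - prod(q_i),
   # where q_i is the probability that cause U_i alone fails to produce X.
   def noisy_or(q_active):
       """q_active: failure probabilities of the causes that are present."""
       p_all_fail = 1.0
       for q in q_active:
           p_all_fail *= q
       return 1.0 - p_all_fail

   # Hypothetical causes of a fever with failure probabilities 0.6, 0.2, 0.1.
   q = {'Cold': 0.6, 'Flu': 0.2, 'Malaria': 0.1}
   print(noisy_or([q['Cold'], q['Flu']]))   # 1 - 0.6*0.2 ≈ 0.88
   print(noisy_or(q.values()))              # all three causes present: ≈ 0.988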

Hybrid (discrete + continuous) networks

Discrete (Subsidy? and Buys?); Continuous (Harvest and Cost)

How to deal with this?

Probability density functions (PDF)

Instead of probability distributions, for continuous variables

Ex.: let X denote tomorrow's maximum temperature in the summer in Eindhoven

Belief that X is distributed uniformly between 18 and 26 degrees Celsius:
   P(X = x) = U[18,26](x)
   P(X = 20.5) = U[18,26](20.5) = 0.125/°C
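
A two-line check of this uniform density (1/(26−18) = 0.125 per degree; probabilities of intervals come from integrating the density):

   # Uniform density U[18,26]: constant 1/(26-18) inside the interval, 0 outside.
   def uniform_pdf(x, a=18.0, b=26.0):
       return 1.0 / (b - a) if a <= x <= b else 0.0

   print(uniform_pdf(20.5))              # 0.125 (per degree Celsius)
   # P(20 <= X <= 22) = density * interval length = 0.125 * 2 = 0.25
   print(uniform_pdf(20.5) * (22 - 20))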

Probability density functions (PDF)

Cumulative density functions (CDF)

Continuous Random Variable

Probability distribution (density function) over continuous values

   X ∈ [0,10],   P(x) ≥ 0

   ∫(0..10) P(x) dx = 1

   P(5 ≤ x ≤ 7) = ∫(5..7) P(x) dx

more on PDF at:
http://people.hofstra.edu/faculty/Stefan_Waner/cprob/cprob2.html


Hybrid (discrete + continuous) networks

Discrete (Subsidy? and Buys?); Continuous (Harvest and Cost)

Option 1: discretization
   → possibly large errors, large CPTs

Option 2: finitely parameterized canonical (PDF) families
   a) Continuous variable, discrete + continuous parents (e.g., Cost)
   b) Discrete variable, continuous parents (e.g., Buys?)

a) Continuous child variables

Conditional density functions for the child variable

e.g., linear Gaussian:
   the mean of Cost varies linearly with Harvest
   the variance is fixed
   linear variation is unreasonable over the full range (why?)
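
A sketch of such a linear-Gaussian conditional density; the slope a, intercept b and standard deviation sigma below are made-up parameters, not values from the slide:

   import math

   # Linear-Gaussian conditional density: P(Cost = c | Harvest = h) is a normal
   # density whose mean depends linearly on h and whose variance is fixed.
   def linear_gaussian_pdf(c, h, a=-0.5, b=10.0, sigma=1.0):
       mean = a * h + b   # mean of Cost varies linearly with Harvest
       return math.exp(-0.5 * ((c - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

   print(linear_gaussian_pdf(c=7.5, h=5.0))   # density of Cost=7.5 when Harvest=5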

Continuous child variables (cont.)

All-continuous network with LG distributions
   → the full joint is a multivariate Gaussian

A discrete + continuous LG network is a conditional Gaussian network, i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values

b) Discrete child, continuous parent

   P(Buys | Cost = c) = Φ((−c + μ) / σ)

   with μ the threshold for buying

Probit distribution:
   Φ is the integral of the standard normal distribution:
   Φ(x) = ∫(−∞..x) N(0,1)(t) dt

Logit distribution:
   uses the sigmoid function:
   P(Buys | Cost = c) = 1 / (1 + e^(−2x)),   x = (−c + μ) / σ
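
A sketch comparing the two soft-threshold models for P(Buys | Cost = c); the threshold mu and spread sigma are hypothetical, and math.erf is used to evaluate the standard normal CDF Φ:

   import math

   MU, SIGMA = 6.0, 1.0   # hypothetical buying threshold and spread

   def probit_buys(c):
       # Phi((-c + mu)/sigma), Phi = standard normal CDF
       x = (-c + MU) / SIGMA
       return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

   def logit_buys(c):
       # sigmoid with the -2x scaling used on the slide
       x = (-c + MU) / SIGMA
       return 1.0 / (1.0 + math.exp(-2.0 * x))

   for c in (4.0, 6.0, 8.0):
       print(c, round(probit_buys(c), 3), round(logit_buys(c), 3))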




VI.2. Probabilistic reasoning

A. Conditional independence
B. Bayesian networks: syntax and semantics
C. Exact inference
   i. Exact inference by enumeration
   ii. Exact inference by variable elimination
D. Approximate inference
   i. Approximate inference by stochastic simulation
   ii. Approximate inference by Markov chain Monte Carlo

Exact inference w. enumeration

   P(X | e) = α P(X, e) = α Σy P(X, e, y)

e.g., for the burglary network:

   P(B | j, m) = α P(B, j, m) = α Σe Σa P(B, e, a, j, m)
               = α Σe Σa P(B) P(e) P(a | B, e) P(j | a) P(m | a)

Cost of the naive evaluation: O(n · d^n); for Boolean variables, d^n = 2^n.

Enumeration algorithm

Exhaustive depth-first enumeration: O(n) space, O(d^n) time
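
A compact sketch of enumeration for the burglary query P(Burglary | JohnCalls=true, MaryCalls=true); the CPT numbers are the usual textbook values (the slides quote only some of them), so treat them as illustrative:

   # Inference by enumeration on the Burglary network (topology B, E -> A -> J, M).
   P_B = 0.001
   P_E = 0.002
   P_A = {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001}   # P(Alarm=true | B, E)
   P_J = {True: 0.90, False: 0.05}                       # P(JohnCalls=true | Alarm)
   P_M = {True: 0.70, False: 0.01}                       # P(MaryCalls=true | Alarm)

   def pr(p_true, value):
       return p_true if value else 1.0 - p_true

   def joint(b, e, a, j, m):
       return (pr(P_B, b) * pr(P_E, e) * pr(P_A[(b, e)], a)
               * pr(P_J[a], j) * pr(P_M[a], m))

   def enumerate_burglary(j, m):
       """P(Burglary | JohnCalls=j, MaryCalls=m) by summing out E and A."""
       bools = (True, False)
       unnormalized = [sum(joint(b, e, a, j, m) for e in bools for a in bools)
                       for b in bools]
       alpha = 1.0 / sum(unnormalized)
       return [alpha * p for p in unnormalized]   # [P(b | j,m), P(~b | j,m)]

   print(enumerate_burglary(True, True))   # ≈ [0.284, 0.716]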

Inference by variable elimination

Enumeration is inefficient: repeated computation
   e.g., it computes P(J=true | a) P(M=true | a) for each value of e

Variable elimination: summation from right to left, storing intermediate results (factors) to avoid recomputation

Variable elimination: basic operations

Pointwise product of factors f1 and f2:
   f1(x1,…,xj, y1,…,yk) × f2(y1,…,yk, z1,…,zl) = f(x1,…,xj, y1,…,yk, z1,…,zl)
   e.g., f1(a,b) × f2(b,c) = f(a,b,c)

Summing out a variable from a product of factors:
   move any constant factors outside the summation:
   Σx f1 × … × fk = f1 × … × fi Σx f(i+1) × … × fk = f1 × … × fi × fX'
   assuming f1,…,fi do not depend on X

Example pointwise product

   A  B  f1(A,B)      B  C  f2(B,C)      A  B  C  f3(A,B,C)
   T  T  .3           T  T  .2           T  T  T
   T  F  .7           T  F  .8           T  T  F
   F  T  .9           F  T  .6           T  F  T
   F  F  .1           F  F  .4           T  F  F
                                         F  T  T
                                         F  T  F
                                         F  F  T
                                         F  F  F

Example pointwise product

   A  B  f1(A,B)      B  C  f2(B,C)      A  B  C  f3(A,B,C)
   T  T  .3           T  T  .2           T  T  T  .3 x .2
   T  F  .7           T  F  .8           T  T  F  .3 x .8
   F  T  .9           F  T  .6           T  F  T  .7 x .6
   F  F  .1           F  F  .4           T  F  F  .7 x .4
                                         F  T  T  .9 x .2
                                         F  T  F  .9 x .8
                                         F  F  T  .1 x .6
                                         F  F  F  .1 x .4
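
The two factor operations can be sketched directly on these tables; factors are stored as (variables, table) pairs and Boolean variables are assumed:

   from itertools import product

   # A factor is (variables, table): table maps a tuple of values to a number.
   f1 = (('A', 'B'), {(True, True): .3, (True, False): .7,
                      (False, True): .9, (False, False): .1})
   f2 = (('B', 'C'), {(True, True): .2, (True, False): .8,
                      (False, True): .6, (False, False): .4})

   def pointwise_product(fa, fb):
       vars_a, tab_a = fa
       vars_b, tab_b = fb
       variables = vars_a + tuple(v for v in vars_b if v not in vars_a)
       table = {}
       for values in product((True, False), repeat=len(variables)):
           assignment = dict(zip(variables, values))
           table[values] = (tab_a[tuple(assignment[v] for v in vars_a)]
                            * tab_b[tuple(assignment[v] for v in vars_b)])
       return variables, table

   def sum_out(var, factor):
       variables, table = factor
       keep = tuple(v for v in variables if v != var)
       out = {}
       for values, p in table.items():
           key = tuple(val for v, val in zip(variables, values) if v != var)
           out[key] = out.get(key, 0.0) + p
       return keep, out

   f3 = pointwise_product(f1, f2)
   print(f3[1][(True, True, True)])           # .3 x .2 ≈ 0.06
   print(sum_out('B', f3)[1][(True, True)])   # .06 + .42 ≈ 0.48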

Variable elimination algorithm

Complexity of exact inference

Polytrees (singly connected networks) = networks in which there is at most one undirected path between any two nodes

Time and space complexity of exact inference on polytrees = linear in the size of the network

Multiply connected networks ≠ polytrees
   Variable elimination can have exponential time and space complexity
   Inference in Bayesian networks is NP-hard
   it includes inference in propositional logic as a special case

VI.2.D. Approximate inference

Inference by stochastic simulation

Basic idea:
1. Draw N samples from a sampling distribution S
2. Compute the approximate posterior probability P'
3. Show it converges to the true probability P

VI.2.D. Approximate inference

i.   Sampling from an empty network
ii.  Rejection sampling: reject samples disagreeing with the evidence
iii. Likelihood weighting: use the evidence to weight samples
iv.  MCMC: sample from a stochastic process whose stationary distribution is the true posterior

i. Sampling from an empty network

function PRIOR-SAMPLE(bn) returns an event sampled from the prior specified by bn
   x ← an event with n elements
   for i = 1 to n do
      xi ← a random sample from P(Xi | parents(Xi))
   return x

P(Cloudy) = <0.5, 0.5>
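
A minimal PRIOR-SAMPLE sketch for the Cloudy/Sprinkler/Rain/WetGrass network; the CPT numbers are the usual textbook values (the slides quote only a few of them):

   import random

   # Sprinkler network CPTs (usual textbook values; the slides quote only a few).
   def prior_sample():
       cloudy    = random.random() < 0.5
       sprinkler = random.random() < (0.1 if cloudy else 0.5)
       rain      = random.random() < (0.8 if cloudy else 0.2)
       wet_grass = random.random() < {(True, True): 0.99, (True, False): 0.90,
                                      (False, True): 0.90, (False, False): 0.0}[(sprinkler, rain)]
       return {'Cloudy': cloudy, 'Sprinkler': sprinkler, 'Rain': rain, 'WetGrass': wet_grass}

   print(prior_sample())   # one event sampled from the prior, e.g. {'Cloudy': True, ...}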

i. Sampling from an empty network (cont.)

Probability that PRIOR-SAMPLE generates a particular event:
   S_PS(x1,…,xn) = ∏(i=1..n) P(xi | Parents(Xi)) = P(x1,…,xn)

Let N_PS(Y=y) = number of samples generated for which Y=y, for any set of variables Y.

Then P'(Y=y) = N_PS(Y=y) / N, and

   lim(N→∞) P'(Y=y) = Σh S_PS(Y=y, H=h)
                    = Σh P(Y=y, H=h)
                    = P(Y=y)

→ estimates derived from PRIOR-SAMPLE are consistent

ii. Rejection sampling example

Estimate P(Rain | Sprinkler=true) = ? using 100 samples

27 samples have Sprinkler=true; out of these,
   8 have Rain=true and
   19 have Rain=false.

P'(Rain | Sprinkler=true) = NORMALIZE(<8,19>) = <0.296, 0.704>

Similar to a basic real-world empirical estimation procedure

ii. Rejection sampling

P'(X | e) is estimated from samples agreeing with the evidence e

PROBLEM: a lot of collected samples are thrown away!!
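
A rejection-sampling sketch for the query above (same assumed sprinkler CPTs as in the PRIOR-SAMPLE sketch); it simply discards samples with Sprinkler=false:

   import random

   def prior_sample():
       # Same sprinkler network as in the PRIOR-SAMPLE sketch (textbook CPTs).
       c = random.random() < 0.5
       s = random.random() < (0.1 if c else 0.5)
       r = random.random() < (0.8 if c else 0.2)
       w = random.random() < {(True, True): 0.99, (True, False): 0.9,
                              (False, True): 0.9, (False, False): 0.0}[(s, r)]
       return {'Cloudy': c, 'Sprinkler': s, 'Rain': r, 'WetGrass': w}

   def rejection_sample_rain_given_sprinkler(n=10000):
       counts = {True: 0, False: 0}
       for _ in range(n):
           event = prior_sample()
           if event['Sprinkler']:            # keep only samples agreeing with the evidence
               counts[event['Rain']] += 1
       total = counts[True] + counts[False]
       return counts[True] / total, counts[False] / total

   print(rejection_sample_rain_given_sprinkler())   # ≈ (0.3, 0.7), cf. <0.296, 0.704>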

iii. Likelihood weighting

Idea:
   fix the evidence variables E,
   sample only the nonevidence variables X, Y,
   weight each sample by the likelihood it accords to the evidence E

iii. Likelihood weighting example

Estimate P(Rain | Sprinkler=true, WetGrass=true)

iii. Likelihood weighting example

Sample generation process:
1. w ← 1.0
2. Sample P(Cloudy) = <0.5, 0.5>; say true
3. Sprinkler has value true, so
      w ← w × P(Sprinkler=true | Cloudy=true) = 0.1
4. Sample P(Rain | Cloudy=true) = <0.8, 0.2>; say true
5. WetGrass has value true, so
      w ← w × P(WetGrass=true | Sprinkler=true, Rain=true) = 0.099
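
A sketch of this weighted-sample generation and the resulting estimate; the evidence Sprinkler=true, WetGrass=true is fixed, only Cloudy and Rain are sampled (same assumed textbook CPTs as above):

   import random

   # Likelihood weighting for P(Rain | Sprinkler=true, WetGrass=true).
   P_WET = {(True, True): 0.99, (True, False): 0.90,
            (False, True): 0.90, (False, False): 0.0}   # P(WetGrass=true | S, R)

   def weighted_sample():
       w = 1.0
       cloudy = random.random() < 0.5                      # sample P(Cloudy)
       w *= 0.1 if cloudy else 0.5                         # evidence: Sprinkler = true
       rain = random.random() < (0.8 if cloudy else 0.2)   # sample P(Rain | Cloudy)
       w *= P_WET[(True, rain)]                            # evidence: WetGrass = true
       return rain, w

   def likelihood_weighting(n=10000):
       weights = {True: 0.0, False: 0.0}
       for _ in range(n):
           rain, w = weighted_sample()
           weights[rain] += w
       total = weights[True] + weights[False]
       return weights[True] / total, weights[False] / total

   print(likelihood_weighting())   # estimate of <P(rain | s, w), P(~rain | s, w)>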

iii. Likelihood weighting function

iii. Likelihood weighting analysis

Sampling probability for WEIGHTED-SAMPLE is
   S_WS(y,e) = ∏(i=1..l) P(yi | Parents(Yi))

Note: it pays attention to evidence in ancestors only
   → somewhere "in between" the prior and the posterior distribution

Weight for a given sample y,e is
   w(y,e) = ∏(i=1..m) P(ei | Parents(Ei))

Weighted sampling probability is
   S_WS(y,e) w(y,e) = ∏(i=1..l) P(yi | Parents(Yi)) ∏(i=1..m) P(ei | Parents(Ei)) = P(y,e)
   # by the standard global semantics of the network

Hence, likelihood weighting is consistent

But performance still degrades with many evidence variables

iv. MCMC inference

"State" of the network = current assignment to all variables

Generate the next state by sampling one variable given its Markov blanket

Sample each variable in turn, keeping the evidence fixed

Approaches the stationary distribution: the long-run fraction of time spent in each state is exactly proportional to its posterior probability

Markov blanket (reminder)

Each node is conditionally independent of all others given its Markov blanket:
   parents + children + children's parents

MCMC algorithm
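
The algorithm slide itself is a figure; a minimal Gibbs-style sketch of the idea for P(Rain | Sprinkler=true, WetGrass=true) on the sprinkler network (same assumed textbook CPTs): the evidence stays fixed and each nonevidence variable is resampled from its Markov-blanket distribution.

   import random

   # Gibbs sampling for P(Rain | Sprinkler=true, WetGrass=true).
   P_S = {True: 0.1, False: 0.5}                  # P(Sprinkler=true | Cloudy)
   P_R = {True: 0.8, False: 0.2}                  # P(Rain=true | Cloudy)
   P_W = {(True, True): 0.99, (True, False): 0.90,
          (False, True): 0.90, (False, False): 0.0}   # P(WetGrass=true | S, R)

   def sample_bernoulli(p_true):
       return random.random() < p_true

   def mcmc_rain(n=20000):
       cloudy, rain = True, True   # arbitrary initial state; evidence S = W = true
       rain_count = 0
       for _ in range(n):
           # Resample Cloudy given its Markov blanket {Sprinkler=true, Rain=rain}.
           pt = 0.5 * P_S[True] * (P_R[True] if rain else 1 - P_R[True])
           pf = 0.5 * P_S[False] * (P_R[False] if rain else 1 - P_R[False])
           cloudy = sample_bernoulli(pt / (pt + pf))
           # Resample Rain given its Markov blanket {Cloudy, Sprinkler=true, WetGrass=true}.
           pt = P_R[cloudy] * P_W[(True, True)]
           pf = (1 - P_R[cloudy]) * P_W[(True, False)]
           rain = sample_bernoulli(pt / (pt + pf))
           rain_count += rain
       return rain_count / n

   print(mcmc_rain())   # long-run fraction of states with Rain=true ≈ P(Rain | s, w)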

Conclusion on Uncertainty

We discussed:

Decision theory basics: Uncertainty, Probability, Syntax, Semantics, Inference Rules

Probabilistic reasoning: Conditional independence, Bayesian networks, Exact inference, Approximate inference

Just as in reasoning without uncertainty, prior knowledge can be used to diminish the state space (see inference with the Joint Distribution vs. Bayesian networks)

If probabilities are unknown, or if computation has to be reduced, they can be estimated with Approximate Inference

Questions?


Information Final Project Presentations

Info:
http://www.win.tue.nl/~acristea/IS/InformationProjectPresentation.txt

Appointments:
http://www.win.tue.nl/~acristea/IS/FinalPresentationAppointments.txt


Information Exam

Subject: course material (up to and including constructing belief networks) & homeworks

Samples (& some Solutions) at:
http://wwwis.win.tue.nl/~acristea/HTML/IS/CoursePowerpoints&Demos/OldExams/

Check the Sample Homework Solutions, compare with your own!

Place & time on the course site (but ALWAYS check http://owinfo.tue.nl/ as well)

Closed book (simple calculators allowed; NOT computers, phones, PDAs, communication devices, etc.!!)

No communication

Any Questions?

Good luck!