
QUIZ!!

T/F: You can always (theoretically) do BN inference by enumeration.  TRUE

T/F: In VE, always first marginalize, then join.  FALSE

T/F: VE is a lot faster than enumeration, but not always exact.  FALSE

T/F: The running time of VE is independent of the order you pick the r.v.s.  FALSE

T/F: The more evidence, the faster VE runs.  TRUE

P(X|Y) sums to ... |Y|
P(x|Y) sums to ... ??
P(X|y) sums to ... 1
P(x|y) sums to ... P(x|y)
P(x,Y) sums to ... P(x)
P(X,Y) sums to ... 1
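A tiny numeric check of the "sums to" answers, using a made-up 2x2 joint P(X, Y); a Python sketch, not from the slides:

```python
# Made-up joint distribution P(X, Y) over X in {x0, x1}, Y in {y0, y1}.
P = {('x0', 'y0'): 0.1, ('x0', 'y1'): 0.2,
     ('x1', 'y0'): 0.3, ('x1', 'y1'): 0.4}
X, Y = ('x0', 'x1'), ('y0', 'y1')

P_Y = {y: sum(P[(x, y)] for x in X) for y in Y}              # P(Y)
P_X_given_Y = {(x, y): P[(x, y)] / P_Y[y] for (x, y) in P}   # P(X | Y)

print(sum(P_X_given_Y.values()))                # P(X|Y) summed over x and y -> |Y| = 2
print(sum(P_X_given_Y[(x, 'y0')] for x in X))   # P(X|y) summed over x -> 1
print(sum(P[('x0', y)] for y in Y))             # P(x,Y) summed over y -> P(x0) = 0.3
print(sum(P.values()))                          # P(X,Y) summed over everything -> 1
```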

CSE 511a: Artificial Intelligence
Spring 2013

Lecture 17: Bayes Nets IV
Approximate Inference (Sampling)

04/01/2013
Robert Pless
(via Kilian Weinberger, via Dan Klein, UC Berkeley)

Announcements

Project 4 out soon.
Project 3 due at midnight.

Exact Inference

Variable Elimination


General Variable Elimination

Query: P(Q | E_1 = e_1, ..., E_k = e_k)

Start with initial factors:
  Local CPTs (but instantiated by evidence)

While there are still hidden variables (not Q or evidence):
  Pick a hidden variable H
  Join all factors mentioning H
  Eliminate (sum out) H

Join all remaining factors and normalize
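A minimal sketch of this loop in Python (not from the slides). A factor is a (variables, table) pair, where the table maps each assignment tuple to a number; `join`, `sum_out`, and `variable_elimination` are illustrative helper names, and the factors passed in are assumed to already be instantiated by the evidence.

```python
from itertools import product

# A factor is (vars, table): vars is a tuple of variable names, and table maps
# each assignment (a tuple of values, one per variable, in that order) to a number.

def join(f1, f2, domains):
    """Pointwise product of two factors over the union of their variables."""
    (v1, t1), (v2, t2) = f1, f2
    out_vars = tuple(dict.fromkeys(v1 + v2))        # union of variables, order-preserving
    table = {}
    for assign in product(*(domains[v] for v in out_vars)):
        a = dict(zip(out_vars, assign))
        table[assign] = (t1[tuple(a[v] for v in v1)] *
                         t2[tuple(a[v] for v in v2)])
    return out_vars, table

def sum_out(var, factor):
    """Eliminate var from a factor by summing it out."""
    vars_, table = factor
    keep = tuple(v for v in vars_ if v != var)
    out = {}
    for assign, p in table.items():
        key = tuple(val for v, val in zip(vars_, assign) if v != var)
        out[key] = out.get(key, 0.0) + p
    return keep, out

def variable_elimination(factors, hidden, domains):
    """For each hidden variable: join every factor that mentions it, then sum it out.
    Finally join whatever is left and normalize.
    Assumes each hidden variable appears in at least one factor."""
    factors = list(factors)
    for h in hidden:
        related = [f for f in factors if h in f[0]]
        rest = [f for f in factors if h not in f[0]]
        joined = related[0]
        for f in related[1:]:
            joined = join(joined, f, domains)
        factors = rest + [sum_out(h, joined)]
    result = factors[0]
    for f in factors[1:]:
        result = join(result, f, domains)
    vars_, table = result
    z = sum(table.values())
    return vars_, {k: v / z for k, v in table.items()}
```

With the CPTs of the example below entered as factors (restricted to the evidence pass = T), eliminating study, fair, and prepared in that order should reproduce the P(at | +pa) ≈ 0.89 worked out on the following slides.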


Example: Variable Elimination

Query: What is the probability that a student attends class, given that they pass the exam?

[Based on slides from UMBC CMSC 671, 2005]

Network: attend, study → prepared;  attend, prepared, fair → pass

Priors:  P(at) = 0.8,  P(st) = 0.6,  P(fa) = 0.9

P(pr | at, st):

  at  st | P(pr|at,st)
  T   T  | 0.9
  T   F  | 0.5
  F   T  | 0.7
  F   F  | 0.1

P(pa | at, pre, fa):

  pr  at  fa | P(pa|at,pre,fa)
  T   T   T  | 0.9
  T   T   F  | 0.1
  T   F   T  | 0.7
  T   F   F  | 0.1
  F   T   T  | 0.7
  F   T   F  | 0.1
  F   F   T  | 0.2
  F   F   F  | 0.1

Join "study" factors

Join P(pr | at, st) with P(st):

  prep  study  attend | P(pr|at,st)  P(st) | joint P(pr,st|at) | marginal P(pr|at)
  T     T      T      | 0.9          0.6   | 0.54              | 0.74
  T     F      T      | 0.5          0.4   | 0.2               |
  T     T      F      | 0.7          0.6   | 0.42              | 0.46
  T     F      F      | 0.1          0.4   | 0.04              |
  F     T      T      | 0.1          0.6   | 0.06              | 0.26
  F     F      T      | 0.5          0.4   | 0.2               |
  F     T      F      | 0.3          0.6   | 0.18              | 0.54
  F     F      F      | 0.9          0.4   | 0.36              |

(The priors P(at) = 0.8, P(fa) = 0.9 and the table P(pa | at, pre, fa) are unchanged.)

Marginalize out "study"

Summing study out of the joined factor P(pr, st | at) collapses the joint column above into the marginal column P(pr | at).

Remove "study"

The joined factor is replaced by the new, smaller factor P(pr | at):

  pr  at | P(pr|at)
  T   T  | 0.74
  T   F  | 0.46
  F   T  | 0.26
  F   F  | 0.54

(Remaining factors: P(at) = 0.8, P(fa) = 0.9, P(pr | at), and P(pa | at, pre, fa).)

Join factors "fair"

Join P(pa | at, pre, fa) with P(fa):

  pa  pre  attend  fair | P(pa|at,pre,fa)  P(fair) | joint P(pa,fa|at,pre) | marginal P(pa|at,pre)
  t   T    T       T    | 0.9              0.9     | 0.81                  | 0.82
  t   T    T       F    | 0.1              0.1     | 0.01                  |
  t   T    F       T    | 0.7              0.9     | 0.63                  | 0.64
  t   T    F       F    | 0.1              0.1     | 0.01                  |
  t   F    T       T    | 0.7              0.9     | 0.63                  | 0.64
  t   F    T       F    | 0.1              0.1     | 0.01                  |
  t   F    F       T    | 0.2              0.9     | 0.18                  | 0.19
  t   F    F       F    | 0.1              0.1     | 0.01                  |

(Only the pa = t rows are shown. The factors P(at) = 0.8 and P(pr | at) are unchanged.)

Marginalize out "fair"

Summing fair out of the joined factor P(pa, fa | at, pre) collapses the joint column above into the marginal column P(pa | at, pre).

The result is the new factor P(pa | at, pre):

  pa  pre  attend | P(pa|at,pre)
  t   T    T      | 0.82
  t   T    F      | 0.64
  t   F    T      | 0.64
  t   F    F      | 0.19

(Remaining factors: P(at) = 0.8, P(pr | at), and P(pa | at, pre).)

Join factors "prepared"

Join P(pa | at, pre) with P(pr | at):

  pa  pre  attend | P(pa|at,pr)  P(pr|at) | joint P(pa,pr|at) | marginal P(pa|at)
  t   T    T      | 0.82         0.74     | 0.6068            | 0.7732
  t   T    F      | 0.64         0.46     | 0.2944            | 0.397
  t   F    T      | 0.64         0.26     | 0.1664            |
  t   F    F      | 0.19         0.54     | 0.1026            |

(Only the pa = t rows are shown. The prior P(at) = 0.8 is unchanged.)

Marginalizing out prepared collapses the joint column above into the marginal column P(pa | at).

The result is the new factor P(pa | at):

  pa  attend | P(pa|at)
  t   T      | 0.7732
  t   F      | 0.397

(Remaining factors: P(at) = 0.8 and P(pa | at).)

Join factors "attend", "pass"

Join P(pa | at) with P(at), then normalize to answer the query:

  pa  attend | P(pa|at)  P(at) | joint P(pa,at) | normalized P(at|pa)
  T   T      | 0.7732    0.8   | 0.61856        | 0.89
  T   F      | 0.397     0.2   | 0.0794         | 0.11

So P(attend = T | pass = T) ≈ 0.89.
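As a quick arithmetic check of this last join-and-normalize step (all values taken from the tables above), a small Python snippet:

```python
# Final factors from the elimination above.
p_pa_given_at = {True: 0.7732, False: 0.397}    # P(+pa | at)
p_at = {True: 0.8, False: 0.2}                  # P(at)

# Join: P(+pa, at) = P(+pa | at) * P(at)
joint = {at: p_pa_given_at[at] * p_at[at] for at in (True, False)}
# -> {True: 0.61856, False: 0.0794}

# Normalize to get the posterior P(at | +pa)
z = sum(joint.values())
posterior = {at: p / z for at, p in joint.items()}
print(posterior)   # {True: 0.886..., False: 0.113...}, i.e. roughly 0.89 / 0.11
```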

Approximate Inference

Sampling (particle-based methods)

Sampling: the basics

Scrooge McDuck gives you an ancient coin.
He wants to know what P(H) is.
You have no homework, and nothing good is on television...
...so you toss it 1 million times.
You obtain 700,000 heads and 300,000 tails.
What is P(H)?

Sampling: the basics

Exactly: P(H) = 0.7.
Why?

Monte Carlo Method

Who is more likely to win: green or purple?
What is the probability that green wins, P(G)?

Two ways to solve this:
1. Compute the exact probability.
2. Play 100,000 games and see how many times green wins.

Approximate Inference

Simulation has a name: sampling.

Sampling is a hot topic in machine learning, and it's really simple.

Basic idea:
  Draw N samples from a sampling distribution S
  Compute an approximate posterior probability
  Show this converges to the true probability P

Why sample?
  Learning: get samples from a distribution you don't know
  Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination)

Forward Sampling

Network: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler, Rain → WetGrass

P(C):
  +c  0.5
  -c  0.5

P(S | C):
  +c:  +s 0.1,  -s 0.9
  -c:  +s 0.5,  -s 0.5

P(R | C):
  +c:  +r 0.8,  -r 0.2
  -c:  +r 0.2,  -r 0.8

P(W | S, R):
  +s, +r:  +w 0.99,  -w 0.01
  +s, -r:  +w 0.90,  -w 0.10
  -s, +r:  +w 0.90,  -w 0.10
  -s, -r:  +w 0.01,  -w 0.99

Samples:
  +c, -s, +r, +w
  -c, +s, -r, +w
  ...

[Excel Demo]

Forward Sampling

This process generates samples with probability

  S_PS(x_1, ..., x_n) = ∏_i P(x_i | Parents(X_i))

...i.e. the BN's joint probability P(x_1, ..., x_n).

Let the number of samples of an event be N_PS(x_1, ..., x_n). Then

  lim_(N→∞) N_PS(x_1, ..., x_n) / N  =  P(x_1, ..., x_n)

I.e., the sampling procedure is consistent.
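A minimal prior-sampling sketch in Python for this network, with the CPT values copied from the tables above (the function and variable names are illustrative):

```python
import random

P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}    # P(+w | s, r)

def sample_once():
    """Sample (C, S, R, W) in topological order; True stands for +, False for -."""
    c = random.random() < 0.5                    # P(+c) = 0.5
    s = random.random() < (0.1 if c else 0.5)    # P(+s | c)
    r = random.random() < (0.8 if c else 0.2)    # P(+r | c)
    w = random.random() < P_W[(s, r)]            # P(+w | s, r)
    return c, s, r, w

# The fraction of samples showing any event converges to its probability,
# e.g. estimate P(+w) from N samples:
N = 100_000
print(sum(w for _, _, _, w in (sample_once() for _ in range(N))) / N)
```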

Example

We'll get a bunch of samples from the BN:
  +c, -s, +r, +w
  +c, +s, +r, +w
  -c, +s, +r, -w
  +c, -s, +r, +w
  -c, -s, -r, +w

If we want to know P(W):
  We have counts <+w: 4, -w: 1>
  Normalize to get P(W) = <+w: 0.8, -w: 0.2>
  This will get closer to the true distribution with more samples
  Can estimate anything else, too
  What about P(C | +w)?  P(C | +r, +w)?  P(C | -r, -w)?
  Fast: can use fewer samples if less time (what's the drawback?)

Rejection Sampling

Let's say we want P(C):
  No point keeping all samples around
  Just tally counts of C as we go

Let's say we want P(C | +s):
  Same thing: tally C outcomes, but ignore (reject) samples which don't have S = +s
  This is called rejection sampling
  It is also consistent for conditional probabilities (i.e., correct in the limit)

Samples:
  +c, -s, +r, +w
  +c, +s, +r, +w
  -c, +s, +r, -w
  +c, -s, +r, +w
  -c, -s, -r, +w
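A self-contained sketch of rejection sampling for P(C | +s) in Python, using the same prior sampler as the forward-sampling sketch above (illustrative, not from the slides):

```python
import random

def sample_once():
    """Prior sample of (C, S, R, W), same CPTs as above; True = +, False = -."""
    c = random.random() < 0.5
    s = random.random() < (0.1 if c else 0.5)
    r = random.random() < (0.8 if c else 0.2)
    w = random.random() < {(True, True): 0.99, (True, False): 0.90,
                           (False, True): 0.90, (False, False): 0.01}[(s, r)]
    return c, s, r, w

def rejection_sample(n_samples):
    """Estimate P(C | +s): tally C only over samples that match the evidence S = +s."""
    counts = {True: 0, False: 0}
    for _ in range(n_samples):
        c, s, r, w = sample_once()
        if not s:             # sample does not match the evidence: reject it
            continue
        counts[c] += 1
    total = counts[True] + counts[False]
    return {c: counts[c] / total for c in counts}

print(rejection_sample(100_000))
```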

Sampling Example

There are 2 cups.
  The first contains 1 penny and 1 quarter.
  The second contains 2 quarters.

Say I pick a cup uniformly at random, then pick a coin randomly from that cup. It's a quarter (yes!). What is the probability that the other coin in that cup is also a quarter?
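One way to answer this in the spirit of the lecture is to simulate it; a Monte Carlo sketch in Python (the setup mirrors the problem statement, the names are illustrative):

```python
import random

def simulate(n_trials=100_000):
    cups = [['penny', 'quarter'], ['quarter', 'quarter']]
    kept = other_is_quarter = 0
    for _ in range(n_trials):
        cup = random.choice(cups)      # pick a cup uniformly at random
        i = random.randrange(2)        # pick one of its two coins at random
        if cup[i] != 'quarter':        # keep only trials where a quarter was drawn
            continue                   # (this is exactly rejection sampling)
        kept += 1
        if cup[1 - i] == 'quarter':    # is the other coin in that cup also a quarter?
            other_is_quarter += 1
    return other_is_quarter / kept

print(simulate())   # converges to about 2/3
```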

Likelihood Weighting

Problem with rejection sampling:
  If evidence is unlikely, you reject a lot of samples
  You don't exploit your evidence as you sample
  Consider P(B | +a):

    -b, -a
    -b, -a
    -b, -a
    -b, -a
    +b, +a

Idea: fix evidence variables and sample the rest:

    -b, +a
    -b, +a
    -b, +a
    -b, +a
    +b, +a

Problem: sample distribution not consistent!
Solution: weight by probability of evidence given parents

Likelihood Weighting

Sampling distribution if z is sampled and e is fixed evidence:

  S_WE(z, e) = ∏_i P(z_i | Parents(Z_i))

Now, samples have weights:

  w(z, e) = ∏_j P(e_j | Parents(E_j))

Together, the weighted sampling distribution is consistent:

  S_WE(z, e) · w(z, e) = ∏_i P(z_i | Parents(Z_i)) · ∏_j P(e_j | Parents(E_j)) = P(z, e)

Likelihood Weighting

(Same Cloudy / Sprinkler / Rain / WetGrass network and CPTs as in the forward-sampling example.)

Sample: +c, +s, +r, +w

Inference:
  Sum over weights that match the query value
  Divide by total sample weight

What is P(C | +w, +r)?
Likelihood Weighting Example

  Cloudy  Rainy  Sprinkler  Wet Grass | Weight
  0       1      1          1         | 0.495
  0       0      1          1         | 0.45
  0       0      1          1         | 0.45
  0       0      1          1         | 0.45
  1       0      1          1         | 0.09

(Sprinkler and Wet Grass are the fixed evidence; each weight is P(+s | c) · P(+w | s, r).)
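A likelihood-weighting sketch in Python for this network, with Sprinkler and WetGrass fixed to +s and +w as in the example rows above (CPT values are the ones from the earlier tables; the names are illustrative):

```python
import random

# CPT entries needed when Sprinkler = +s and WetGrass = +w are the evidence.
P_S = {True: 0.1, False: 0.5}                        # P(+s | c)
P_R = {True: 0.8, False: 0.2}                        # P(+r | c)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}    # P(+w | s, r)

def lw_sample():
    """Sample the non-evidence variables, fix the evidence, weight by its CPT rows."""
    weight = 1.0
    c = random.random() < 0.5       # sample Cloudy from its prior
    s = True                        # evidence: Sprinkler fixed to +s ...
    weight *= P_S[c]                # ... so multiply in P(+s | c)
    r = random.random() < P_R[c]    # sample Rain given Cloudy
    w = True                        # evidence: WetGrass fixed to +w ...
    weight *= P_W[(s, r)]           # ... so multiply in P(+w | s, r)
    return (c, s, r, w), weight

def estimate_p_c(n=100_000):
    """P(+c | +s, +w) = (total weight of samples with +c) / (total weight)."""
    num = den = 0.0
    for _ in range(n):
        (c, _, _, _), wt = lw_sample()
        den += wt
        if c:
            num += wt
    return num / den

print(estimate_p_c())
```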

Likelihood Weighting

Likelihood weighting is good:
  We have taken evidence into account as we generate the sample
  E.g. here, W's value will get picked based on the evidence values of S, R
  More of our samples will reflect the state of the world suggested by the evidence

Likelihood weighting doesn't solve all our problems:
  Evidence influences the choice of downstream variables, but not upstream ones (C isn't more likely to get a value matching the evidence)
  We would like to consider evidence when we sample every variable

Markov Chain Monte Carlo*

Idea: instead of sampling from scratch, create samples that are each like the last one.

Procedure: resample one variable at a time, conditioned on all the rest, but keep evidence fixed. E.g., for P(B | +c):

  +a, +c, +b
  +a, +c, -b
  -a, +c, -b

Properties: Now samples are not independent (in fact they're nearly identical), but sample averages are still consistent estimators!

What's the point: both upstream and downstream variables condition on evidence.
Random Walks


[Explain on Blackboard]


Gibbs Sampling

1. Set all evidence E to e
2. Do forward sampling to obtain x_1, ..., x_n
3. Repeat:
   1. Pick any variable X_i uniformly at random.
   2. Resample x_i' from p(X_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n)
   3. Set all other x_j' = x_j
   4. The new sample is x_1', ..., x_n'

Markov Blanket

Markov blanket of X:
1. All parents of X
2. All children of X
3. All parents of children of X (except X itself)

X is conditionally independent from all other variables in the BN, given all variables in the Markov blanket (besides X).

Gibbs Sampling

1. Set all evidence E to e
2. Do forward sampling to obtain x_1, ..., x_n
3. Repeat:
   1. Pick any variable X_i uniformly at random.
   2. Resample x_i' from p(X_i | MarkovBlanket(X_i))
   3. Set all other x_j' = x_j
   4. The new sample is x_1', ..., x_n'
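A minimal Gibbs-sampling sketch in Python for the Cloudy / Sprinkler / Rain / WetGrass network. For a network this small, resampling X_i by renormalizing the full joint over its two values is equivalent to conditioning on its Markov blanket; the names and the burn-in-free loop are illustrative simplifications, not from the slides.

```python
import random

P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}    # P(+w | s, r)

def joint(c, s, r, w):
    """Full joint of the Cloudy/Sprinkler/Rain/WetGrass network from its CPTs."""
    p = 0.5                                                    # P(c) = 0.5 either way
    p *= (0.1 if c else 0.5) if s else (0.9 if c else 0.5)     # P(s | c)
    p *= (0.8 if c else 0.2) if r else (0.2 if c else 0.8)     # P(r | c)
    p *= P_W[(s, r)] if w else 1.0 - P_W[(s, r)]               # P(w | s, r)
    return p

def gibbs_step(state, evidence):
    """Resample one non-evidence variable from p(X_i | everything else), evidence fixed."""
    var = random.choice([v for v in state if v not in evidence])
    probs = {}
    for val in (True, False):                  # score both values of the chosen variable
        assign = dict(state, **{var: val})
        probs[val] = joint(assign['C'], assign['S'], assign['R'], assign['W'])
    new = dict(state)
    new[var] = random.random() < probs[True] / (probs[True] + probs[False])
    return new

# Example: estimate P(+c | +s, +w) by walking the chain and tallying Cloudy.
state = {'C': True, 'S': True, 'R': True, 'W': True}   # S and W already set to the evidence
evidence = {'S', 'W'}
steps, hits = 100_000, 0
for _ in range(steps):
    state = gibbs_step(state, evidence)
    hits += state['C']
print(hits / steps)   # successive samples are correlated, but the average is still consistent
```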

Summary

Sampling can be your salvation
  The dominating approach to inference in BNs

Approaches:
  Forward (/Prior) Sampling
  Rejection Sampling
  Likelihood Weighted Sampling
  Gibbs Sampling

Learning in Bayes Nets

Task 1: Given the network structure and given data, where a data point is an observed setting for the variables, learn the CPTs for the Bayes Net. Might also start with priors for CPT probabilities.

Task 2: Given only the data (and possibly a prior over Bayes Nets), learn the entire Bayes Net (both Net structure and CPTs).

Turing Award for Bayes Nets
(Judea Pearl, 2011)