
Undirected Probabilistic Graphical Models
(Markov Nets)

(Slides from Sam Roweis' lecture)

Connection to MCMC:

MCMC requires sampling a node given its Markov blanket, i.e., we need to use P(x | MB(x)).

For Bayes nets, MB(x) contains more nodes than are mentioned in the local distribution CPT(x).

For Markov nets, MB(x) is exactly the set of neighbors of x, because the neighbor relation is symmetric: nodes xi and xj are both neighbors of each other.

In contrast, note that in Bayes nets, CPTs can be filled with any real numbers between 0 and 1 (with each row summing to 1), and we can be sure the ensuing product will define a valid joint distribution!

12/2

All project presentations on 12/14 (10 min each)

All project reports due on 12/14

On 12/7, we will read and discuss the MLN paper

Today: Complete discussion of Markov Nets; start towards MLNs

[Figure: four variables A, B, C, D in a cycle, with pairwise potentials on the edges]

Qn: What is the most likely configuration of A&B?

Moral: Factors are not marginals! Although A and B would like to agree, B&C need to agree, C&D need to disagree, and D&A need to agree, and the latter three have higher weights!
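To make the moral concrete, here is a minimal sketch (the potential values below are hypothetical stand-ins, since the slide's actual tables are in the figure): enumerate all 16 configurations of the cycle and compare the joint's most likely (A,B) against what the A-B factor alone prefers.

```python
import itertools

# Hypothetical pairwise potentials (not the slide's numbers):
# A-B, B-C, D-A want agreement; C-D wants disagreement; A-B is weakest.
def agree(u, v, strength):
    return strength if u == v else 1.0

def disagree(u, v, strength):
    return strength if u != v else 1.0

best, best_score = None, -1.0
for a, b, c, d in itertools.product([0, 1], repeat=4):
    s = (agree(a, b, 2.0) * agree(b, c, 10.0)
         * disagree(c, d, 10.0) * agree(d, a, 10.0))
    if s > best_score:
        best, best_score = (a, b, c, d), s

print("most likely (A,B,C,D):", best)
# The global argmax has A != B even though the A-B factor alone prefers
# agreement: the three stronger factors around the cycle overrule it.
```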



Mr. & Mrs. Smith example



Okay, you convinced me that given any potentials we will have a consistent joint. But given any joint, will there be potentials I can provide?

Hammersley-Clifford theorem…

We can have potentials on any cliques, not just the maximal ones. So, for example, we can have a potential on A alone in addition to the other four pairwise potentials.

Markov Networks

Undirected graphical models

[Figure: network over Smoking, Cancer, Asthma, Cough]

Potential functions defined over cliques:

Smoking   Cancer   Φ(S,C)
False     False    4.5
False     True     4.5
True      False    2.7
True      True     4.5




$$P(x) = \frac{1}{Z}\prod_c \Phi_c(x_c), \qquad Z = \sum_x \prod_c \Phi_c(x_c)$$
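As a worked check of this definition, a minimal sketch treating {Smoking, Cancer} as the single clique and using the potential table above:

```python
import itertools

# Potential over the single clique {Smoking, Cancer}, from the table above.
phi = {(False, False): 4.5, (False, True): 4.5,
       (True, False): 2.7, (True, True): 4.5}

# Z = sum over all assignments x of prod_c Phi_c(x_c); only one clique here.
Z = sum(phi[x] for x in itertools.product([False, True], repeat=2))   # 16.2

# P(x) = (1/Z) * Phi(x)
for s, c in itertools.product([False, True], repeat=2):
    print(f"P(Smoking={s}, Cancer={c}) = {phi[s, c] / Z:.3f}")
```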
Log-Linear Models for Markov Nets

[Figure: four variables A, B, C, D]

Factors are “functions” over their domains.

A log-linear model consists of:

Features $f_i(D_i)$ (functions over domains)

Weights $w_i$ for the features, s.t.

$$P(x) = \frac{1}{Z}\exp\left(\sum_i w_i f_i(D_i)\right)$$

Without loss of generality! (Any Markov net with positive potentials can be written this way, e.g. with indicator features and weights $w_i = \log \Phi_i$.)

Markov Networks

Undirected graphical models

[Figure: network over Smoking, Cancer, Asthma, Cough]

Log-linear model:

$$P(x) = \frac{1}{Z}\exp\left(\sum_i w_i f_i(x)\right)$$

where $w_i$ is the weight of feature $i$ and $f_i$ is feature $i$; for example

$$f_1(\text{Smoking}, \text{Cancer}) = \begin{cases} 1 & \text{if } \neg\text{Smoking} \vee \text{Cancer} \\ 0 & \text{otherwise} \end{cases} \qquad w_1 = 1.5$$
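A minimal sketch of the log-linear form with just the single feature f1 and weight w1 = 1.5 from the slide (this parameterization need not reproduce the earlier potential table exactly; the point is the exp-of-weighted-features form):

```python
import itertools, math

# Single feature from the slide: f1 = 1 iff (Smoking => Cancer), w1 = 1.5.
w1 = 1.5
def f1(smoking, cancer):
    return 1.0 if (not smoking) or cancer else 0.0

worlds = list(itertools.product([False, True], repeat=2))
unnorm = {x: math.exp(w1 * f1(*x)) for x in worlds}   # exp(sum_i w_i f_i(x))
Z = sum(unnorm.values())
for s, c in worlds:
    print(f"P(Smoking={s}, Cancer={c}) = {unnorm[s, c] / Z:.3f}")
```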
Markov Nets vs. Bayes Nets

Property         Markov Nets         Bayes Nets
Form             Prod. potentials    Prod. potentials
Potentials       Arbitrary           Cond. probabilities
Cycles           Allowed             Forbidden
Partition func.  Z = ? (global)      Z = 1 (local)
Indep. check     Graph separation    D-separation
Indep. props.    Some                Some
Inference        MCMC, BP, etc.      Convert to Markov

Inference in Markov Networks

Goal: Compute marginals & conditionals of

$$P(X) = \frac{1}{Z}\exp\left(\sum_i w_i f_i(X)\right), \qquad Z = \sum_X \exp\left(\sum_i w_i f_i(X)\right)$$

Exact inference is #P-complete.

Most BN inference approaches work for MNs too: Variable Elimination used factor multiplication and should work without change.

Conditioning on the Markov blanket is easy:

$$P(x_i \mid MB(x_i)) = \frac{\exp\left(\sum_j w_j f_j(x)\right)}{\exp\left(\sum_j w_j f_j(x_{[x_i=0]})\right) + \exp\left(\sum_j w_j f_j(x_{[x_i=1]})\right)}$$

Gibbs sampling exploits this.
MCMC: Gibbs Sampling

state ← random truth assignment
for i ← 1 to num-samples do
    for each variable x
        sample x according to P(x | neighbors(x))
        state ← state with new value of x
P(F) ← fraction of states in which F is true
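A runnable sketch of this loop (the network, weights, and query F below are my own toy choices; the per-variable conditional is the two-way normalization from the inference slide above):

```python
import math, random

# Toy log-linear Markov net: binary variables on a cycle, one weighted
# agreement/disagreement feature per edge (my own example).
random.seed(0)
V = ["A", "B", "C", "D"]
feats = [(1.0, ("A", "B"), lambda a, b: a == b),   # A-B: agree (weak)
         (2.0, ("B", "C"), lambda b, c: b == c),   # B-C: agree
         (2.0, ("C", "D"), lambda c, d: c != d),   # C-D: disagree
         (2.0, ("D", "A"), lambda d, a: d == a)]   # D-A: agree

def score(state):
    return sum(w * f(*(state[v] for v in vs)) for w, vs, f in feats)

state = {v: random.random() < 0.5 for v in V}      # random truth assignment
hits, num_samples = 0, 5000
for _ in range(num_samples):                       # (no burn-in, for brevity)
    for x in V:
        # P(x = v | MB(x)) via the two-way normalization over x's values
        weights = []
        for val in (False, True):
            state[x] = val
            weights.append(math.exp(score(state)))
        state[x] = random.random() < weights[1] / (weights[0] + weights[1])
    hits += (state["A"] != state["B"])             # query F: A and B disagree
print("P(F) ~= fraction of states in which F is true:", hits / num_samples)
```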

Other Inference Methods


Many variations of MCMC


Belief propagation (sum-product)


Variational approximation


Exact methods

Learning Markov Networks


Learning parameters (weights)


Generatively


Discriminatively


Learning structure (features)


Easy Case: Assume complete data (If not: EM versions of algorithms)

Entanglement in log likelihood…

[Figure: three-node chain a - b - c]

Learning for log-linear formulation

Use gradient ascent. Unimodal, because the Hessian is the (negative) covariance matrix over features.

What is the expected value of the feature given the current parameterization of the network? Requires inference to answer (inference at every iteration, sort of like EM).

Why should we spend so much time computing the gradient?

Given that the gradient is being used only in doing the gradient ascent iteration, it might look as if we should just be able to approximate it in any which way. After all, we are going to take a step with some arbitrary step size anyway..

..But the thing to keep in mind is that the gradient is a vector. We are talking not just of magnitude but direction. A mistake in magnitude can change the direction of the vector and push the search into a completely wrong direction…

Generative Weight Learning

Maximize likelihood or posterior probability.

Numerical optimization (gradient or 2nd order). No local maxima:

$$\frac{\partial}{\partial w_i}\log P_w(x) = n_i(x) - E_w[n_i(x)]$$

where $n_i(x)$ = no. of times feature $i$ is true in the data, and $E_w[n_i(x)]$ = expected no. of times feature $i$ is true according to the model

$$P(X) = \frac{1}{Z}\exp\left(\sum_i w_i f_i(X)\right), \qquad Z = \sum_X \exp\left(\sum_i w_i f_i(X)\right)$$

Requires inference at each step (slow!)
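A minimal sketch of this gradient ascent on a two-variable toy model (my own data and single feature; the model is small enough that the expected count is computed by exact enumeration rather than inference):

```python
import itertools, math

# Gradient of avg. log-likelihood per feature: n_i(data)/|data| - E_w[n_i]
feats = [lambda s, c: 1.0 if (not s) or c else 0.0]  # f1: Smoking => Cancer
w = [0.0]
data = [(True, True), (False, False), (True, False), (False, True), (True, True)]
worlds = list(itertools.product([False, True], repeat=2))

def expected_counts(w):
    unnorm = [math.exp(sum(wi * f(*x) for wi, f in zip(w, feats))) for x in worlds]
    Z = sum(unnorm)
    return [sum((u / Z) * f(*x) for u, x in zip(unnorm, worlds)) for f in feats]

lr = 0.5
for _ in range(200):
    n_data = [sum(f(*x) for x in data) / len(data) for f in feats]
    grad = [nd - em for nd, em in zip(n_data, expected_counts(w))]
    w = [wi + lr * g for wi, g in zip(w, grad)]
print("learned w1 ~=", round(w[0], 3))   # ~ log(4/3) ~= 0.288 for this data
```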
Alternative Objectives to maximize..

Since log-likelihood requires network inference to compute the derivative, we might want to focus on other objectives whose gradients are easier to compute (and which also hopefully have optima at the same parameter values).

Two options: Pseudo-Likelihood and Contrastive Divergence.

Given a single data instance x, the log-likelihood is

[log prob of data] − [log prob of all other possible data instances] (w.r.t. current θ)

Contrastive Divergence: maximize the distance (“increase the divergence”). Pick a sample of typical other instances (need to sample from P_θ; run MCMC initializing with the data..).

Pseudo-Likelihood: compute the likelihood of each possible data instance just using the Markov blanket (approximate chain rule).

Pseudo-Likelihood

$$PL(x) = \prod_i P(x_i \mid \mathit{neighbors}(x_i))$$

Likelihood of each variable given its neighbors in the data.

Does not require inference at each step.

Consistent estimator.

Widely used in vision, spatial statistics, etc.

But PL parameters may not work well for long inference chains [which can lead to disastrous results].
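A minimal sketch of computing PL(x) for a toy log-linear model (variables and features are my own; each conditional needs only a two-way normalization over one variable, so no partition function over whole worlds appears):

```python
import math

# Toy log-linear model: weighted equality features on a cycle (my own).
feats = [(1.0, ("A", "B")), (2.0, ("B", "C")), (2.0, ("C", "D")), (2.0, ("D", "A"))]

def score(state):
    return sum(w * (state[u] == state[v]) for w, (u, v) in feats)

def pseudo_likelihood(x):
    pl = 1.0
    for v in x:
        s = {}
        for val in (False, True):
            y = dict(x)
            y[v] = val
            s[val] = math.exp(score(y))   # factors not touching v cancel
        pl *= s[x[v]] / (s[False] + s[True])   # P(x_v | MB(x_v))
    return pl

print(pseudo_likelihood({"A": True, "B": True, "C": True, "D": False}))
```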

Discriminative Weight Learning

Maximize conditional likelihood of query ($y$) given evidence ($x$):

$$\frac{\partial}{\partial w_i}\log P_w(y \mid x) = n_i(x, y) - E_w[n_i(x, y)]$$

where $n_i(x,y)$ = no. of true groundings of clause $i$ in the data, and $E_w[n_i(x,y)]$ = expected no. of true groundings according to the model.

Approximate expected counts by counts in the MAP state of $y$ given $x$.

Structure Learning

How to learn the structure of a Markov network?

… not too different from learning structure for a Bayes network: discrete search through the space of possible graphs, trying to maximize data probability….


MLNs: Points to ponder

Compared to ground representations, MLNs have easier learning but equal or harder inference

MLNs need to learn significantly fewer parameters than a ground network of similar size

MLNs may be compelled to exploit the “relational” structure and thus may spend time inventing lifted inference methods



Inference approaches

Learning

    Parameter: Why Pseudo-Likelihood?

    Structure: implies learning clauses.. (what ILP does)

Connection to Dynamic Bayes Nets?

Relational

Markov Logic: Intuition

A logical KB is a set of hard constraints on the set of possible worlds

Let's make them soft constraints: when a world violates a formula, it becomes less probable, not impossible

Give each formula a weight (higher weight ⇒ stronger constraint):

$$P(\text{world}) \propto \exp\left(\sum \text{weights of formulas it satisfies}\right)$$
Markov Logic: Definition

A Markov Logic Network (MLN) is a set of pairs (F, w) where

F is a formula in first-order logic

w is a real number

Together with a set of constants, it defines a Markov network with

One node for each grounding of each predicate in the MLN

One feature for each grounding of each formula F in the MLN, with the corresponding weight w
Example: Friends & Smokers

habits.

smoking

similar

have

Friends
cancer.

causes

Smoking
Example: Friends & Smokers



)
(
)
(
)
,
(
,
)
(
)
(
y
Smokes
x
Smokes
y
x
Friends
y
x
x
Cancer
x
Smokes
x





Example: Friends & Smokers



)
(
)
(
)
,
(
,
)
(
)
(
y
Smokes
x
Smokes
y
x
Friends
y
x
x
Cancer
x
Smokes
x





1
.
1
5
.
1
Example: Friends & Smokers



)
(
)
(
)
,
(
,
)
(
)
(
y
Smokes
x
Smokes
y
x
Friends
y
x
x
Cancer
x
Smokes
x





1
.
1
5
.
1
Two constants:
Anna

(A) and
Bob

(B)

Example: Friends & Smokers



)
(
)
(
)
,
(
,
)
(
)
(
y
Smokes
x
Smokes
y
x
Friends
y
x
x
Cancer
x
Smokes
x





1
.
1
5
.
1
Cancer(A)

Smokes(A)

Smokes(B)

Cancer(B)

Two constants:
Anna

(A) and
Bob

(B)

Example: Friends & Smokers



)
(
)
(
)
,
(
,
)
(
)
(
y
Smokes
x
Smokes
y
x
Friends
y
x
x
Cancer
x
Smokes
x





1
.
1
5
.
1
Cancer(A)

Smokes(A)

Friends(A,A)

Friends(B,A)

Smokes(B)

Friends(A,B)

Cancer(B)

Friends(B,B)

Two constants:
Anna

(A) and
Bob

(B)

Example: Friends & Smokers



)
(
)
(
)
,
(
,
)
(
)
(
y
Smokes
x
Smokes
y
x
Friends
y
x
x
Cancer
x
Smokes
x





1
.
1
5
.
1
Cancer(A)

Smokes(A)

Friends(A,A)

Friends(B,A)

Smokes(B)

Friends(A,B)

Cancer(B)

Friends(B,B)

Two constants:
Anna

(A) and
Bob

(B)

Example: Friends & Smokers



)
(
)
(
)
,
(
,
)
(
)
(
y
Smokes
x
Smokes
y
x
Friends
y
x
x
Cancer
x
Smokes
x





1
.
1
5
.
1
Cancer(A)

Smokes(A)

Friends(A,A)

Friends(B,A)

Smokes(B)

Friends(A,B)

Cancer(B)

Friends(B,B)

Two constants:
Anna

(A) and
Bob

(B)

Markov Logic Networks

MLN is a template for ground Markov nets

Probability of a world x:

$$P(x) = \frac{1}{Z}\exp\left(\sum_i w_i n_i(x)\right)$$

where $w_i$ = weight of formula $i$ and $n_i(x)$ = no. of true groundings of formula $i$ in $x$.

Typed variables and constants greatly reduce the size of the ground Markov net

Functions, existential quantifiers, etc.

Infinite and continuous domains
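As a worked instance, a minimal sketch computing the unnormalized weight exp(Σᵢ wᵢ nᵢ(x)) of one world of the Friends & Smokers MLN above, with constants Anna (A) and Bob (B) (the particular world below is an arbitrary choice):

```python
import itertools, math

people = ["A", "B"]   # Anna, Bob

def true_groundings(smokes, cancer, friends):
    # Formula 1 (w = 1.5): Smokes(x) => Cancer(x), one grounding per person
    n1 = sum((not smokes[x]) or cancer[x] for x in people)
    # Formula 2 (w = 1.1): Friends(x,y) => (Smokes(x) <=> Smokes(y)),
    # one grounding per ordered pair (x, y)
    n2 = sum((not friends[x, y]) or (smokes[x] == smokes[y])
             for x, y in itertools.product(people, repeat=2))
    return n1, n2

smokes = {"A": True, "B": False}
cancer = {"A": True, "B": False}
friends = {(x, y): x != y for x, y in itertools.product(people, repeat=2)}

n1, n2 = true_groundings(smokes, cancer, friends)
print("n1 =", n1, "n2 =", n2, "weight =", math.exp(1.5 * n1 + 1.1 * n2))
# Dividing by Z (this quantity summed over all worlds) gives P(x).
```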
Relation to Statistical Models

Special cases (obtained by making all predicates zero-arity):

Markov networks
Markov random fields
Bayesian networks
Log-linear models
Exponential models
Max. entropy models
Gibbs distributions
Boltzmann machines
Logistic regression
Hidden Markov models
Conditional random fields

Markov logic allows objects to be interdependent (non-i.i.d.)

Relation to First-Order Logic

Infinite weights ⇒ first-order logic

Satisfiable KB, positive weights ⇒ satisfying assignments = modes of distribution

Markov logic allows contradictions between formulas

MAP/MPE Inference

Problem: Find most likely state of world given evidence (query $y$, evidence $x$):

$$\arg\max_y P(y \mid x) \;=\; \arg\max_y \frac{1}{Z_x}\exp\left(\sum_i w_i n_i(x, y)\right) \;=\; \arg\max_y \sum_i w_i n_i(x, y)$$

This is just the weighted MaxSAT problem.

Use a weighted SAT solver (e.g., MaxWalkSAT [Kautz et al., 1997]).

Potentially faster than logical inference (!)
The MaxWalkSAT Algorithm

for i ← 1 to max-tries do
    solution = random truth assignment
    for j ← 1 to max-flips do
        if ∑ weights(sat. clauses) > threshold then
            return solution
        c ← random unsatisfied clause
        with probability p
            flip a random variable in c
        else
            flip variable in c that maximizes ∑ weights(sat. clauses)
return failure, best solution found
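A minimal runnable sketch of the algorithm above (the weighted clauses are my own toy instance; the clause representation and parameter defaults are arbitrary choices, not Kautz et al.'s implementation):

```python
import random

# A clause is (weight, [(var, wanted_value), ...]) and is satisfied when
# at least one literal matches the assignment.
random.seed(0)
clauses = [(1.5, [("S_A", False), ("C_A", True)]),   # Smokes(A) => Cancer(A)
           (1.5, [("S_B", False), ("C_B", True)]),   # Smokes(B) => Cancer(B)
           (1.1, [("S_A", True)]),                   # unit clause: Smokes(A)
           (2.0, [("C_A", False)])]                  # unit clause: !Cancer(A)
varnames = sorted({v for _, lits in clauses for v, _ in lits})

def sat_weight(a):
    return sum(w for w, lits in clauses if any(a[v] == val for v, val in lits))

def maxwalksat(max_tries=10, max_flips=1000, p=0.5, threshold=4.9):
    best, best_w = None, -1.0
    for _ in range(max_tries):
        a = {v: random.random() < 0.5 for v in varnames}  # random assignment
        for _ in range(max_flips):
            if sat_weight(a) > threshold:
                return a                              # good enough: stop
            unsat = [lits for w, lits in clauses
                     if not any(a[v] == val for v, val in lits)]
            if not unsat:
                break
            lits = random.choice(unsat)               # random unsatisfied clause
            if random.random() < p:                   # random walk step
                v = random.choice([v for v, _ in lits])
            else:                                     # greedy step within c
                v = max((v for v, _ in lits),
                        key=lambda u: sat_weight({**a, u: not a[u]}))
            a[v] = not a[v]
            if sat_weight(a) > best_w:
                best, best_w = dict(a), sat_weight(a)
    return best                                       # best solution found

print(maxwalksat())
```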

But … Memory Explosion

Problem: If there are n constants and the highest clause arity is c, the ground network requires O(n^c) memory.

Solution: Exploit sparseness; ground clauses lazily

LazySAT algorithm [Singla & Domingos, 2006]

Computing Probabilities

P(Formula | MLN, C) = ?

MCMC: Sample worlds, check formula holds

P(Formula1 | Formula2, MLN, C) = ?

If Formula2 = conjunction of ground atoms:

First construct min subset of network necessary to answer query (generalization of KBMC)

Then apply MCMC (or other)

Can also do lifted inference [Braz et al., 2005]

Ground Network Construction

network ← Ø
queue ← query nodes
repeat
    node ← front(queue)
    remove node from queue
    add node to network
    if node not in evidence then
        add neighbors(node) to queue
until queue = Ø
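A minimal sketch of this construction (the neighbor function and toy atoms are hypothetical; a visited check is added so atoms are not processed twice):

```python
from collections import deque

def construct_network(query_nodes, evidence, neighbors):
    network = set()
    queue = deque(query_nodes)        # queue <- query nodes
    while queue:                      # until queue is empty
        node = queue.popleft()        # node <- front(queue); remove it
        if node in network:
            continue                  # already added
        network.add(node)             # add node to network
        if node not in evidence:      # stop expanding at evidence nodes
            queue.extend(neighbors(node))
    return network

# Hypothetical toy blanket structure over Friends & Smokers ground atoms:
adj = {"Cancer(A)": ["Smokes(A)"],
       "Smokes(A)": ["Cancer(A)", "Friends(A,B)", "Smokes(B)"],
       "Friends(A,B)": ["Smokes(A)", "Smokes(B)"]}
net = construct_network(["Cancer(A)"], {"Friends(A,B)"}, lambda n: adj.get(n, []))
print(sorted(net))
```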

But … Insufficient for Logic

Problem: Deterministic dependencies break MCMC; near-deterministic ones make it very slow

Solution: Combine MCMC and WalkSAT

MC-SAT algorithm [Poon & Domingos, 2006]

Learning


Data is a relational database


Closed world assumption (if not: EM)


Learning parameters (weights)


Learning structure (formulas)


Parameter tying: Groundings of same clause







Generative learning: Pseudo-likelihood

Discriminative learning: Cond. likelihood, use MC-SAT or MaxWalkSAT for inference

Weight Learning

$$\frac{\partial}{\partial w_i}\log P_w(x) = n_i(x) - E_w[n_i(x)]$$

where $n_i(x)$ = no. of times clause $i$ is true in the data, and $E_w[n_i(x)]$ = expected no. of times clause $i$ is true according to the MLN.
Structure Learning


Generalizes feature induction in Markov nets


Any inductive logic programming approach can be used, but . . .


Goal is to induce any clauses, not just Horn


Evaluation function should be likelihood


Requires learning weights for each candidate


Turns out not to be bottleneck


Bottleneck is counting clause groundings


Solution: Subsampling



Structure Learning

Initial state: Unit clauses or hand-coded KB

Operators: Add/remove literal, flip sign

Evaluation function: Pseudo-likelihood + structure prior

Search: Beam, shortest-first, bottom-up [Kok & Domingos, 2005; Mihalkova & Mooney, 2007]

Alchemy

Open-source software including:


Full first-order logic syntax


Generative & discriminative weight learning


Structure learning


Weighted satisfiability and MCMC


Programming language features


alchemy.cs.washington.edu

                 Alchemy                    Prolog           BUGS
Representation   F.O. Logic + Markov nets   Horn clauses     Bayes nets
Inference        Model checking, MC-SAT     Theorem proving  Gibbs sampling
Learning         Parameters & structure     No               Params.
Uncertainty      Yes                        No               Yes
Relational       Yes                        Yes              No