# Undirected Probabilistic Graphical Models (Markov Nets)

Artificial Intelligence and Robotics

7 Nov 2013

(Slides from Sam Roweis's lecture)

Connection to MCMC:

MCMC requires sampling a node given its Markov blanket, i.e., we need P(x | MB(x)).

For Bayes nets, MB(x) contains more nodes than are mentioned in the local distribution CPT(x).

For Markov nets, MB(x) is just the neighbors of x: because the neighbor relation is symmetric, nodes x_i and x_j are always neighbors of each other.

In contrast, note that in Bayes nets, CPTs can be filled with any real numbers between 0 and 1, and we can be sure the ensuing product will define a valid joint distribution!

12/2

All project presentations on 12/14 (10 min each)

All project reports due on 12/14

On 12/7, we will read and discuss the MLN paper

Today: Complete discussion of Markov Nets; start towards MLNs

[Figure: a four-node cycle A - B - C - D with pairwise potentials]

Qn: What is the most likely configuration of A & B?

Moral: Factors are not marginals!

Although A & B would like to agree, B & C need to agree, C & D need to disagree, and D & A need to agree, and the latter three have higher weights!
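To see the moral concretely, here is a small brute-force sketch (Python, with hypothetical weights chosen to match the description above; none of the numbers come from the slides):

```python
from itertools import product

# Hypothetical pairwise weights for the cycle A-B-C-D: A & B would like to
# agree (low weight); B & C and D & A need to agree and C & D need to
# disagree (all with higher weight).
w_AB, w_BC, w_CD, w_DA = 1.0, 2.0, 2.0, 2.0

def score(a, b, c, d):
    """Sum of log-potentials for one configuration of the four binary nodes."""
    return (w_AB * (a == b) + w_BC * (b == c)
            + w_CD * (c != d) + w_DA * (d == a))

# Enumerate all 16 configurations and take the most likely one.
best = max(product([0, 1], repeat=4), key=lambda x: score(*x))
a, b, c, d = best
print(a != b)  # True: in the MAP state A and B disagree, even though their
               # own factor prefers agreement -- factors are not marginals!
```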

Mr. & Mrs. Smith example

Okay, you convinced me that given any potentials we will have a consistent joint. But given any joint, will there be potentials I can provide?

Hammersley-Clifford theorem…

We can have potentials on any cliques, not just the maximal ones. So, for example, we can have a potential on A in addition to the other four pairwise potentials.

Markov Networks

Undirected graphical models

[Figure: network over Smoking, Cancer, Cough, Asthma]

Potential functions defined over cliques

| Smoking | Cancer | Φ(S,C) |
|---------|--------|--------|
| False   | False  | 4.5    |
| False   | True   | 4.5    |
| True    | False  | 2.7    |
| True    | True   | 4.5    |

$$P(x) = \frac{1}{Z} \prod_{c} \phi_c(x_c), \qquad Z = \sum_{x} \prod_{c} \phi_c(x_c)$$
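As a sanity check, the Φ(S,C) table can be turned into a joint distribution exactly as the formula prescribes; a minimal Python sketch (not part of the slides):

```python
from itertools import product

# Potential table Φ(S, C) from the slide. Entries are arbitrary nonnegative
# numbers, not probabilities -- normalization happens via Z.
phi = {
    (False, False): 4.5,
    (False, True):  4.5,
    (True,  False): 2.7,
    (True,  True):  4.5,
}

# Single clique {Smoking, Cancer}: P(s, c) = Φ(s, c) / Z
Z = sum(phi[sc] for sc in product([False, True], repeat=2))
P = {sc: v / Z for sc, v in phi.items()}

print(round(Z, 1), round(P[(True, False)], 3))  # Z = 16.2
```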
Log-Linear Models for Markov Nets

[Figure: the four-node example A - B - C - D]

Factors are "functions" over their domains.

A log-linear model consists of:

Features f_i(D_i) (functions over domains)

Weights w_i for the features, s.t. each factor is the exponential of a weighted feature: φ_i(D_i) = exp(w_i f_i(D_i))

Without loss of generality! (Any strictly positive potential can be written this way by taking logs.)

Markov Networks

Undirected graphical models

[Figure: network over Smoking, Cancer, Cough, Asthma]

Log-linear model:

$$P(x) = \frac{1}{Z} \exp\Big(\sum_i w_i f_i(x)\Big)$$

where w_i is the weight of feature i, and feature i is, for example:

$$f_1(\text{Smoking}, \text{Cancer}) = \begin{cases} 1 & \text{if } \neg\text{Smoking} \lor \text{Cancer} \\ 0 & \text{otherwise} \end{cases} \qquad w_1 = 1.5$$
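The log-linear form can be evaluated directly; a sketch using a single feature f1 with weight 1.5 over (Smoking, Cancer), with the two-variable restriction an assumption for illustration:

```python
import math
from itertools import product

# One feature over (Smoking, Cancer): f1 = 1 iff ¬Smoking ∨ Cancer, weight 1.5.
def f1(smoking, cancer):
    return 1.0 if (not smoking) or cancer else 0.0

w1 = 1.5
worlds = list(product([False, True], repeat=2))

# P(x) = (1/Z) exp(w1 * f1(x))
unnorm = {x: math.exp(w1 * f1(*x)) for x in worlds}
Z = sum(unnorm.values())
P = {x: u / Z for x, u in unnorm.items()}

# The only world violating the feature is the least probable one.
print(min(P, key=P.get))  # (True, False): Smoking without Cancer
```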
Markov Nets vs. Bayes Nets

| Property        | Markov Nets      | Bayes Nets          |
|-----------------|------------------|---------------------|
| Form            | Prod. potentials | Prod. potentials    |
| Potentials      | Arbitrary        | Cond. probabilities |
| Cycles          | Allowed          | Forbidden           |
| Partition func. | Z = ? (global)   | Z = 1 (local)       |
| Indep. check    | Graph separation | D-separation        |
| Indep. props.   | Some             | Some                |
| Inference       | MCMC, BP, etc.   | Convert to Markov   |

Inference in Markov Networks

Goal: Compute marginals & conditionals of P

Exact inference is #P-complete

Most BN inference approaches work for MNs too

Variable elimination used factor multiplication and should work without change

Conditioning on the Markov blanket is easy; Gibbs sampling exploits this:
$$P(x_i \mid MB(x_i)) = \frac{\exp\big(\sum_i w_i f_i(x)\big)}{\exp\big(\sum_i w_i f_i(x_{[x_i=0]})\big) + \exp\big(\sum_i w_i f_i(x_{[x_i=1]})\big)}$$

$$P(X) = \frac{1}{Z} \exp\Big(\sum_i w_i f_i(X)\Big), \qquad Z = \sum_X \exp\Big(\sum_i w_i f_i(X)\Big)$$
MCMC: Gibbs Sampling

    state ← random truth assignment
    for i ← 1 to num-samples do
        for each variable x
            sample x according to P(x | neighbors(x))
            state ← state with new value of x
    P(F) ← fraction of states in which F is true
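The loop above is easy to implement; a minimal sketch for a hypothetical three-node binary chain a-b-c with agreement potential exp(w) on each edge (none of these numbers come from the slides):

```python
import math
import random

random.seed(0)

w = 1.0  # agreement weight per edge (hypothetical)
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

def p_one(x, state):
    """P(x = 1 | neighbors(x)) -- only the Markov blanket is needed."""
    score = {v: math.exp(w * sum(state[nb] == v for nb in neighbors[x]))
             for v in (0, 1)}
    return score[1] / (score[0] + score[1])

state = {v: random.randint(0, 1) for v in neighbors}
num_samples, agree = 20000, 0
for _ in range(num_samples):
    for x in neighbors:           # one Gibbs sweep
        state[x] = 1 if random.random() < p_one(x, state) else 0
    agree += state["a"] == state["c"]

# P(a == c) should come out well above 1/2: agreement propagates through b.
print(agree / num_samples)
```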

Other Inference Methods

Many variations of MCMC

Belief propagation (sum-product)

Variational approximation

Exact methods

Learning Markov Networks

Learning parameters (weights): generatively or discriminatively

Learning structure (features)

Easy case: assume complete data (if not: EM versions of the algorithms)

Entanglement in the log likelihood…

[Figure: chain a - b - c]

Learning for the log-linear formulation

Unimodal, because the Hessian is the (negative) covariance matrix over features

What is the expected value of the feature given the current parameterization of the network?

Requires inference to answer (inference at every iteration, sort of like EM)

Why should we spend so much time computing the gradient?

Given that the gradient is used only in the gradient-ascent iteration, it might look as if we should be able to approximate it any which way. After all, we are going to take a step with some arbitrary step size anyway..

..But the thing to keep in mind is that the gradient is a vector. We are talking not just of magnitude but of direction. A mistake in magnitude can change the direction of the vector and push the search in a completely wrong direction…

Generative Weight Learning

Maximize likelihood or posterior probability

Numerical optimization (gradient or 2nd order)

No local maxima

Requires inference at each step (slow!)

$$\frac{\partial}{\partial w_i} \log P_w(x) = n_i(x) - E_w[n_i(x)]$$

where n_i(x) is the no. of times feature i is true in the data, and E_w[n_i(x)] is the expected no. of times feature i is true according to the model:

$$P(X) = \frac{1}{Z} \exp\Big(\sum_i w_i f_i(X)\Big), \qquad Z = \sum_X \exp\Big(\sum_i w_i f_i(X)\Big)$$
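On a model small enough to enumerate, the gradient n_i(x) - E_w[n_i(x)] can be computed exactly; a sketch with a single hypothetical feature (¬Smoking ∨ Cancer, weight 1.5):

```python
import math
from itertools import product

# d/dw_i log P_w(x) = n_i(x) - E_w[n_i(x)], via brute-force enumeration.
features = [lambda s, c: 1.0 if (not s) or c else 0.0]  # f1: ¬Smoking ∨ Cancer
w = [1.5]

worlds = list(product([False, True], repeat=2))

def unnorm(x):
    return math.exp(sum(wi * fi(*x) for wi, fi in zip(w, features)))

Z = sum(unnorm(x) for x in worlds)

def expected(i):
    """E_w[f_i] under the current parameterization -- this is the part that
    requires inference (here: exact enumeration over all worlds)."""
    return sum(features[i](*x) * unnorm(x) / Z for x in worlds)

data = (True, True)  # a single observed world
grad = [features[i](*data) - expected(i) for i in range(len(w))]
print(grad[0] > 0)  # True: the data satisfies f1 more than the model expects
```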
Alternative Objectives to Maximize..

Since the log-likelihood requires network inference to compute the derivative, we might want to focus on other objectives whose gradients are easier to compute (and which also hopefully have optima at the same parameter values).

Two options:

Pseudo-likelihood: compute the likelihood of each possible data instance just using its Markov blanket (approximate chain rule)

Contrastive divergence: given a single data instance x, the log-likelihood is the log prob of the data minus the log prob of all other possible data instances (w.r.t. the current θ). Maximize the distance ("increase the divergence"). Pick a sample of typical other instances (need to sample from P_θ; run MCMC initializing with the data..)

Pseudo-Likelihood

$$PL(x) \equiv \prod_i P(x_i \mid neighbors(x_i))$$

Likelihood of each variable given its neighbors in the data

Does not require inference at each step

Consistent estimator

Widely used in vision, spatial statistics, etc.

But PL parameters may not work well for long inference chains

[Which can lead to disastrous results]
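A sketch of the pseudo-likelihood for a hypothetical three-node binary chain a-b-c with agreement weight w per edge; note that no global partition function Z ever appears:

```python
import math

# PL(x) = Π_i P(x_i | neighbors(x_i)) for a hypothetical binary chain a-b-c
# with agreement potential exp(w) per edge (numbers are illustrative only).
w = 1.0
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

def cond(var, val, x):
    """P(x_var = val | neighbors(x_var)) -- a purely local computation."""
    local = {v: math.exp(w * sum(x[nb] == v for nb in neighbors[var]))
             for v in (0, 1)}
    return local[val] / (local[0] + local[1])

def pseudo_likelihood(x):
    pl = 1.0
    for var in x:
        pl *= cond(var, x[var], x)
    return pl

x = {"a": 1, "b": 1, "c": 0}
print(round(pseudo_likelihood(x), 4))
```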

Discriminative Weight Learning

Maximize conditional likelihood of the query (y) given the evidence (x):

$$\frac{\partial}{\partial w_i} \log P_w(y \mid x) = n_i(x, y) - E_w[n_i(x, y)]$$

where n_i(x, y) is the no. of true groundings of clause i in the data, and E_w[n_i(x, y)] is the expected no. of true groundings according to the model

Approximate the expected counts by the counts in the MAP state of y given x

Structure Learning

How to learn the structure of a Markov
network?

… not too different from learning structure for a
Bayes network: discrete search through space of
possible graphs, trying to maximize data
probability….

MLNs: Points to Ponder

Compared to ground representations, MLNs have easier learning but equally hard inference

MLNs need to learn significantly fewer parameters than a ground network of similar size

MLNs may be compelled to exploit the "relational" structure, and thus may spend time inventing lifted inference methods

Inference approaches

Learning

Parameters: why pseudo-likelihood?

Structure: implies learning clauses.. (what ILP does)

Connection to Dynamic Bayes Nets?

Relational

Markov Logic: Intuition

A logical KB is a set of hard constraints on the set of possible worlds

Let's make them soft constraints: when a world violates a formula, it becomes less probable, not impossible

Give each formula a weight (higher weight ⇒ stronger constraint):

$$P(\text{world}) \propto \exp\Big(\sum \text{weights of formulas it satisfies}\Big)$$
Markov Logic: Definition

A Markov Logic Network (MLN) is a set of pairs (F, w) where

F is a formula in first-order logic

w is a real number

Together with a set of constants, it defines a Markov network with:

One node for each grounding of each predicate in the MLN

One feature for each grounding of each formula F in the MLN, with the corresponding weight w

Example: Friends & Smokers

Smoking causes cancer.

Friends have similar smoking habits.
Example: Friends & Smokers

$$\forall x\; Smokes(x) \Rightarrow Cancer(x) \qquad (w = 1.5)$$
$$\forall x, y\; Friends(x,y) \Rightarrow (Smokes(x) \Leftrightarrow Smokes(y)) \qquad (w = 1.1)$$

Two constants: Anna (A) and Bob (B)

Ground atoms: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)

[Successive slides build up the ground Markov network over these atoms, adding the edges induced by each formula grounding.]
Markov Logic Networks

An MLN is a template for ground Markov nets

Probability of a world x:

$$P(x) = \frac{1}{Z} \exp\Big(\sum_i w_i n_i(x)\Big)$$

where w_i is the weight of formula i and n_i(x) is the no. of true groundings of formula i in x

Typed variables and constants greatly reduce the size of the ground Markov net

Functions, existential quantifiers, etc.

Infinite and continuous domains
Relation to Statistical Models

Special cases (obtained by making all predicates zero-arity):

Markov networks, Markov random fields, Bayesian networks

Log-linear models, exponential models, max. entropy models

Gibbs distributions, Boltzmann machines

Logistic regression, hidden Markov models, conditional random fields

Markov logic allows objects to be interdependent (non-i.i.d.)

Relation to First-Order Logic

Infinite weights ⇒ first-order logic

Satisfiable KB, positive weights ⇒ satisfying assignments = modes of the distribution

Markov logic allows contradictions between formulas

MAP/MPE Inference

Problem: Find the most likely state of the world given evidence

$$\arg\max_{y} P(y \mid x)$$

(y: query, x: evidence)

MAP/MPE Inference

Problem: Find the most likely state of the world given evidence

$$\arg\max_{y} \frac{1}{Z_x} \exp\Big(\sum_i w_i n_i(x, y)\Big)$$
MAP/MPE Inference

Problem: Find the most likely state of the world given evidence

$$\arg\max_{y} \sum_i w_i n_i(x, y)$$
MAP/MPE Inference

Problem: Find the most likely state of the world given evidence

$$\arg\max_{y} \sum_i w_i n_i(x, y)$$

This is just the weighted MaxSAT problem

Use a weighted SAT solver (e.g., MaxWalkSAT [Kautz et al., 1997])

Potentially faster than logical inference (!)
The MaxWalkSAT Algorithm

    for i ← 1 to max-tries do
        solution = random truth assignment
        for j ← 1 to max-flips do
            if ∑ weights(sat. clauses) > threshold then
                return solution
            c ← random unsatisfied clause
            with probability p
                flip a random variable in c
            else
                flip the variable in c that maximizes ∑ weights(sat. clauses)
    return failure, best solution found
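A compact runnable sketch of the algorithm on a tiny weighted-SAT instance (clauses, weights, and threshold are made up for illustration, not taken from the slides):

```python
import random

random.seed(1)

# Clauses are (weight, [literals]); a literal (var, wanted) is satisfied when
# the variable has that truth value. Instance: ¬S ∨ C (1.5), S (1.1), ¬C (0.5).
clauses = [
    (1.5, [("S", False), ("C", True)]),
    (1.1, [("S", True)]),
    (0.5, [("C", False)]),
]
variables = ["S", "C"]

def sat_weight(assign):
    return sum(w for w, lits in clauses
               if any(assign[v] == val for v, val in lits))

def maxwalksat(max_tries=10, max_flips=100, p=0.5, threshold=2.5):
    best, best_w = None, -1.0
    for _ in range(max_tries):
        assign = {v: random.random() < 0.5 for v in variables}
        for _ in range(max_flips):
            if sat_weight(assign) > threshold:
                return assign
            unsat = [lits for w, lits in clauses
                     if not any(assign[v] == val for v, val in lits)]
            if not unsat:
                break
            lits = random.choice(unsat)
            if random.random() < p:
                var = random.choice(lits)[0]          # random-walk step
            else:                                     # greedy step
                def weight_if_flipped(v):
                    flipped = dict(assign, **{v: not assign[v]})
                    return sat_weight(flipped)
                var = max((v for v, _ in lits), key=weight_if_flipped)
            assign[var] = not assign[var]
            if sat_weight(assign) > best_w:
                best, best_w = dict(assign), sat_weight(assign)
    return best

solution = maxwalksat()
print(solution)  # best assignment: S=True, C=True (weight 1.5 + 1.1 = 2.6)
```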

But … Memory Explosion

Problem: If there are n constants and the highest clause arity is c, the ground network requires O(n^c) memory

Solution: Exploit sparseness; ground clauses lazily

LazySAT algorithm [Singla & Domingos, 2006]

Computing Probabilities

P(Formula | MLN, C) = ?

MCMC: Sample worlds, check that the formula holds

P(Formula1 | Formula2, MLN, C) = ?

If Formula2 = conjunction of ground atoms:

First construct the minimal subset of the network necessary to answer the query (a generalization of KBMC)

Then apply MCMC (or other inference)

Can also do lifted inference [Braz et al., 2005]

Ground Network Construction

    network ← Ø
    queue ← query nodes
    repeat
        node ← front(queue)
        remove node from queue
        add node to network
        if node not in evidence then
            add MB(node) to queue
    until queue = Ø

But … Insufficient for Logic

Problem: Deterministic dependencies break MCMC; near-deterministic ones make it very slow

Solution: Combine MCMC and WalkSAT

MC-SAT algorithm [Poon & Domingos, 2006]

Learning

Data is a relational database

Closed world assumption (if not: EM)

Learning parameters (weights)

Learning structure (formulas)

Parameter tying: groundings of the same clause

Generative learning: pseudo-likelihood

Discriminative learning: cond. likelihood; use MC-SAT or MaxWalkSAT for inference

Weight Learning

$$\frac{\partial}{\partial w_i} \log P_w(x) = n_i(x) - E_w[n_i(x)]$$

where n_i(x) is the no. of times clause i is true in the data, and E_w[n_i(x)] is the expected no. of times clause i is true according to the MLN
Structure Learning

Generalizes feature induction in Markov nets

Any inductive logic programming approach can be used,
but . . .

Goal is to induce any clauses, not just Horn

Evaluation function should be likelihood

Requires learning weights for each candidate

Turns out not to be bottleneck

Bottleneck is counting clause groundings

Solution: Subsampling

Structure Learning

Initial state: unit clauses or hand-coded KB

Operators: add/remove literal, flip sign

Evaluation function: pseudo-likelihood + structure prior

Search: beam, shortest-first, bottom-up [Kok & Domingos, 2005; Mihalkova & Mooney, 2007]

Alchemy

Open-source software including:

Full first-order logic syntax

Generative & discriminative weight learning

Structure learning

Weighted satisfiability and MCMC

Programming language features

alchemy.cs.washington.edu

|                | Alchemy                  | Prolog          | BUGS           |
|----------------|--------------------------|-----------------|----------------|
| Representation | F.O. logic + Markov nets | Horn clauses    | Bayes nets     |
| Inference      | Model checking, MC-SAT   | Theorem proving | Gibbs sampling |
| Learning       | Parameters & structure   | No              | Params.        |
| Uncertainty    | Yes                      | No              | Yes            |
| Relational     | Yes                      | Yes             | No             |