Undirected Probabilistic Graphical Models (Markov Nets)
(Slides from Sam Roweis' lecture)
Connection to MCMC:
• MCMC requires sampling a node given its Markov blanket
• Need to use P(x | MB(x))
• For Bayes nets, MB(x) contains more nodes than are mentioned in the local distribution CPT(x)
• For Markov nets, the Markov blanket is just the neighbors, because the neighbor relation is symmetric: nodes x_i and x_j are both neighbors of each other..

In contrast, note that in Bayes nets, CPTs can be filled with any real numbers between 0 and 1 (as long as each row sums to 1), and we can be sure the ensuing product will define a valid joint distribution!
12/2
• All project presentations on 12/14 (10 min each)
• All project reports due on 12/14
• On 12/7, we will read and discuss the MLN paper
• Today: complete discussion of Markov nets; start towards MLNs
Nodes: A, B, C, D (pairwise factors)

Qn: What is the most likely configuration of A&B?

Moral: Factors are not marginals!
Although A,B would like to agree, B&C need to agree, C&D need to disagree, and D&A need to agree, and the latter three have higher weights!
Mr. & Mrs. Smith example

Okay, you convinced me that given any potentials we will have a consistent joint. But given any joint, will there be potentials I can provide?
Hammersley-Clifford theorem…

We can have potentials on any cliques, not just the maximal ones. So, for example, we can have a potential on A in addition to the other four pairwise potentials.
Markov Networks
• Undirected graphical models

  Example variables: Smoking, Cancer, Asthma, Cough

• Potential functions defined over cliques

  Smoking  Cancer  Ф(S,C)
  False    False   4.5
  False    True    4.5
  True     False   2.7
  True     True    4.5

      P(x) = (1/Z) ∏_c Ф_c(x_c),   where   Z = ∑_x ∏_c Ф_c(x_c)
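The joint above can be computed by brute force for small nets. A minimal sketch, using the Ф(S,C) table from this slide as the single clique potential (variable names are illustrative):

```python
import itertools

# Ф(Smoking, Cancer) from the table above
PHI = {
    (False, False): 4.5,
    (False, True): 4.5,
    (True, False): 2.7,
    (True, True): 4.5,
}

# Partition function: Z = sum over all assignments of the product of potentials
# (here there is a single clique, so the product is just one table lookup).
Z = sum(PHI[(s, c)] for s, c in itertools.product([False, True], repeat=2))

def joint(s, c):
    """P(S=s, C=c) = Ф(s, c) / Z for this single-clique net."""
    return PHI[(s, c)] / Z

# The normalized probabilities sum to 1 by construction.
total = sum(joint(s, c) for s, c in PHI)
```

Note that Z = 16.2 here; the potentials are arbitrary positive numbers, and normalization is what turns them into a distribution.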
Log-Linear models for Markov Nets

(Nodes: A, B, C, D)

Factors are "functions" over their domains.
Log-linear model consists of
• Features f_i(D_i) (functions over domains)
• Weights w_i for features

s.t.  P(x) = (1/Z) exp(∑_i w_i f_i(x))

Without loss of generality!
Markov Networks
• Undirected graphical models
• Log-linear model:

      P(x) = (1/Z) exp(∑_i w_i f_i(x))

  w_i: weight of feature i;  f_i: feature i

  Example (variables: Smoking, Cancer, Asthma, Cough):

      f_1(Smoking, Cancer) = 1 if ¬Smoking ∨ Cancer, 0 otherwise
      w_1 = 1.5
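The "without loss of generality" claim can be checked concretely: any positive potential table can be rewritten in log-linear form by giving each table row an indicator feature with weight log Ф(row). A small sketch using the earlier Ф(S,C) table (names are illustrative):

```python
import math

# Potential table from the earlier slide
PHI = {(False, False): 4.5, (False, True): 4.5,
       (True, False): 2.7, (True, True): 4.5}

def unnormalized(s, c):
    # exp( sum_i w_i f_i(x) ) with one indicator feature per table row,
    # where w_row = log Ф(row) and f_row(x) = 1 iff x matches that row.
    return math.exp(sum(math.log(v) * ((s, c) == k) for k, v in PHI.items()))

# unnormalized(s, c) reproduces Ф(s, c) exactly for every row, so the two
# parameterizations define the same joint distribution.
```

This is why tabular potentials and log-linear features are interchangeable representations of the same Markov net.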
Markov Nets vs. Bayes Nets

Property         Markov Nets         Bayes Nets
Form             Prod. potentials    Prod. potentials
Potentials       Arbitrary           Cond. probabilities
Cycles           Allowed             Forbidden
Partition func.  Z = ? (global)      Z = 1 (local)
Indep. check     Graph separation    D-separation
Indep. props.    Some                Some
Inference        MCMC, BP, etc.      Convert to Markov
Inference in Markov Networks
• Goal: Compute marginals & conditionals of

      P(X) = (1/Z) exp(∑_i w_i f_i(X)),   where   Z = ∑_X exp(∑_i w_i f_i(X))

• Exact inference is #P-complete
• Most BN inference approaches work for MNs too
  – Variable Elimination used factor multiplication, and should work without change..
• Conditioning on the Markov blanket is easy:

      P(x_i | MB(x_i)) = exp(∑_j w_j f_j(x)) / (exp(∑_j w_j f_j(x[x_i=0])) + exp(∑_j w_j f_j(x[x_i=1])))

• Gibbs sampling exploits this
MCMC: Gibbs Sampling

state ← random truth assignment
for i ← 1 to num-samples do
    for each variable x
        sample x according to P(x | neighbors(x))
        state ← state with new value of x
P(F) ← fraction of states in which F is true
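The loop above can be sketched as runnable Python for a toy two-variable net, using the single feature f_1 = 1 if ¬S ∨ C with weight 1.5 from the earlier slide (variable names and sample counts are illustrative):

```python
import itertools
import math
import random

W1 = 1.5

def score(state):
    # sum_i w_i f_i(x) for the single feature f1 = 1 if (not S) or C
    return W1 * ((not state["S"]) or state["C"])

def sample_var(state, var, rng):
    """Sample var from its Markov-blanket conditional P(var | rest)."""
    weights = []
    for val in (False, True):
        trial = dict(state, **{var: val})
        weights.append(math.exp(score(trial)))
    p_true = weights[1] / (weights[0] + weights[1])
    return rng.random() < p_true

rng = random.Random(0)
state = {"S": rng.random() < 0.5, "C": rng.random() < 0.5}
count_c = 0
N = 20000
for _ in range(N):            # Gibbs sweeps
    for var in ("S", "C"):
        state[var] = sample_var(state, var, rng)
    count_c += state["C"]
p_c_gibbs = count_c / N       # estimated P(C = True)

# Exact answer by enumeration, for comparison on this tiny model.
zs = {(s, c): math.exp(score({"S": s, "C": c}))
      for s, c in itertools.product((False, True), repeat=2)}
Z = sum(zs.values())
p_c_exact = (zs[(False, True)] + zs[(True, True)]) / Z
```

On a net this small, exact enumeration is cheap; the point of Gibbs is that each update only ever touches the Markov blanket, which stays cheap even when enumeration does not.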
Other Inference Methods
• Many variations of MCMC
• Belief propagation (sum-product)
• Variational approximation
• Exact methods
Learning Markov Networks
• Learning parameters (weights)
  – Generatively
  – Discriminatively
• Learning structure (features)
• Easy case: assume complete data
  (If not: EM versions of algorithms)

Entanglement in log likelihood… (example: a chain over nodes a, b, c)
Learning for log-linear formulation
• Use gradient ascent
• Unimodal, because the Hessian is the covariance matrix over features
• What is the expected value of the feature given the current parameterization of the network?
  – Requires inference to answer
    (inference at every iteration; sort of like EM)
Why should we spend so much time computing the gradient?
• Given that the gradient is used only in the gradient ascent iteration, it might look as if we should just be able to approximate it in any which way
  – After all, we are going to take a step with some arbitrary step size anyway..
• ..But the thing to keep in mind is that the gradient is a vector. We are talking not just of magnitude but of direction. A mistake in magnitude can change the direction of the vector and push the search in a completely wrong direction…
Generative Weight Learning
• Maximize likelihood or posterior probability
• Numerical optimization (gradient or 2nd order)
• No local maxima
• Requires inference at each step (slow!)

      ∂/∂w_i log P_w(x) = n_i(x) − E_w[n_i(x)]

  n_i(x): no. of times feature i is true in the data
  E_w[n_i(x)]: expected no. of times feature i is true according to the model

      P(X) = (1/Z) exp(∑_i w_i f_i(X)),   Z = ∑_X exp(∑_i w_i f_i(X))
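The gradient above can be exercised on the smallest possible model: one Boolean variable X with a single feature f_1(X) = X. This is a sketch with illustrative data and step size; the expected count needs exact inference, which here is a two-term sum:

```python
import math

def expected_count(w):
    # E_w[f1] under P_w(X) ∝ exp(w * f1(X)), by explicit enumeration
    num = sum(x * math.exp(w * x) for x in (0, 1))
    Z = sum(math.exp(w * x) for x in (0, 1))
    return num / Z

def grad(w, data):
    # per-example gradient: n_i(x) − E_w[n_i(x)], averaged over the data
    empirical = sum(data) / len(data)
    return empirical - expected_count(w)

# Gradient ascent drives the model's expected count toward the
# empirical feature frequency (0.75 for this toy data set).
w = 0.0
data = [1, 1, 1, 0]
for _ in range(200):
    w += 1.0 * grad(w, data)
```

At the optimum the model expectation matches the empirical count, i.e. σ(w) = 0.75, so w converges to log 3; this moment-matching fixed point is exactly what the gradient formula encodes.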
Alternative Objectives to maximize..
• Since log-likelihood requires network inference to compute the derivative, we might want to focus on other objectives whose gradients are easier to compute (and which also, hopefully, have optima at the same parameter values).
• Two options:
  – Pseudo-Likelihood
  – Contrastive Divergence

Given a single data instance x, the log-likelihood is

      log P(x) = ∑_i w_i f_i(x) − log Z

i.e., the log prob of the data minus the log prob of all other possible data instances (w.r.t. the current parameters θ).
– Contrastive Divergence: maximize the distance ("increase the divergence"); pick a sample of typical other instances (need to sample from P_θ; run MCMC initializing with the data..)
– Pseudo-Likelihood: compute the likelihood of each possible data instance just using its Markov blanket (approximate chain rule)
Pseudo-Likelihood

      PL(x) ≡ ∏_i P(x_i | neighbors(x_i))

• Likelihood of each variable given its neighbors in the data
• Does not require inference at each step
• Consistent estimator
• Widely used in vision, spatial statistics, etc.
• But PL parameters may not work well for long inference chains
  [which can lead to disastrous results]
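A minimal sketch of why pseudo-likelihood avoids the partition function, again on the toy two-variable model with feature f_1 = 1 if ¬S ∨ C and weight 1.5 (names are illustrative): each conditional only ever compares the two values of one variable, so Z never appears.

```python
import math

W1 = 1.5

def score(s, c):
    # sum_i w_i f_i for the single feature f1 = 1 if (not s) or c
    return W1 * ((not s) or c)

def cond(var, s, c):
    """P(var's observed value | the other variable's observed value)."""
    if var == "S":
        num = math.exp(score(s, c))
        den = math.exp(score(False, c)) + math.exp(score(True, c))
    else:
        num = math.exp(score(s, c))
        den = math.exp(score(s, False)) + math.exp(score(s, True))
    return num / den

def pseudo_likelihood(s, c):
    # PL(x) = product over variables of P(x_i | neighbors(x_i))
    return cond("S", s, c) * cond("C", s, c)
```

Each factor normalizes over just one variable's values given its Markov blanket, which is why the gradient of log PL is cheap, and also why PL can mis-estimate behavior along long inference chains that the local conditionals never see.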
Discriminative Weight Learning
• Maximize conditional likelihood of query (y) given evidence (x)

      ∂/∂w_i log P_w(y | x) = n_i(x, y) − E_w[n_i(x, y)]

  n_i(x, y): no. of true groundings of clause i in the data
  E_w[n_i(x, y)]: expected no. of true groundings according to the model
• Approximate expected counts by counts in the MAP state of y given x
Structure Learning
• How to learn the structure of a Markov network?
  – … not too different from learning structure for a Bayes network: discrete search through the space of possible graphs, trying to maximize data probability….
MLNs: Points to ponder
• Compared to ground representations, MLNs have easier learning but equally hard inference
  – MLNs need to learn significantly fewer parameters than a ground network of similar size
  – MLNs may be compelled to exploit the "relational" structure and thus may spend time inventing lifted inference methods
• Inference approaches
• Learning
  – Parameters
    • Why Pseudo-Likelihood?
  – Structure: implies learning clauses.. (what ILP does)
• Connection to Dynamic Bayes Nets?
• Relational
Markov Logic: Intuition
• A logical KB is a set of hard constraints on the set of possible worlds
• Let's make them soft constraints: when a world violates a formula, it becomes less probable, not impossible
• Give each formula a weight (Higher weight ⇒ Stronger constraint)

      P(world) ∝ exp(∑ weights of formulas it satisfies)
Markov Logic: Definition
• A Markov Logic Network (MLN) is a set of pairs (F, w) where
  – F is a formula in first-order logic
  – w is a real number
• Together with a set of constants, it defines a Markov network with
  – One node for each grounding of each predicate in the MLN
  – One feature for each grounding of each formula F in the MLN, with the corresponding weight w
Example: Friends & Smokers

"Smoking causes cancer."
"Friends have similar smoking habits."
Example: Friends & Smokers

∀x Smokes(x) ⇒ Cancer(x)
∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))
Example: Friends & Smokers

1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))
Example: Friends & Smokers

1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

Two constants: Anna (A) and Bob (B)
Example: Friends & Smokers

1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

Two constants: Anna (A) and Bob (B)

Ground nodes: Smokes(A), Smokes(B), Cancer(A), Cancer(B)
Example: Friends & Smokers

1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

Two constants: Anna (A) and Bob (B)

Ground nodes: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)
Markov Logic Networks
• MLN is a template for ground Markov nets
• Probability of a world x:

      P(x) = (1/Z) exp(∑_i w_i n_i(x))

  w_i: weight of formula i;  n_i(x): no. of true groundings of formula i in x
• Typed variables and constants greatly reduce the size of the ground Markov net
• Functions, existential quantifiers, etc.
• Infinite and continuous domains
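The grounding counts n_i(x) can be computed directly for the two-constant Friends & Smokers example above. A sketch (the particular example world is illustrative):

```python
import itertools

CONSTS = ["A", "B"]  # Anna and Bob

def n_smoking_cancer(world):
    # true groundings of: Smokes(x) => Cancer(x)
    return sum((not world[("Smokes", x)]) or world[("Cancer", x)]
               for x in CONSTS)

def n_friends(world):
    # true groundings of: Friends(x,y) => (Smokes(x) <=> Smokes(y))
    return sum((not world[("Friends", x, y)])
               or (world[("Smokes", x)] == world[("Smokes", y)])
               for x, y in itertools.product(CONSTS, repeat=2))

def log_weight(world, w1=1.5, w2=1.1):
    # log of the world's unnormalized probability: sum_i w_i n_i(x)
    return w1 * n_smoking_cancer(world) + w2 * n_friends(world)

# Example world: both smoke, only Anna has cancer, everyone is friends.
world = {("Smokes", "A"): True, ("Smokes", "B"): True,
         ("Cancer", "A"): True, ("Cancer", "B"): False}
for x, y in itertools.product(CONSTS, repeat=2):
    world[("Friends", x, y)] = True
```

Here the smoking-causes-cancer clause has 1 true grounding (Bob violates it) and the friendship clause has all 4, so the world's log-weight is 1.5·1 + 1.1·4 = 5.9; normalizing over all worlds would give its probability.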
Relation to Statistical Models
• Special cases:
  – Markov networks
  – Markov random fields
  – Bayesian networks
  – Log-linear models
  – Exponential models
  – Max. entropy models
  – Gibbs distributions
  – Boltzmann machines
  – Logistic regression
  – Hidden Markov models
  – Conditional random fields
• Obtained by making all predicates zero-arity
• Markov logic allows objects to be interdependent (non-i.i.d.)
Relation to First-Order Logic
• Infinite weights ⇒ first-order logic
• Satisfiable KB, positive weights ⇒ satisfying assignments = modes of distribution
• Markov logic allows contradictions between formulas
MAP/MPE Inference
• Problem: Find most likely state of world given evidence

      arg max_y P(y | x)

  (y: query, x: evidence)
MAP/MPE Inference
• Problem: Find most likely state of world given evidence

      arg max_y P(y | x)
        = arg max_y (1/Z_x) exp(∑_i w_i n_i(x, y))
        = arg max_y ∑_i w_i n_i(x, y)

• This is just the weighted MaxSAT problem
• Use weighted SAT solver (e.g., MaxWalkSAT [Kautz et al., 1997])
• Potentially faster than logical inference (!)
The MaxWalkSAT Algorithm

for i ← 1 to max-tries do
    solution ← random truth assignment
    for j ← 1 to max-flips do
        if ∑ weights(sat. clauses) > threshold then
            return solution
        c ← random unsatisfied clause
        with probability p
            flip a random variable in c
        else
            flip the variable in c that maximizes ∑ weights(sat. clauses)
return failure, best solution found
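The pseudocode above can be sketched as runnable Python. This is a toy sketch, not a production solver: a clause is a (weight, literals) pair, a literal is (var, wanted_value), and the parameter defaults are illustrative.

```python
import random

def sat_weight(clauses, assign):
    # total weight of clauses satisfied by the assignment
    return sum(w for w, lits in clauses
               if any(assign[v] == val for v, val in lits))

def max_walk_sat(clauses, variables, max_tries=10, max_flips=1000,
                 target=None, p=0.5, rng=None):
    rng = rng or random.Random(0)
    total = sum(w for w, _ in clauses)
    target = total if target is None else target   # threshold
    best, best_w = None, -1.0
    for _ in range(max_tries):
        assign = {v: rng.random() < 0.5 for v in variables}
        for _ in range(max_flips):
            w = sat_weight(clauses, assign)
            if w > best_w:
                best, best_w = dict(assign), w
            if w >= target:
                return best
            # pick a random unsatisfied clause
            unsat = [lits for wt, lits in clauses
                     if not any(assign[v] == val for v, val in lits)]
            lits = rng.choice(unsat)
            if rng.random() < p:            # random-walk move
                var = rng.choice(lits)[0]
            else:                           # greedy move
                var = max((v for v, _ in lits), key=lambda v: sat_weight(
                    clauses, dict(assign, **{v: not assign[v]})))
            assign[var] = not assign[var]
    return best   # failure: best assignment found

# (x or y) weight 2.0, and (not x or y) weight 1.0: both hold when y is True.
clauses = [(2.0, [("x", True), ("y", True)]),
           (1.0, [("x", False), ("y", True)])]
sol = max_walk_sat(clauses, ["x", "y"])
```

The mix of random and greedy flips is what lets the search escape local maxima while still climbing the satisfied-weight objective.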
But … Memory Explosion
• Problem: If there are n constants and the highest clause arity is c, the ground network requires O(n^c) memory
• Solution: Exploit sparseness; ground clauses lazily
  → LazySAT algorithm [Singla & Domingos, 2006]
Computing Probabilities
• P(Formula | MLN, C) = ?
• MCMC: Sample worlds, check whether the formula holds
• P(Formula1 | Formula2, MLN, C) = ?
• If Formula2 = conjunction of ground atoms
  – First construct the minimal subset of the network necessary to answer the query (generalization of KBMC)
  – Then apply MCMC (or other)
• Can also do lifted inference [Braz et al., 2005]
Ground Network Construction

network ← Ø
queue ← query nodes
repeat
    node ← front(queue)
    remove node from queue
    add node to network
    if node not in evidence then
        add neighbors(node) to queue
until queue = Ø
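The construction above is a breadth-first traversal that stops expanding at evidence nodes. A minimal sketch, where `neighbors` is an adjacency map and the node names are illustrative:

```python
from collections import deque

def build_network(query_nodes, neighbors, evidence):
    """Grow the minimal subnetwork needed to answer the query."""
    network = set()
    queue = deque(query_nodes)
    while queue:
        node = queue.popleft()
        if node in network:
            continue
        network.add(node)
        # evidence nodes are included but not expanded past
        if node not in evidence:
            queue.extend(n for n in neighbors.get(node, [])
                         if n not in network)
    return network

# "far" sits behind the evidence node "e", so it never enters the network.
adj = {"q": ["a", "e"], "a": ["q", "b"], "e": ["q", "far"], "b": ["a"]}
net = build_network(["q"], adj, evidence={"e"})
```

Stopping at evidence is what makes the resulting network minimal: anything d-connected to the query only through observed nodes cannot affect the answer.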
But … Insufficient for Logic
• Problem: Deterministic dependencies break MCMC; near-deterministic ones make it very slow
• Solution: Combine MCMC and WalkSAT
  → MC-SAT algorithm [Poon & Domingos, 2006]
Learning
• Data is a relational database
• Closed world assumption (if not: EM)
• Learning parameters (weights)
• Learning structure (formulas)
• Parameter tying: groundings of the same clause
• Generative learning: pseudo-likelihood
• Discriminative learning: cond. likelihood, using MC-SAT or MaxWalkSAT for inference
Weight Learning

      ∂/∂w_i log P_w(x) = n_i(x) − E_w[n_i(x)]

  n_i(x): no. of times clause i is true in the data
  E_w[n_i(x)]: expected no. of times clause i is true according to the MLN
Structure Learning
•
Generalizes feature induction in Markov nets
•
Any inductive logic programming approach can be used,
but . . .
•
Goal is to induce any clauses, not just Horn
•
Evaluation function should be likelihood
•
Requires learning weights for each candidate
•
Turns out not to be bottleneck
•
Bottleneck is counting clause groundings
•
Solution: Subsampling
Structure Learning
• Initial state: unit clauses or hand-coded KB
• Operators: add/remove literal, flip sign
• Evaluation function: pseudo-likelihood + structure prior
• Search: beam, shortest-first, bottom-up
  [Kok & Domingos, 2005; Mihalkova & Mooney, 2007]
Alchemy
Open-source software including:
• Full first-order logic syntax
• Generative & discriminative weight learning
• Structure learning
• Weighted satisfiability and MCMC
• Programming language features
alchemy.cs.washington.edu
                Alchemy                    Prolog            BUGS
Representation  F.O. Logic + Markov nets   Horn clauses      Bayes nets
Inference       Model checking, MC-SAT     Theorem proving   Gibbs sampling
Learning        Parameters & structure     No                Params.
Uncertainty     Yes                        No                Yes
Relational      Yes                        Yes               No