Finding Optimal Bayesian Networks with Greedy Search
Max Chickering
Outline
• Bayesian Network Definitions
• Learning
• Greedy Equivalence Search (GES)
• Optimality of GES
Bayesian Networks

  p(X1, …, Xn) = ∏_{i=1}^{n} p(Xi | Par_S(Xi))

Use B = (S, θ) to represent p(X1, …, Xn).
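As a concrete illustration of the factorization above, here is a minimal sketch for a hypothetical three-node network X → Y, X → Z with made-up CPT entries (the variable names and numbers are assumptions, not from the talk); the joint is recovered as a product of local conditionals:

```python
from itertools import product

# Hypothetical structure S: X -> Y, X -> Z (binary variables).
parents = {"X": [], "Y": ["X"], "Z": ["X"]}

# Made-up conditional probability tables, keyed by (variable, parent values).
cpt = {
    ("X", ()): {0: 0.6, 1: 0.4},      # p(X)
    ("Y", (0,)): {0: 0.9, 1: 0.1},    # p(Y | X=0)
    ("Y", (1,)): {0: 0.2, 1: 0.8},    # p(Y | X=1)
    ("Z", (0,)): {0: 0.7, 1: 0.3},    # p(Z | X=0)
    ("Z", (1,)): {0: 0.5, 1: 0.5},    # p(Z | X=1)
}

def joint(assign):
    """p(x, y, z) as the product of the local conditionals p(Xi | Par(Xi))."""
    p = 1.0
    for var, pars in parents.items():
        key = (var, tuple(assign[pa] for pa in pars))
        p *= cpt[key][assign[var]]
    return p

# The factorization defines a proper distribution: the joint sums to 1.
total = sum(joint(dict(zip("XYZ", vals))) for vals in product([0, 1], repeat=3))
```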
Markov Conditions

[Figure: node X with its parents Par, descendants Desc, and non-descendants ND]

From the factorization: I(X, ND(X) | Par(X))

Markov Conditions + Graphoid Axioms characterize all independencies.
Structure/Distribution Inclusion
p
is
included
in
S
if there exists
s.t. B(
S
,
) defines
p
X
Y
Z
p
All distributions
S
Structure/Structure Inclusion
T
≤
S
T
is
included
in
S
if every
p
included in
T
is included in
S
X
Y
Z
All distributions
X
Y
Z
S
T
(
S
is an I

map of
T
)
Structure/Structure Equivalence
T
S
X
Y
Z
All distributions
X
Y
Z
S
T
Reflexive, Symmetric, Transitive
Equivalence

[Figure: a DAG over A, B, C, D highlighting a v-structure]

Theorem (Verma and Pearl, 1990): S ≡ T iff S and T have the same v-structures and skeletons.

[Figure: the same DAG's skeleton, with edge directions dropped]
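The Verma-Pearl characterization yields a direct equivalence test. A minimal executable sketch (the edge-set representation and helper names are my own, not from the talk):

```python
def skeleton(dag):
    """Adjacencies of a DAG (set of directed edges) with directions dropped."""
    return {frozenset(e) for e in dag}

def v_structures(dag):
    """Unshielded colliders X -> Y <- Z with X and Z non-adjacent."""
    adj = skeleton(dag)
    nodes = {n for e in dag for n in e}
    out = set()
    for y in nodes:
        pars = sorted(x for (x, t) in dag if t == y)
        for i, x in enumerate(pars):
            for z in pars[i + 1:]:
                if frozenset((x, z)) not in adj:
                    out.add((frozenset((x, z)), y))
    return out

def equivalent(g, h):
    """Verma & Pearl (1990): equivalent iff same skeleton and v-structures."""
    return skeleton(g) == skeleton(h) and v_structures(g) == v_structures(h)

chain = {("A", "B"), ("B", "C")}      # A -> B -> C
fork = {("B", "A"), ("B", "C")}       # A <- B -> C
collider = {("A", "B"), ("C", "B")}   # A -> B <- C
```

The chain and the fork share a skeleton and have no v-structures, so they are equivalent; the collider has a v-structure at B and is not.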
Learning Bayesian Networks

1. Learn the structure
2. Estimate the conditional distributions

[Figure: generative distribution p* over X, Y, Z → iid samples → observed data table → learned model]
Learning Structure

• Scoring criterion F(D, S)
• Search procedure: identify one or more structures with high values for the scoring function
Properties of Scoring Criteria
• Consistent
• Locally Consistent
• Score Equivalent
Consistent Criterion

If S includes p* and T does not include p*:
  F(S, D) > F(T, D)

If both include p* and S has fewer parameters:
  F(S, D) > F(T, D)

The criterion favors (in the limit) the simplest model that includes the generative distribution p*.

[Figure: candidate structures over X, Y, Z compared against p*]
Locally Consistent Criterion

S and T differ by one edge between X and Y (absent in S, present in T):

  If I(X, Y | Par(X)) holds in p*, then F(S, D) > F(T, D)
  Otherwise, F(S, D) < F(T, D)
Score-Equivalent Criterion

If S ≡ T, then F(S, D) = F(T, D)

[Figure: equivalent structures S: X → Y and T: X ← Y]
Bayesian Criterion
(Consistent, locally consistent and score equivalent)

S^h: the hypothesis that the generative distribution p* has the same independence constraints as S.

  F_Bayes(S, D) = log p(S^h | D)
                = k + log p(D | S^h) + log p(S^h)

log p(D | S^h): marginal likelihood (closed form w/ assumptions)
log p(S^h): structure prior (e.g. prefer simple)
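The "closed form w/ assumptions" for the marginal likelihood is the Dirichlet-multinomial integral, computed per node under parameter independence. A sketch of one node's contribution (the function name and the uniform prior α are my assumptions):

```python
from math import lgamma, log

def log_marginal_family(counts, alpha=1.0):
    """One node's contribution to log p(D | S^h) under a Dirichlet(alpha)
    prior. `counts` maps each parent configuration to the child's value
    counts, e.g. {(): [1, 1]} for a root binary node seen once as 0, once as 1."""
    total = 0.0
    for cnts in counts.values():
        a_ij = alpha * len(cnts)          # total prior mass for this config
        n_ij = sum(cnts)
        total += lgamma(a_ij) - lgamma(a_ij + n_ij)
        for n in cnts:
            total += lgamma(alpha + n) - lgamma(alpha)
    return total

# Root binary node observed once as 0 and once as 1:
# sequentially p(D) = 1/2 * 1/3 = 1/6, so the log marginal is -log 6.
example = log_marginal_family({(): [1, 1]})
```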
Search Procedure
• Set of states
• Representation for the states
• Operators to move between states
• Systematic search algorithm
Greedy Equivalence Search
• Set of states: equivalence classes of DAGs
• Representation for the states: essential graphs
• Operators to move between states: forward and backward operators
• Systematic search algorithm: two-phase greedy
Representation: Essential Graphs

[Figure: a DAG over A, B, C, D, E, F and its essential graph, with compelled edges directed and reversible edges undirected]
GES Operators

Forward direction – single edge additions
Backward direction – single edge deletions
Two-Phase Greedy Algorithm

Phase 1: Forward Equivalence Search (FES)
• Start with the all-independence model
• Run greedy search using forward operators

Phase 2: Backward Equivalence Search (BES)
• Start with the local max from FES
• Run greedy search using backward operators
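The two phases can be sketched as score-based hill climbing. The toy below works in DAG space with a BIC score over binary variables rather than over equivalence classes of essential graphs, so it is only a rough stand-in for GES; all names, the scoring details, and the synthetic data are assumptions:

```python
import math
from itertools import product

def bic_family(data, child, pars):
    """Decomposable BIC term for one binary node: max log-likelihood of
    p(child | pars) minus (ln N / 2) * (number of parent configurations)."""
    counts = {}
    for row in data:
        key = tuple(row[p] for p in pars)
        counts.setdefault(key, [0, 0])[row[child]] += 1
    ll = 0.0
    for c0, c1 in counts.values():
        for c in (c0, c1):
            if c:
                ll += c * math.log(c / (c0 + c1))
    return ll - 0.5 * math.log(len(data)) * (2 ** len(pars))

def bic(data, variables, edges):
    return sum(
        bic_family(data, v, sorted(x for (x, y) in edges if y == v))
        for v in variables)

def acyclic(edges, variables):
    children = {v: [y for (x, y) in edges if x == v] for v in variables}
    seen, on_stack = set(), set()
    def dfs(v):
        if v in on_stack:
            return False          # back edge: cycle
        if v in seen:
            return True
        on_stack.add(v)
        ok = all(dfs(c) for c in children[v])
        on_stack.discard(v)
        seen.add(v)
        return ok
    return all(dfs(v) for v in variables)

def two_phase_greedy(data, variables):
    edges = set()
    # Phase 1 (cf. FES): from the empty graph, add the best-scoring edge.
    while True:
        cur = bic(data, variables, edges)
        best, best_gain = None, 1e-9
        for x, y in product(variables, repeat=2):
            if x == y or (x, y) in edges or (y, x) in edges:
                continue
            cand = edges | {(x, y)}
            if acyclic(cand, variables):
                gain = bic(data, variables, cand) - cur
                if gain > best_gain:
                    best, best_gain = (x, y), gain
        if best is None:
            break
        edges.add(best)
    # Phase 2 (cf. BES): delete edges while that improves the score.
    while True:
        cur = bic(data, variables, edges)
        best, best_gain = None, 1e-9
        for e in edges:
            gain = bic(data, variables, edges - {e}) - cur
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:
            break
        edges.remove(best)
    return edges

# Synthetic data: Y copies X; Z varies independently of both.
data = [{"X": i % 2, "Y": i % 2, "Z": (i // 2) % 2} for i in range(100)]
learned = two_phase_greedy(data, ["X", "Y", "Z"])
```

On this data the forward phase adds a single edge between X and Y (the BIC gain outweighs the penalty, while the independent Z never pays for an edge), and the backward phase removes nothing.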
Forward Operators
• Consider all DAGs in the current state
• For each DAG, consider all single-edge additions (acyclic)
• Take the union of the resulting equivalence classes
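The single-edge additions themselves are easy to enumerate at the DAG level; a sketch (GES proper then maps each result to its equivalence class and takes the union, which this omits):

```python
def acyclic(edges, nodes):
    """DFS check that a set of directed edges contains no cycle."""
    children = {v: [y for (x, y) in edges if x == v] for v in nodes}
    seen, on_stack = set(), set()
    def dfs(v):
        if v in on_stack:
            return False          # back edge: cycle
        if v in seen:
            return True
        on_stack.add(v)
        ok = all(dfs(c) for c in children[v])
        on_stack.discard(v)
        seen.add(v)
        return ok
    return all(dfs(v) for v in nodes)

def single_edge_additions(dag, nodes):
    """All acyclic DAGs reachable from `dag` by adding one directed edge."""
    out = []
    for x in nodes:
        for y in nodes:
            if x != y and (x, y) not in dag and (y, x) not in dag:
                cand = dag | {(x, y)}
                if acyclic(cand, nodes):
                    out.append(cand)
    return out

# From A -> B over {A, B, C}: four legal additions
# (A-C and B-C in either direction; B -> A would create a cycle).
neighbors = single_edge_additions({("A", "B")}, ["A", "B", "C"])
```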
Forward-Operators Example

[Figure: the current state (an essential graph over A, B, C); all DAGs in that state; all DAGs resulting from a single-edge addition; and the union of the corresponding essential graphs]
Backward Operators
• Consider all DAGs in the current state
• For each DAG, consider all single-edge deletions
• Take the union of the resulting equivalence classes
Backward-Operators Example

[Figure: the current state (an essential graph over A, B, C); all DAGs in that state; all DAGs resulting from a single-edge deletion; and the union of the corresponding essential graphs]
DAG Perfect

DAG-perfect distribution p: there exists a DAG G such that I(X, Y | Z) holds in p iff I(X, Y | Z) holds in G.

Non-DAG-perfect distribution q: for example, a distribution over A, B, C, D whose only independencies are I(A, D | B, C) and I(B, C | A, D).

[Figure: two candidate DAGs over A, B, C, D, each encoding only one of the two independencies]
DAG-Perfect Consequence:
Composition Axiom Holds in p*

If ¬I(X, Y | Z) for a set Y, then ¬I(X, Y | Z) for some singleton Y ∈ Y.

[Figure: Y = {A, B, C, D}; dependence of X on the set Y implies dependence of X on some singleton C]
Optimality of GES

If p* is DAG-perfect wrt some G*, then for large n the output S of GES equals S*.

[Figure: generative structure G* over X, Y, Z → n iid samples (observed data table) → GES → learned state S]
Optimality of GES: Proof Outline
• After the first phase (FES), the current state includes S*
• After the second phase (BES), the current state equals S*

All-independence → FES → state includes S* → BES → state equals S*
FES Maximum Includes S*

Assume: the local max does NOT include S*. Take any DAG G from the current state S.

Markov conditions characterize independencies: in p*, there exists X not independent of its non-descendants given its parents, e.g. ¬I(X, {A, B, C, D} | E) in p*.

p* is DAG-perfect, so the composition axiom holds: ¬I(X, C | E) in p* for some singleton C.

Local consistency: adding the C → X edge improves the score, and the resulting equivalence class is a neighbor.

[Figure: DAG with Par(X) = {E} and non-descendants A, B, C, D of X]
BES Identifies S*
• The current state always includes S*: local consistency of the criterion
• The local max is S*: Meek's conjecture
Meek’s Conjecture

For any pair of DAGs G, H such that H includes G (G ≤ H), there exists a sequence of
(1) covered-edge reversals in G, and
(2) single-edge additions to G,
such that after each change G ≤ H, and after all changes G = H.
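A covered edge, the key notion in the conjecture, can be checked mechanically. A small sketch (the edge-set representation and helper names are my own); reversing a covered edge is exactly the move that preserves the equivalence class:

```python
def parents(dag, v):
    """Parent set of v in a DAG given as a set of directed edges."""
    return {a for (a, b) in dag if b == v}

def covered(dag, x, y):
    """Edge x -> y is covered in dag iff Par(y) = Par(x) ∪ {x}."""
    return (x, y) in dag and parents(dag, y) == parents(dag, x) | {x}

def reverse_edge(dag, x, y):
    """Return the graph with x -> y replaced by y -> x."""
    return (dag - {(x, y)}) | {(y, x)}

# Complete DAG A -> B -> C, A -> C: here A -> B and B -> C are covered,
# while A -> C is not (C also has parent B, which A lacks).
g = {("A", "B"), ("A", "C"), ("B", "C")}
```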
Meek’s Conjecture: Example

[Figure: transforming G into H over A, B, C, D via covered-edge reversals and single-edge additions; G satisfies I(A, B) and I(C, B | A, D)]
Meek’s Conjecture and BES

Assume: local max S, not S*, with S* ≤ S. Take any DAG H from S and any DAG G from S*.

By Meek's conjecture, G transforms into H by covered-edge reversals (Rev) and single-edge additions (Add). Read backwards, H transforms into G by covered-edge reversals and single-edge deletions (Del); the first deletion in this reversed sequence yields a neighbor of S in BES that still includes S*, contradicting that S is a local max.

[Figure: G → Add/Rev/… → H, the reversed sequence H → Del/Rev/… → G, and the BES neighbor of S]
Discussion Points
• In practice, GES is as fast as DAG-based search: the neighborhood of an essential graph can be generated and scored very efficiently
• When the DAG-perfect assumption fails, we still get optimality guarantees: as long as composition holds in the generative distribution, the local maximum is inclusion-minimal
Thanks!

My Home Page: http://research.microsoft.com/~dmax

Relevant Papers:
“Optimal Structure Identification with Greedy Search” (JMLR submission): contains detailed proofs of Meek’s conjecture and the optimality of GES
“Finding Optimal Bayesian Networks” (UAI02 paper with Chris Meek): contains an extension of the optimality results of GES when not DAG perfect
Bayesian Criterion is Locally Consistent
• The Bayesian score approaches BIC + a constant
• BIC is decomposable:

  F_BIC(S, D) = Σ_{i=1}^{n} F(Xi, Par(Xi))

• The difference in score is the same for any DAGs that differ by the Y → X edge, provided X has the same parents

[Figure: two DAGs over X, Y differing only in the Y → X edge]
Bayesian Criterion is Consistent

(A complete network always includes p*.)

Assume the conditionals are: (1) unconstrained multinomials, or (2) linear regressions. Then network structures are curved exponential models, and the Bayesian criterion is consistent.

Geiger, Heckerman, King and Meek (2001); Haughton (1988)
Bayesian Criterion is Score Equivalent

If S ≡ T, then F(S, D) = F(T, D), because S^h = T^h.

Example: S: X → Y and T: X ← Y. S^h: no independence constraints; T^h: no independence constraints; hence S^h = T^h.
Active Paths

Z-active path between X and Y (non-standard):
1. Neither X nor Y is in Z
2. Every pair of colliding edges meets at a member of Z
3. No other pair of edges meets at a member of Z

[Figure: a path from X to Y with a collider at a member of Z]

G ≤ H: if there is a Z-active path between X and Y in G, then there is a Z-active path between X and Y in H.
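The (non-standard) definition above can be checked by brute-force path enumeration on small DAGs; a sketch with an assumed edge-set representation and my own helper names:

```python
def z_active_paths(dag, x, y, z):
    """Simple paths between x and y that are Z-active per the definition
    above: colliders exactly at members of z, endpoints outside z."""
    nodes = {n for e in dag for n in e}
    adj = {n: set() for n in nodes}
    for a, b in dag:
        adj[a].add(b)
        adj[b].add(a)
    found = []

    def is_active(path):
        if x in z or y in z:
            return False                       # condition 1
        for a, b, c in zip(path, path[1:], path[2:]):
            is_collider = (a, b) in dag and (c, b) in dag
            if is_collider != (b in z):        # conditions 2 and 3
                return False
        return True

    def extend(path):
        if path[-1] == y:
            if is_active(path):
                found.append(list(path))
            return
        for nxt in sorted(adj[path[-1]]):
            if nxt not in path:                # simple paths only
                path.append(nxt)
                extend(path)
                path.pop()

    if x != y:
        extend([x])
    return found

collider = {("A", "B"), ("C", "B")}   # A -> B <- C
chain = {("A", "B"), ("B", "C")}      # A -> B -> C
```

The collider path A-B-C is active only when B is in Z; the chain path is active only when B is not.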
Active Paths

[Figure: a Z-active path through X, A, Z, W, B, Y; a second path over A, B, C]

• X → Y: out-of X and in-to Y
• X — W: out-of both X and W
• Any sub-path between A, B ∉ Z is also active
• A – B and B – C with at least one edge out-of B: there is an active path between A and C
Simple Active Paths

[Figure: an active path between X and Y passing through A and B]

If there is an active path between X and Y, then there is one in which either:
(1) each edge appears exactly once, OR
(2) some edge appears exactly twice

Simplify discussion: assume (1) only – the proofs for (2) are almost identical.
Typical Argument: Combining Active Paths

[Figure: graphs G, H and G′, with active paths between A and B through X, Y, Z and a conditioning set CS]

G′: suppose there is an active path in G′ (with X not in CS) that has no corresponding active path in H. Then Z is not in CS, with Z a sink node adjacent to X and Y.
G ≤ H: Proof Sketch

Two DAGs G, H with G < H. Identify either:
(1) a covered edge X → Y in G that has the opposite orientation in H, or
(2) a new edge X → Y to be added to G such that G remains included in H
The Transformation

Choose any node Y that is a sink in H.

Case 1a: Y is a sink in G, and there is X ∈ Par_H(Y) with X ∉ Par_G(Y)
Case 1b: Y is a sink in G with the same parents in both graphs
Case 2a: there is an X s.t. Y → X is covered
Case 2b: there is an X s.t. Y → X, and W is a parent of Y but not of X
Case 2c: for every edge Y → X, Par(Y) ⊂ Par(X)

[Figure: the edge change made in each case on small graphs over W, X, Y]
Preliminaries
• The adjacencies in G are a subset of the adjacencies in H
• If X → Y ← Z is a v-structure in G but not in H, then X and Z are adjacent in H
• Any new active path that results from adding X → Y to G includes X → Y (G ≤ H)
Proof Sketch: Case 1

Y is a sink in G.

Case 1a: X ∈ Par_H(Y), X ∉ Par_G(Y). Add X → Y to G, so that both G and H contain X → Y.

Suppose there is some new active path between A and B not in H:
1. Y is a sink in G, so it must be in CS
2. Neither X nor the next node Z is in CS
3. In H: AP(A, Z), AP(X, B), and Z → Y ← X combine into the path in H after all

Case 1b: the parents are identical. Remove Y from both graphs; the proof is similar.
Proof Sketch: Case 2

Case 2a: there is a covered edge Y → X: reverse the edge.

Case 2b: there is a non-covered edge Y → X such that W is a parent of Y but not a parent of X. Add W → X (giving G′).

[Figure: the relevant fragments of G, G′ and H over W, X, Y and Z]

Y is not a sink in G. Suppose there is some new active path between A and B not in H:
• Y must be in CS, else replace W → X by W → Y → X (not new)
• If X is not in CS, then the following are active in H: A … W, X … B, and W → Y ← X, which combine into an active path in H
Case 2c: The Difficult Case

All non-covered edges Y → Z have Par(Y) ⊂ Par(Z).

[Figure: G and H over W1, Z1, Y, Z2, W2]

Adding W1 → Y: G is no longer < H (there is a Z2-active path between W1 and W2).
Adding W2 → Y: G < H still holds.
Choosing Z

[Figure: G and H, with D_Y = the descendants of Y in G]

D is the maximal G-descendant of Y in H.
Z is any maximal child of Y such that D is a descendant of Z in G.
Choosing Z: Example

[Figure: G and H over W1, Z1, Y, Z2, W2]

Descendants of Y in G: Y, Z1, Z2
Maximal descendant in H: D = Z2
Maximal child of Y in G that has D = Z2 as a descendant: Z2
Add W2 → Y
Difficult Case: Proof Intuition

[Figure: G and H over W, Y, Z, D, with endpoints A and B and conditioning set CS]

1. W is not in CS
2. Y is not in CS, else the path is active in H
3. In G, the next edges must be directed away from Y until B or CS is reached
4. In G, neither Z nor its descendants are in CS, else the path was active before the addition
5. From (1), (2) and (4): active paths (A, D) and (B, D) in H
6. By the choice of D: there is a directed path from D to B or to CS in H
Optimality of GES

Definition: p is DAG-perfect wrt G if the independence constraints in p are precisely those in G.

Assumption: the generative distribution p* is perfect wrt some G* defined over the observable variables.

S* = the equivalence class containing G*.

Under the DAG-perfect assumption, GES results in S*.
Important Definitions
• Bayesian Networks
• Markov Conditions
• Distribution/Structure Inclusion
• Structure/Structure Inclusion