Finding Optimal Bayesian
Networks with Greedy Search


Max Chickering

Outline

- Bayesian-Network Definitions
- Learning
- Greedy Equivalence Search (GES)
- Optimality of GES

Bayesian Networks

Factorization according to structure S:

p(X1, …, Xn) = ∏i=1..n p(Xi | Par(Xi)),  where Par(Xi) are the parents of Xi in S

Use B = (S, Θ) to represent p(X1, …, Xn)
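
As an aside (not part of the original slides): a minimal sketch of evaluating this factorization for a toy network X → Y → Z; the variable names and CPT numbers are made-up assumptions.

```python
# Minimal sketch (illustrative only): evaluate p(x, y, z) = p(x) * p(y | x) * p(z | y)
# for a toy network X -> Y -> Z with binary variables. CPT numbers are made up.
p_x = {0: 0.7, 1: 0.3}                                     # p(X)
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # p(Y | X)
p_z_given_y = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}   # p(Z | Y)

def joint(x, y, z):
    """Joint probability from the network factorization."""
    return p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]

print(joint(1, 1, 0))  # 0.3 * 0.8 * 0.1 = 0.024
```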

Markov Conditions

[Figure: a node X with its parents Par, descendants Desc, and non-descendants ND]

From factorization: I(X, ND | Par(X))

Markov Conditions + Graphoid Axioms characterize all independencies

Structure/Distribution Inclusion

p is included in S if there exists Θ s.t. B(S, Θ) defines p

[Figure: a structure S over X, Y, Z and the set of all distributions it can define, with p among them]

Structure/Structure Inclusion

T is included in S if every p included in T is included in S
(S is an I-map of T)

[Figure: structures T and S over X, Y, Z; the distributions defined by T are a subset of those defined by S]

Structure/Structure Equivalence

T ≡ S if each structure is included in the other, i.e. they define the same set of distributions

[Figure: structures T and S over X, Y, Z defining the same set of distributions]

Reflexive, Symmetric, Transitive

Equivalence

Theorem (Verma and Pearl, 1990):
S ≡ T  ⟺  S and T have the same v-structures and skeletons

[Figure: a DAG over A, B, C, D, its v-structure, and its skeleton]
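
Not from the slides: a small sketch of this equivalence test over plain parent-set dictionaries; the helper names and example graphs are assumptions for illustration.

```python
# Sketch of the Verma-Pearl test: two DAGs are equivalent iff they share
# the same skeleton and the same v-structures. A DAG is {node: set_of_parents}.

def skeleton(dag):
    """Undirected adjacencies of the DAG."""
    return {frozenset((x, p)) for x, ps in dag.items() for p in ps}

def v_structures(dag):
    """Triples (a, y, b): a -> y <- b with a and b non-adjacent."""
    skel, vs = skeleton(dag), set()
    for y, parents in dag.items():
        for a in parents:
            for b in parents:
                if a < b and frozenset((a, b)) not in skel:
                    vs.add((a, y, b))
    return vs

def equivalent(g, h):
    return skeleton(g) == skeleton(h) and v_structures(g) == v_structures(h)

g = {"A": set(), "B": {"A"}, "C": {"B"}}   # A -> B -> C
h = {"A": {"B"}, "B": {"C"}, "C": set()}   # A <- B <- C
print(equivalent(g, h))  # True: same skeleton, no v-structures in either
```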

Learning Bayesian Networks

1. Learn the structure
2. Estimate the conditional distributions

[Figure: generative distribution p* over X, Y, Z → iid samples (observed data table) → learned model]

Learning Structure


- Scoring criterion F(D, S)
- Search procedure: identify one or more structures with high values for the scoring function

Properties of Scoring Criteria


- Consistent
- Locally Consistent
- Score Equivalent

Consistent Criterion

- If S includes p* and T does not include p*, then F(S, D) > F(T, D)
- If both include p* and S has fewer parameters, then F(S, D) > F(T, D)

The criterion favors (in the limit) the simplest model that includes the generative distribution p*

[Figure: candidate structures over X, Y, Z compared against p*]

Locally Consistent Criterion

S and T differ by one edge (T has the extra edge into X):

- If I(X, Y | Par(X)) holds in p*, then F(S, D) > F(T, D)
- Otherwise, F(S, D) < F(T, D)

[Figure: S and T over X and Y, differing only in the edge into X]

Score-Equivalent Criterion

S ≡ T  ⟹  F(S, D) = F(T, D)

[Figure: equivalent structures S: X → Y and T: X ← Y]

Bayesian Criterion

(Consistent, locally consistent, and score equivalent)

S^h: the hypothesis that the generative distribution p* has the same independence constraints as S

F_Bayes(S, D) = log p(S^h | D)
             = k + log p(D | S^h) + log p(S^h)

log p(D | S^h): marginal likelihood (closed form w/ assumptions)
log p(S^h): structure prior (e.g. prefer simple)
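
Not part of the slides: a rough sketch of the closed-form marginal likelihood for one multinomial family under a Dirichlet prior (the usual Bayesian-Dirichlet computation); the uniform prior strength and the function name are assumptions. The full network score would sum such terms over families and add the log structure prior.

```python
# Rough sketch (assumption, not the slides' code): log marginal likelihood
# for a single family X_i | Par(X_i) with multinomial conditionals and a
# Dirichlet(alpha, ..., alpha) prior per parent configuration, using
#   sum_j [ lnG(a_j) - lnG(a_j + N_j) + sum_k ( lnG(alpha + N_jk) - lnG(alpha) ) ]
from math import lgamma

def family_log_marginal(counts, alpha=1.0):
    """counts[j][k] = number of cases with parent config j and X_i = k."""
    total = 0.0
    for row in counts:
        a_j, n_j = alpha * len(row), sum(row)
        total += lgamma(a_j) - lgamma(a_j + n_j)
        total += sum(lgamma(alpha + n) - lgamma(alpha) for n in row)
    return total

# Example: binary X_i with one binary parent, made-up counts.
print(family_log_marginal([[30, 10], [5, 25]]))
```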

Search Procedure


- Set of states
- Representation for the states
- Operators to move between states
- Systematic search algorithm

Greedy Equivalence Search


- Set of states: equivalence classes of DAGs
- Representation for the states: essential graphs
- Operators to move between states: forward and backward operators
- Systematic search algorithm: two-phase greedy

Representation: Essential Graphs

[Figure: a DAG over A, B, C, D, E, F and its essential graph]

- Compelled edges (directed in the essential graph: same orientation in every DAG of the class)
- Reversible edges (undirected in the essential graph)

GES Operators

- Forward direction: single edge additions
- Backward direction: single edge deletions

Two-Phase Greedy Algorithm

Phase 1: Forward Equivalence Search (FES)
- Start with the all-independence model
- Run greedy search using forward operators

Phase 2: Backward Equivalence Search (BES)
- Start with the local max from FES
- Run greedy search using backward operators
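
To make the control flow concrete (an illustration, not the original implementation): a minimal sketch of the two-phase loop, where forward_neighbors, backward_neighbors and score stand in for the operator generators and the scoring criterion F(S, D).

```python
# Minimal sketch of the two-phase GES loop. `forward_neighbors`,
# `backward_neighbors` and `score` are placeholders for the operator
# generators and the scoring criterion described in the slides.

def greedy_phase(state, neighbors, score):
    """Hill-climb: move to the best neighbor until no neighbor improves the score."""
    current, best = state, score(state)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(current):
            s = score(cand)
            if s > best:
                current, best, improved = cand, s, True
    return current

def ges(empty_state, forward_neighbors, backward_neighbors, score):
    state = greedy_phase(empty_state, forward_neighbors, score)   # Phase 1: FES
    return greedy_phase(state, backward_neighbors, score)         # Phase 2: BES
```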

Forward Operators


- Consider all DAGs in the current state
- For each DAG, consider all single-edge additions (acyclic)
- Take the union of the resulting equivalence classes (see the sketch below)
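
A sketch (with illustrative helper names) of enumerating the acyclic single-edge additions for one member DAG, represented as a parent-set dictionary; collecting the results into equivalence classes is omitted.

```python
# Sketch: all DAGs reachable from `dag` by one acyclic edge addition,
# where `dag` maps each node to its set of parents. Illustrative only.

def has_directed_path(dag, src, dst):
    """True if a directed path src -> ... -> dst exists."""
    children = {n: set() for n in dag}
    for child, parents in dag.items():
        for p in parents:
            children[p].add(child)
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(children[n])
    return False

def single_edge_additions(dag):
    for x in dag:
        for y in dag:
            if x != y and x not in dag[y] and y not in dag[x]:
                if not has_directed_path(dag, y, x):   # x -> y keeps the graph acyclic
                    new = {n: set(ps) for n, ps in dag.items()}
                    new[y].add(x)
                    yield new
```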


Forward-Operators Example

[Figure: the current state over A, B, C; its member DAGs; all DAGs resulting from a single-edge addition; and the union of the corresponding essential graphs]

Forward-Operators Example (continued)

[Figure: the union of the corresponding essential graphs, i.e. the forward neighbors of the current state]

Backward Operators


- Consider all DAGs in the current state
- For each DAG, consider all single-edge deletions
- Take the union of the resulting equivalence classes


Backward-Operators Example

[Figure: the current state over A, B, C; its member DAGs; all DAGs resulting from a single-edge deletion; and the union of the corresponding essential graphs]

Backward-Operators Example (continued)

[Figure: the union of the corresponding essential graphs, i.e. the backward neighbors of the current state]

DAG Perfect

DAG-perfect distribution p: there exists a DAG G such that I(X, Y | Z) holds in p ⟺ I(X, Y | Z) holds in G

Non-DAG-perfect distribution q: satisfies both I(A, D | B, C) and I(B, C | A, D)

[Figure: two DAGs over A, B, C, D, each encoding only one of the two independencies]

DAG-Perfect Consequence: Composition Axiom Holds in p*

If ¬I(X, Y | Z) for a set Y, then ¬I(X, Y | Z) for some singleton Y ∈ Y

[Figure: X dependent on the set Y = {A, B, C, D} given Z implies X dependent on some single element of Y]

Optimality of GES

If p* is DAG-perfect wrt some G*:

[Figure: G* (over X, Y, Z) generates p*; n iid samples are observed; GES applied to the data returns a state S]

For large n, S = S*

Optimality of GES

Proof Outline

- After the first phase (FES), the current state includes S*
- After the second phase (BES), the current state = S*

All-independence  →(FES)→  state includes S*  →(BES)→  state equals S*

FES Maximum Includes S*

Assume: the local max does NOT include S*

Take any DAG G from the current state S. Markov conditions characterize independencies, so in p* there exists an X that is not independent of its non-descendants given its parents.

[Figure: a DAG over A, B, C, D, E, X in which E is the parent of X]

¬I(X, {A, B, C, D} | E) in p*

p* is DAG-perfect ⟹ the composition axiom holds ⟹ ¬I(X, C | E) in p*

Locally consistent: adding the C → X edge improves the score, and the resulting equivalence class is a neighbor

BES Identifies S*


- Current state always includes S*: local consistency of the criterion
- Local maximum is S*: Meek’s conjecture

Meek’s Conjecture

For any pair of DAGs G, H such that H includes G (G ≤ H), there exists a sequence of

(1) covered edge reversals in G
(2) single-edge additions to G

such that after each change G ≤ H, and after all changes G = H

Meek’s Conjecture (example)

[Figure: DAGs G and H over A, B, C, D, with independencies I(A, B) and I(C, B | A, D), and the intermediate graphs in the transformation from G to H]

Meek’s Conjecture and BES

Assume: the local max S is not S*

- Take any DAG H from S and any DAG G from S*
- G ≤ H, so by Meek’s conjecture there is a sequence of covered edge reversals and single-edge additions leading from G to H
- Reversing that sequence gives single-edge deletions and covered edge reversals leading from H back to G; the state reached after the first deletion/reversal is a neighbor of S in BES

[Figure: S* and S, the Add/Rev sequence from G to H, the reversed Del/Rev sequence from H to G, and the first reversed step as a neighbor of S in BES]

Discussion Points


- In practice, GES is as fast as DAG-based search
- The neighborhood of an essential graph can be generated and scored very efficiently
- When the DAG-perfect assumption fails, we still get optimality guarantees: as long as composition holds in the generative distribution, the local maximum is inclusion-minimal

Thanks!

My Home Page:
http://research.microsoft.com/~dmax

Relevant Papers:

"Optimal Structure Identification with Greedy Search" (JMLR submission)
Contains detailed proofs of Meek’s conjecture and the optimality of GES

"Finding Optimal Bayesian Networks" (UAI 2002 paper with Chris Meek)
Contains an extension of the optimality results of GES when p* is not DAG-perfect


Bayesian Criterion is Locally Consistent

- The Bayesian score approaches BIC + constant
- BIC is decomposable:

  F_BIC(S, D) = ∑i=1..n F(Xi, Par(Xi), D)

- The difference in score is the same for any DAGs that differ by a Y → X edge, provided X has the same parents

[Figure: two DAGs over X and Y differing only in the Y → X edge]

Complete network (always includes p*)
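
As an aside (illustrative, not the slides' code): a small decomposable BIC-style score for multinomial data, showing why two DAGs that differ only by one Y → X edge differ only in the X family's term; the function and variable names are mine.

```python
# Sketch of a decomposable BIC: the network score is a sum of per-family
# terms, F_BIC(S, D) = sum_i F(X_i, Par(X_i), D). Illustrative only.
from math import log
from collections import Counter

def family_bic(data, child, parents):
    """BIC term for one family: max log-likelihood minus (log N / 2) * #params."""
    n = len(data)
    joint = Counter(tuple(row[p] for p in parents) + (row[child],) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(c * log(c / parent_counts[key[:-1]]) for key, c in joint.items())
    child_states = len({row[child] for row in data})
    num_params = len(parent_counts) * (child_states - 1)
    return loglik - 0.5 * log(n) * num_params

def bic(data, dag):
    """dag maps each variable to its set of parents; data is a list of dicts."""
    return sum(family_bic(data, x, sorted(ps)) for x, ps in dag.items())
```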

Bayesian Criterion is Consistent

Assume the conditionals are:
(1) unconstrained multinomials
(2) linear regressions

Then network structures are curved exponential models, and the Bayesian criterion is consistent.

Geiger, Heckerman, King and Meek (2001)
Haughton (1988)

Bayesian Criterion is Score Equivalent

S ≡ T  ⟹  F(S, D) = F(T, D)

[Figure: S: X → Y and T: X ← Y]

S^h = T^h
S^h: no independence constraints
T^h: no independence constraints

Active Paths

Z-active path between X and Y (non-standard definition):

1. Neither X nor Y is in Z
2. Every pair of colliding edges meets at a member of Z
3. No other pair of edges meets at a member of Z

[Figure: graphs G and H with a Z-active path between X and Y]

If there is a Z-active path between X and Y in G, then there is a Z-active path between X and Y in H

Active Paths

[Figure: example paths over X, A, Z, W, B, Y and a segment A – B – C]

- X – Y: out of X and in to Y
- X – W: out of both X and W
- Any sub-path between A, B ∉ Z is also active
- If there are active paths A – B and B – C, and at least one is out of B, then there is an active path between A and C

Simple Active Paths

[Figure: active paths between A and B containing the edge Y → X]

If an active path between A and B contains Y → X, then there is an active path in which either
(1) the edge appears exactly once, OR
(2) the edge appears exactly twice

Simplify discussion: assume (1) only; the proofs for (2) are almost identical

Typical Argument: Combining Active Paths

[Figure: graphs G, H, and G', with Z a sink node adjacent to X and Y]

Suppose there is an active path in G' (with X not in the conditioning set) that has no corresponding active path in H. Then Z is not in the conditioning set.

Proof Sketch

Two DAGs G, H with G < H.

Identify either:
(1) a covered edge X → Y in G that has the opposite orientation in H
(2) a new edge X → Y to be added to G such that G remains included in H

The Transformation

Choose any node Y that is a sink in H.

Case 1a: Y is a sink in G, and ∃ X ∈ Par_H(Y) with X ∉ Par_G(Y)
Case 1b: Y is a sink in G, with the same parents in both graphs
Case 2a: ∃ X s.t. Y → X is covered
Case 2b: ∃ X s.t. Y → X, and some W is a parent of Y but not of X
Case 2c: for every Y → X, Par(Y) ⊆ Par(X)

[Figure: the local configurations around Y in G and H for each case]

Preliminaries


- The adjacencies in G are a subset of the adjacencies in H
- If X → Y ← Z is a v-structure in G but not in H, then X and Z are adjacent in H
- Any new active path that results from adding X → Y to G includes X → Y (G ≤ H)

Proof Sketch: Case 1

Y is a sink in G.

Case 1a: ∃ X ∈ Par_H(Y) with X ∉ Par_G(Y): add the edge X → Y to G
Case 1b: parents identical: remove Y from both graphs; the proof is similar

For Case 1a, suppose there is some new active path between A and B that is not active in H:

1. Y is a sink in G, so it must be in the conditioning set
2. Neither X nor the next node Z on the path is in the conditioning set
3. In H, the active paths AP(A, Z) and AP(X, B) combine with Z → Y ← X into an active path between A and B

[Figure: the new active path through X → Y in G and the corresponding combined path in H]

Proof Sketch: Case 2

Case 2a: there is a covered edge Y → X: reverse the edge

Case 2b: there is a non-covered edge Y → X such that W is a parent of Y but not a parent of X: add the edge W → X

[Figure: G, G' (with the W → X edge added), and H, over W, Y, X, Z]

Y is not a sink in G. Suppose there is some new active path between A and B that is not active in H:

- Y must be in the conditioning set, else replace W → X by W → Y → X (the path is not new)
- If X is not in the conditioning set, then in H the paths A – W and X – B are active, together with W → Y ← X

Case 2c: The Difficult Case

All non-covered edges Y → Z have Par(Y) ⊆ Par(Z)

[Figure: G and H over W1, Z1, Y, Z2, W2]

- Adding W1 → Y: G is no longer < H (there is a Z2-active path between W1 and W2)
- Adding W2 → Y: G < H still holds

Choosing Z

[Figure: the descendants of Y in G and the corresponding nodes in H]

- D is the maximal G-descendant of Y in H
- Z is any maximal child of Y such that D is a descendant of Z in G
Choosing Z

[Figure: G and H over W1, Z1, Y, Z2, W2]

- Descendants of Y in G: Y, Z1, Z2
- Maximal descendant in H: D = Z2
- Maximal child of Y in G that has D = Z2 as a descendant: Z2
- Add W2 → Y

Difficult Case: Proof Intuition

[Figure: G and H with the added edge W → Y, the chosen child Z, the maximal descendant D, and a candidate active path between A and B]

1. W is not in the conditioning set
2. Y is not in the conditioning set, else the path is active in H
3. In G, the next edges must point away from Y until B or the conditioning set is reached
4. In G, neither Z nor its descendants are in the conditioning set, else the path was active before the addition
5. From (1), (2), (4): there are active paths (A, D) and (B, D) in H
6. By the choice of D: there is a directed path from D to B or to the conditioning set in H

Optimality of GES

Definition: p is DAG-perfect wrt G if the independence constraints in p are precisely those in G

Assumption: the generative distribution p* is perfect wrt some G* defined over the observable variables

S* = the equivalence class containing G*

Under the DAG-perfect assumption, GES results in S*

Important Definitions


- Bayesian Networks
- Markov Conditions
- Distribution/Structure Inclusion
- Structure/Structure Inclusion