Finding Optimal Bayesian
Networks with Greedy Search

Max Chickering

Outline

Bayesian-Network Definitions

Learning

Greedy Equivalence Search (GES)

Optimality of GES

Bayesian Networks

p(X_1, …, X_n | S) = ∏_{i=1}^{n} p(X_i | Par_i(X_i))

Use B = (S, θ) to represent p(X_1, …, X_n)
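The factorization on this slide can be sketched in code; the chain structure and CPT numbers below are illustrative, not from the talk:

```python
# Minimal sketch of p(X1,...,Xn | S) = prod_i p(Xi | Par(Xi)).
# Structure S: X -> Y -> Z over binary variables; CPT entries are made up.
parents = {"X": [], "Y": ["X"], "Z": ["Y"]}

# cpt[node][parent_values] -> p(node = 1 | parents)
cpt = {
    "X": {(): 0.3},
    "Y": {(0,): 0.2, (1,): 0.9},
    "Z": {(0,): 0.5, (1,): 0.1},
}

def joint(assign):
    """Probability of a full assignment under the factorization."""
    p = 1.0
    for node, pars in parents.items():
        key = tuple(assign[q] for q in pars)
        p1 = cpt[node][key]
        p *= p1 if assign[node] == 1 else 1.0 - p1
    return p

# The joint sums to 1 over all 8 assignments.
total = sum(joint({"X": x, "Y": y, "Z": z})
            for x in (0, 1) for y in (0, 1) for z in (0, 1))
```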

Markov Conditions

Each X is independent of its non-descendants (ND) given its parents (Par)

From factorization: I(X, ND | Par(X))

Markov Conditions + Graphoid Axioms characterize all independencies

Structure/Distribution Inclusion

p is included in S if there exists θ s.t. B(S, θ) defines p

(Figure: within the set of all distributions, those included in S form a subset.)

Structure/Structure Inclusion

T is included in S if every p included in T is included in S
(S is an I-map of T)

(Figure: the distributions included in T form a subset of those included in S.)

Structure/Structure Equivalence

S and T are equivalent if each is included in the other: they include exactly the same distributions

Reflexive, Symmetric, Transitive

Equivalence

V-structure: A → D ← B with A and B non-adjacent

Theorem (Verma and Pearl, 1990): S and T are equivalent iff they have the same v-structures and skeletons

Skeleton: the structure with edge directions dropped
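The Verma and Pearl characterization yields a direct equivalence test; a minimal sketch with DAGs stored as parent sets (representation and names are my own):

```python
# Check Markov equivalence of two DAGs via the Verma-Pearl theorem:
# same skeleton and same v-structures. DAGs are dicts: node -> set of parents.

def skeleton(dag):
    return {frozenset((x, p)) for x in dag for p in dag[x]}

def v_structures(dag):
    """Triples (a, c, b) with a -> c <- b and a, b non-adjacent."""
    skel = skeleton(dag)
    vs = set()
    for c, pars in dag.items():
        for a in pars:
            for b in pars:
                if a < b and frozenset((a, b)) not in skel:
                    vs.add((a, c, b))
    return vs

def equivalent(g, h):
    return skeleton(g) == skeleton(h) and v_structures(g) == v_structures(h)

# X -> Y -> Z and X <- Y <- Z are equivalent; X -> Y <- Z is not.
g1 = {"X": set(), "Y": {"X"}, "Z": {"Y"}}
g2 = {"X": {"Y"}, "Y": {"Z"}, "Z": set()}
g3 = {"X": set(), "Y": {"X", "Z"}, "Z": set()}
```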

Learning Bayesian Networks

1. Learn the structure
2. Estimate the conditional distributions

Generative distribution p* → iid samples → Observed Data → Learned Model

(Figure: a table of binary samples over X, Y, Z drawn from p*, and the learned network over X, Y, Z.)

Learning Structure

Scoring criterion F(D, S)

Search procedure: identify one or more structures with high values for the scoring function

Properties of Scoring Criteria

Consistent

Locally Consistent

Score Equivalent

Consistent Criterion

If S includes p* and T does not include p*: F(S, D) > F(T, D)

If both include p* and S has fewer parameters: F(S, D) > F(T, D)

The criterion favors (in the limit) the simplest model that includes the generative distribution p*

Locally Consistent Criterion

S and T differ by one edge (S omits the edge between X and Y that T contains):

If I(X, Y | Par(X)) in p*, then F(S, D) > F(T, D)

Otherwise, F(S, D) < F(T, D)

Score-Equivalent Criterion

If S and T are equivalent: F(S, D) = F(T, D)

(e.g., S: X → Y and T: X ← Y)

Bayesian Criterion

(Consistent, locally consistent and score equivalent)

S^h: hypothesis that the generative distribution p* has the same independence constraints as S.

F_Bayes(S, D) = log p(S^h | D) = k + log p(D | S^h) + log p(S^h)

log p(D | S^h): marginal likelihood (closed form w/ assumptions)
log p(S^h): structure prior (e.g. prefer simple)
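Under the usual Dirichlet-multinomial assumptions, the marginal likelihood term has a closed form; a sketch for a single binary variable with a symmetric Beta prior (the prior strength alpha is an illustrative choice, not the talk's):

```python
from math import lgamma

def log_marginal_binary(n1, n0, alpha=1.0):
    """log p(D | S^h) for one binary variable under a Beta(alpha, alpha)
    prior: Gamma(2a)/Gamma(2a+n) * Gamma(a+n1) Gamma(a+n0) / Gamma(a)^2."""
    n = n1 + n0
    return (lgamma(2 * alpha) - lgamma(2 * alpha + n)
            + lgamma(alpha + n1) - lgamma(alpha)
            + lgamma(alpha + n0) - lgamma(alpha))

# With alpha = 1 (uniform prior) the marginal likelihood of any sequence of
# n1 ones and n0 zeros is n1! n0! / (n+1)!, e.g. n1 = n0 = 1 gives 1/6.
```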

Search Procedure

Set of states

Representation for the states

Operators to move between states

Systematic Search Algorithm

Greedy Equivalence Search

Set of states: equivalence classes of DAGs

Representation for the states: essential graphs

Operators to move between states: forward and backward operators

Systematic search algorithm: two-phase greedy

Representation: Essential Graphs

Compelled edges: oriented the same way in every DAG of the class
Reversible edges: oriented differently in some DAGs of the class

(Figure: a DAG over A, B, C, D, E, F and its essential graph, with compelled edges directed and reversible edges undirected.)

GES Operators

Forward direction: single-edge additions
Backward direction: single-edge deletions

Two-Phase Greedy Algorithm

Phase 1: Forward Equivalence Search (FES)
Start with the all-independence model
Run Greedy using forward operators

Phase 2: Backward Equivalence Search (BES)
Run Greedy using backward operators
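A schematic version of the two-phase algorithm, hill-climbing with single-edge additions and then deletions; for brevity this sketch moves between DAGs rather than equivalence classes / essential graphs, and the decomposable scoring function is supplied by the caller:

```python
# Two-phase greedy structure search in the spirit of GES (simplified to DAGs).
from itertools import permutations

def creates_cycle(parents, x, y):
    """Would adding x -> y create a directed cycle (is y an ancestor of x)?"""
    stack, seen = [x], set()
    while stack:
        v = stack.pop()
        if v == y:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def greedy_phase(parents, score, forward):
    """Apply the best single-edge addition (forward) or deletion until stuck."""
    improved = True
    while improved:
        improved = False
        best, best_delta = None, 0.0
        for x, y in permutations(parents, 2):
            if forward and x not in parents[y] and not creates_cycle(parents, x, y):
                delta = score(y, parents[y] | {x}) - score(y, parents[y])
                op = (x, y, True)
            elif not forward and x in parents[y]:
                delta = score(y, parents[y] - {x}) - score(y, parents[y])
                op = (x, y, False)
            else:
                continue
            if delta > best_delta:
                best, best_delta = op, delta
        if best:
            x, y, add = best
            parents[y] = parents[y] | {x} if add else parents[y] - {x}
            improved = True
    return parents

def two_phase_search(nodes, score):
    parents = {v: set() for v in nodes}          # all-independence model
    greedy_phase(parents, score, forward=True)   # FES-like phase
    greedy_phase(parents, score, forward=False)  # BES-like phase
    return parents

# Toy decomposable score (hypothetical): rewards matching a target parent set.
TARGET = {"A": set(), "B": {"A"}, "C": {"B"}}
def toy_score(y, pars):
    return -len(pars.symmetric_difference(TARGET[y]))

result = two_phase_search(["A", "B", "C"], toy_score)
```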

Forward Operators

Consider all DAGs in the current state

For each DAG, consider all single-edge additions

Take the union of the resulting equivalence classes

Forward-Operators Example

(Figure: current state over A, B, C; all DAGs in the class; all DAGs resulting from single-edge additions; union of the corresponding essential graphs.)

Backward Operators

Consider all DAGs in the current state

For each DAG, consider all single-edge deletions

Take the union of the resulting equivalence classes

Backward-Operators Example

(Figure: current state over A, B, C; all DAGs in the class; all DAGs resulting from single-edge deletions; union of the corresponding essential graphs.)

DAG Perfect

DAG-perfect distribution p: there exists a DAG G with I(X, Y | Z) in p ⇔ I(X, Y | Z) in G

Non-DAG-perfect distribution q: the only independencies are I(A, D | B, C) and I(B, C | A, D)

(Figure: diamond graphs over A, B, C, D; each encodes one of the two independencies but not both.)

DAG-Perfect Consequence:
Composition Axiom Holds in p*

If ¬I(X, Y | Z) then ¬I(X, Y | Z) for some singleton Y ∈ Y

(Figure: X dependent on the set {A, B, C, D}; composition yields a single member, e.g. C, with X dependent on C.)

Optimality of GES

If p* is DAG-perfect wrt some G*:

Generative structure G* → iid samples → observed data → GES → S

For large n, S = S*
(S* is the equivalence class of G*)

Optimality of GES

Proof Outline

After first phase (FES), current state includes S*
After second phase (BES), the current state = S*

All-independence → FES → state includes S* → BES → state equals S*

FES Maximum Includes S*

Assume: local max does NOT include S*. Take any DAG G from the current state S.

Markov conditions characterize independencies: in p*, there exists X not independent of its non-descendants given its parents, e.g. ¬I(X, {A, B, C, D} | E)

p* is DAG-perfect, so the composition axiom holds: some single non-descendant remains dependent, e.g. ¬I(X, C | E)

Adding the C → X edge improves the score, and the resulting equivalence class is a neighbor, contradicting that S is a local maximum

BES Identifies S*

Current state always includes S*: local consistency of the criterion

Local maximum is S*: Meek’s conjecture

Meek’s Conjecture

For any pair of DAGs G, H such that H includes G (G ≤ H), there exists a sequence of

(1) covered edge reversals in G
(2) single-edge additions to G

such that after each change G ≤ H, and after all changes G = H
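A covered edge Y → X is one whose endpoints have the same parents apart from Y itself, so reversing it stays in the same equivalence class; a small sketch (parent-set representation assumed):

```python
# An edge y -> x is covered when Par(x) = Par(y) ∪ {y}; reversing a covered
# edge yields an equivalent DAG (same skeleton, same v-structures).
def is_covered(parents, y, x):
    return x in parents and y in parents[x] and parents[x] == parents[y] | {y}

def reverse_edge(parents, y, x):
    """Return a new parent map with y -> x reversed to x -> y."""
    new = {v: set(ps) for v, ps in parents.items()}
    new[x].discard(y)
    new[y].add(x)
    return new

# In A -> Y, A -> X, Y -> X the edge Y -> X is covered; in the v-structure
# Y -> X <- A (no A-Y edge) it is not, and reversing it would change the class.
g = {"A": set(), "Y": {"A"}, "X": {"A", "Y"}}
h = {"A": set(), "Y": set(), "X": {"A", "Y"}}
```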

Meek’s Conjecture

(Figure: example DAGs G and H over A, B, C, D, with independencies I(A, B) and I(C, B | A, D); G is transformed into H step by step.)

Meek’s Conjecture and BES

Assume: local max S is not S*

Take any DAG H from S and any DAG G from S*; since the state includes S*, G ≤ H

By Meek’s conjecture, H can be reduced to G by covered-edge reversals and single-edge deletions

The class reached by the first deletion is a neighbor of S in BES; by local consistency it scores higher, contradicting that S is a local maximum

Discussion Points

In practice, GES is as fast as DAG-based search

Neighborhood of essential graphs can be generated and scored very efficiently

When the DAG-perfect assumption fails, we still get optimality guarantees

As long as composition holds in the generative distribution, a local maximum is inclusion-minimal

Thanks!

http://research.microsoft.com/~dmax

Relevant Papers:

“Optimal Structure Identification with Greedy Search” (JMLR submission): contains detailed proofs of Meek’s conjecture and optimality of GES

“Finding Optimal Bayesian Networks” (UAI02 paper with Chris Meek): contains extension of optimality results of GES when not DAG-perfect

Bayesian Criterion is Locally Consistent

Bayesian score approaches BIC + constant

BIC is decomposable:

BIC(S, D) = Σ_{i=1}^{n} F(X_i, Par(X_i), D)

Difference in score is the same for any DAGs that differ by a Y → X edge, if X has the same parents

Complete network (always includes p*)
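The decomposed BIC can be computed family by family; a sketch for discrete data (the toy data and arities below are made up):

```python
# Per-family BIC term F(X_i, Par(X_i), D) for discrete data, so that
# BIC(S, D) = sum_i F(X_i, Par(X_i), D).
from collections import Counter
from math import log

def bic_family(data, child, parents, arity):
    """Max log-likelihood of p(child | parents) minus the BIC penalty."""
    n = len(data)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    marg = Counter(tuple(r[p] for p in parents) for r in data)
    ll = sum(c * log(c / marg[pv]) for (pv, _), c in joint.items())
    q = 1
    for p in parents:                    # number of parent configurations
        q *= arity[p]
    return ll - 0.5 * log(n) * q * (arity[child] - 1)

def bic(data, parents_map, arity):
    return sum(bic_family(data, x, ps, arity) for x, ps in parents_map.items())

# Y copies X in this toy data, so the family score prefers X as a parent.
data = [{"X": 0, "Y": 0}, {"X": 0, "Y": 0}, {"X": 1, "Y": 1}, {"X": 1, "Y": 1}]
arity = {"X": 2, "Y": 2}
with_parent = bic_family(data, "Y", ("X",), arity)
without = bic_family(data, "Y", (), arity)
```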

Bayesian Criterion is Consistent

Assume conditionals are:
(1) unconstrained multinomials, or
(2) linear regressions

Network structures = curved exponential models

Bayesian Criterion is consistent

Geiger, Heckerman, King and Meek (2001); Haughton (1988)

Bayesian Criterion is Score Equivalent

If S and T are equivalent: F(S, D) = F(T, D)

e.g., S: X → Y and T: X ← Y

S^h = T^h: each hypothesis imposes no independence constraints

Active Paths

Z-active path between X and Y (non-standard):

1. Neither X nor Y is in Z
2. Every pair of colliding edges meets at a member of Z
3. No other pair of edges meets at a member of Z

G ≤ H: if there is a Z-active path between X and Y in G, then there is a Z-active path between X and Y in H

Active Paths

(Figure: path X-A-Z-W-B-Y and triple A-B-C.)

X-Y: out-of X and in-to Y
X-W: out-of both X and W

Any sub-path between A, B ∉ Z is also active

For edges A-B and B-C, at least one is out-of B, so there is an active path between A and C

Simple Active Paths

If there is an active path between X and Y, there is one in which each edge appears either

(1) exactly once, OR
(2) exactly twice

Simplify discussion: assume (1) only; proofs for (2) are almost identical

Typical Argument:
Combining Active Paths

Suppose there is an active path in G’ (with X not in the conditioning set CS) that has no corresponding active path in H; then Z (a sink node) is not in CS

(Figure: active paths among X, Y, Z, A, B in G, H, and G’.)

Proof Sketch

Two DAGs G, H with G < H. Identify either:

(1) a covered edge X → Y in G that has the opposite orientation in H

(2) a new edge X → Y to add to G such that G remains included in H

The Transformation

Choose any node Y that is a sink in H

Case 1a: Y is a sink in G, and there exists X ∈ Par_H(Y) with X ∉ Par_G(Y)

Case 1b: Y is a sink in G with the same parents in both graphs

Case 2a: there exists X s.t. Y → X is covered

Case 2b: there exists X s.t. Y → X is not covered and some W is a parent of Y but not of X

Case 2c: every edge Y → X has Par(Y) ⊂ Par(X)

(Figure: the local configurations around X, Y, W in each case.)

Preliminaries

The adjacencies in G are a subset of the adjacencies in H

If X → Y ← Z is a v-structure in G but not in H, then X and Z are adjacent in H

Any new active path that results from adding X → Y to G includes X → Y (given G ≤ H)

Proof Sketch: Case 1

Y is a sink in G

Case 1a: X ∈ Par_H(Y), X ∉ Par_G(Y): add X → Y to G

Case 1b: parents identical: remove Y from both graphs; proof similar

For 1a (H: X → Y; G’: X → Y added), suppose there is some new active path between A and B not in H:

1. Y is a sink in G, so it must be in CS
2. Neither X nor the next node Z is in CS
3. In H, there are active paths AP(A, Z) and AP(X, B), and Z → Y ← X: combine them

Proof Sketch: Case 2

Case 2a: there is a covered edge Y → X: reverse the edge

Case 2b: there is a non-covered edge Y → X such that W is a parent of Y but not a parent of X: add W → X

Y must be in CS, else replace W → X by W → Y → X (not new)

If X is not in CS, then in H the following are active: A-W, X-B, and W → Y ← X

Y is not a sink in G. Suppose there is some new active path between A and B not in H

(Figure: configurations of X, W, Y, Z in G, G’, and H.)

Case 2c: The Difficult Case

All non-covered edges Y → Z have Par(Y) ⊂ Par(Z)

(Figure: G and H over W1, Z1, Y, Z2, W2.)

Adding W1 → Y: G no longer ≤ H ({Z2}-active path between W1 and W2)

Adding W2 → Y: G ≤ H

Choosing Z

D: the descendant of Y in G that is maximal in H (the maximal G-descendant in H)

Z: any maximal child of Y such that D is a descendant of Z in G

Choosing Z

(Figure: G and H over W1, Z1, Y, Z2, W2.)

Descendants of Y in G: Y, Z1, Z2

Maximal descendant in H: D = Z2

Maximal child of Y in G that has D = Z2 as descendant: Z = Z2

Difficult Case: Proof Intuition

1. W is not in CS
2. Y is not in CS, else the path is active in H
3. In G, the next edges must be directed away from Y until B or CS is reached
4. In G, neither Z nor its descendants are in CS, else the path was active before the addition
5. From (1, 2, 4), there are active paths AP(A, D) and AP(B, D) in H
6. Choice of D: there is a directed path from D to B or to CS in H

(Figure: G and H around D, Y, Z, W, with endpoints A and B.)

Optimality of GES

Definition: p is DAG-perfect wrt G: the independence constraints in p are precisely those in G

Assumption: the generative distribution p* is perfect wrt some G* defined over the observable variables

S* = equivalence class containing G*

Under the DAG-perfect assumption, GES results in S*

Important Definitions

Bayesian Networks

Markov Conditions

Distribution/Structure Inclusion

Structure/Structure Inclusion