
Introduction to Probabilistic
Graphical Models

Eran Segal


Weizmann Institute

Logistics


Staff:


Instructor


Eran Segal (eran.segal@weizmann.ac.il, room 149)


Teaching Assistants


Ohad Manor (ohad.manor@weizmann.ac.il, room 125)


Course information:


WWW: http://www.weizmann.ac.il/math/pgm


Course book:


“Bayesian Networks and Beyond”, Daphne Koller (Stanford) & Nir Friedman (Hebrew U.)

Course structure


One weekly meeting


Sun: 9am-11am


Homework assignments


2 weeks to complete each


40% of final grade


Final exam


3 hour class exam, date will be announced


60% of final grade

Probabilistic Graphical Models



Tool for representing complex systems and
performing sophisticated reasoning tasks



Fundamental notion:

Modularity


Complex systems are built by combining simpler parts



Why have a model?


Compact and modular representation of complex systems


Ability to execute complex reasoning patterns


Make predictions


Generalize from particular problems

Probabilistic Graphical Models



Increasingly important in Machine Learning



Many classical probabilistic problems in statistics,
information theory, pattern recognition, and statistical
mechanics are special cases of the formalism


Graphical models provide a common framework


Advantage: specialized techniques developed in one field
can be transferred between research communities

Representation: Graphs



Intuitive data structure for modeling highly-interacting sets of variables



Explicit model for modularity



Data structure that allows for design of efficient general-purpose algorithms

Reasoning: Probability Theory



Well understood framework for
modeling uncertainty


Partial knowledge of the state of the world


Noisy observations


Phenomenon not covered by our model


Inherent stochasticity



Clear semantics



Can be learned from data

Probabilistic Reasoning


In this course we will learn:


Semantics of probabilistic graphical models (PGMs)


Bayesian networks


Markov networks


Answering queries in a PGM (“inference”)


Learning PGMs from data (“learning”)


Modeling temporal processes with PGMs


Hidden Markov Models (HMMs) as a special case

Course Outline

Week  Topic                                           Reading
1     Introduction, Bayesian network representation   1-3
2     Bayesian network representation cont.           1-3
3     Local probability models                        5
4     Undirected graphical models                     4
5     Exact inference                                 9,10
6     Exact inference cont.                           9,10
7     Approximate inference                           12
8     Approximate inference cont.                     12
9     Learning: Parameters                            16,17
10    Learning: Parameters cont.                      16,17
11    Learning: Structure                             18
12    Partially observed data                         19
13    Learning undirected graphical models            20
14    Template models                                 6
15    Dynamic Bayesian networks                       15

A Simple Example


We want to model whether our neighbor will inform
us of the alarm being set off


The alarm can be set off if


There is a burglary


There is an earthquake


Whether our neighbor calls depends on whether the
alarm is set off

A Simple Example


Variables


Earthquake (E), Burglary (B), Alarm (A), NeighborCalls (N)

E  B  A  N  | Prob.
F  F  F  F  | 0.01
F  F  F  T  | 0.04
F  F  T  F  | 0.05
F  F  T  T  | 0.01
F  T  F  F  | 0.02
F  T  F  T  | 0.07
F  T  T  F  | 0.2
F  T  T  T  | 0.1
T  F  F  F  | 0.01
T  F  F  T  | 0.07
T  F  T  F  | 0.13
T  F  T  T  | 0.04
T  T  F  F  | 0.06
T  T  F  T  | 0.05
T  T  T  F  | 0.1
T  T  T  T  | 0.05

2^4 - 1 = 15 independent parameters
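A minimal Python sketch (not part of the original slides) of answering a query by summing entries of the explicit joint above; the 0/1 encoding of T/F is my own choice.

```python
# A minimal sketch (not from the slides): answering a query directly from the
# explicit joint distribution above.  Keys are (E, B, A, N) with 1 for T, 0 for F;
# the probabilities are the illustrative values from the table.
joint = {
    (0, 0, 0, 0): 0.01, (0, 0, 0, 1): 0.04, (0, 0, 1, 0): 0.05, (0, 0, 1, 1): 0.01,
    (0, 1, 0, 0): 0.02, (0, 1, 0, 1): 0.07, (0, 1, 1, 0): 0.20, (0, 1, 1, 1): 0.10,
    (1, 0, 0, 0): 0.01, (1, 0, 0, 1): 0.07, (1, 0, 1, 0): 0.13, (1, 0, 1, 1): 0.04,
    (1, 1, 0, 0): 0.06, (1, 1, 0, 1): 0.05, (1, 1, 1, 0): 0.10, (1, 1, 1, 1): 0.05,
}

E, B, A, N = 0, 1, 2, 3   # positions of the variables in each key

def prob(event):
    """Marginal probability of a partial assignment, e.g. {N: 1} or {B: 1, N: 1}."""
    return sum(p for assignment, p in joint.items()
               if all(assignment[var] == val for var, val in event.items()))

# Reasoning by marginalization and conditioning:
print("P(N=T)       =", prob({N: 1}))
print("P(B=T | N=T) =", prob({B: 1, N: 1}) / prob({N: 1}))
```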

A Simple Example

Network structure: Earthquake → Alarm, Burglary → Alarm, Alarm → NeighborCalls

P(A | E, B):
E  B  | A=F   A=T
F  F  | 0.99  0.01
F  T  | 0.1   0.9
T  F  | 0.3   0.7
T  T  | 0.01  0.99

P(N | A):
A  | N=F  N=T
F  | 0.9  0.1
T  | 0.2  0.8

P(E):
E=F  E=T
0.9  0.1

P(B):
B=F  B=T
0.7  0.3

8 independent parameters
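For comparison, a minimal sketch (again, not from the slides) that computes the same conditional query from the factored form P(E)P(B)P(A|E,B)P(N|A) using the CPTs above.

```python
# A minimal sketch (not from the slides): the same query computed from the
# factored form P(E) P(B) P(A|E,B) P(N|A), using the CPTs above.
p_e_true = 0.1                                        # P(E=T)
p_b_true = 0.3                                        # P(B=T)
p_a_true = {(0, 0): 0.01, (0, 1): 0.9,                # P(A=T | E, B), keyed by (E, B)
            (1, 0): 0.7,  (1, 1): 0.99}
p_n_true = {0: 0.1, 1: 0.8}                           # P(N=T | A)

def joint(e, b, a, n):
    """P(E=e, B=b, A=a, N=n) as a product of the four local CPDs."""
    pe = p_e_true if e else 1 - p_e_true
    pb = p_b_true if b else 1 - p_b_true
    pa = p_a_true[(e, b)] if a else 1 - p_a_true[(e, b)]
    pn = p_n_true[a] if n else 1 - p_n_true[a]
    return pe * pb * pa * pn

# The 8 CPD parameters determine all 16 joint entries, e.g.:
num = sum(joint(e, 1, a, 1) for e in (0, 1) for a in (0, 1))
den = sum(joint(e, b, a, 1) for e in (0, 1) for b in (0, 1) for a in (0, 1))
print("P(B=T | N=T) =", num / den)
```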

Example Bayesian Network


The “Alarm” network for monitoring intensive care
patients


509 parameters (full joint: 2^37)


37 variables



[Figure: the 37-node ALARM network graph; nodes include MINVOLSET, VENTMACH, DISCONNECT, VENTTUBE, KINKEDTUBE, INTUBATION, PULMEMBOLUS, PAP, SHUNT, ANAPHYLAXIS, TPR, HYPOVOLEMIA, LVFAILURE, STROKEVOLUME, LVEDVOLUME, CVP, PCWP, CO, BP, HISTORY, ERRLOWOUTPUT, HRBP, ERRCAUTER, HREKG, HRSAT, HR, CATECHOL, INSUFFANESTH, FIO2, PVSAT, SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, MINVOL, PRESS]

Application: Clustering Users


Input: TV shows that each user watches


Output: TV show “clusters”


Assumption: shows watched by same users are similar



Class 1: Power rangers, Animaniacs, X-men, Tazmania, Spider man

Class 2: Young and restless, Bold and the beautiful, As the world turns, Price is right, CBS eve news

Class 3: Tonight show, Conan O'Brien, NBC nightly news, Later with Kinnear, Seinfeld

Class 4: 60 minutes, NBC nightly news, CBS eve news, Murder she wrote, Matlock

Class 5: Seinfeld, Friends, Mad about you, ER, Frasier

App.: Recommendation Systems


Given user preferences, suggest recommendations


Example: Amazon.com



Input: movie preferences of many users


Solution: model correlations between movie features


Users that like comedy, often like drama


Users that like action, often do not like cartoons


Users that like Robert De Niro films often like Al Pacino films


Given user preferences, can predict probability that new
movies match preferences

Diagnostic Systems


Diagnostic indexing for the home health site at Microsoft


Enter symptoms → recommend multimedia content

Online TroubleShooters

App.: Finding Regulatory Networks

[Figure: a module network. The expression level in each module is a function of the expression of its regulators: for each experiment and gene, the model asks what module gene “g” belongs to, and P(Level | Module, Regulators) gives the expression level given the module and the expression of Regulator 1, Regulator 2, and Regulator 3 in that experiment (example regulators shown: BMH1, HAP4).]
App.: Finding Regulatory Networks

[Figure: an inferred regulatory network. Regulators (signaling molecules and transcription factors, e.g., Hap4, Xbp1, Yer184c, Yap6, Gat1, Ime4, Lsg1, Msn4, Gac1, Gis1, Not3, Sip2, Kin82, Cmk1, Tpk1, Tpk2, Ppt1, Pph3, Bmh1, Gcn20, Ypl230w) connect to numbered modules grouped by function (e.g., amino acid metabolism; energy and cAMP signaling; DNA and RNA processing; nuclear) and annotated with enriched cis-regulatory motifs (e.g., STRE, HAP234, REPCAR, CAT8, ADR1, HSF, HAC1, XBP1, MCM1, ABF_C, GATA, GCN4, CBF1_B, GCR1, MIG1). Edges distinguish regulation supported in the literature from inferred regulation; experimentally tested regulators are marked.]

Prerequisites


Probability theory


Conditional probabilities


Joint distribution


Random variables


Information theory


Function optimization


Graph theory


Computational complexity


Probability Theory


Probability distribution P over (Ω, S) is a mapping from events in S such that:

P(α) ≥ 0 for all α ∈ S

P(Ω) = 1

If α, β ∈ S and α ∩ β = ∅, then P(α ∪ β) = P(α) + P(β)


Conditional Probability: P(α | β) = P(α ∩ β) / P(β)

Chain Rule: P(α ∩ β) = P(β) P(α | β)

Bayes Rule: P(α | β) = P(β | α) P(α) / P(β)

Conditional Independence: P(α | β ∩ γ) = P(α | γ)
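A minimal sketch (not from the slides) that checks these identities numerically on an arbitrary toy joint over two binary events; the numbers are illustrative only.

```python
# A minimal sketch (not from the slides): checking conditional probability, the
# chain rule and Bayes rule numerically on an arbitrary toy joint over two
# binary events alpha and beta (encoded as 1 = event occurs, 0 = it does not).
p = {(1, 1): 0.2, (1, 0): 0.1, (0, 1): 0.3, (0, 0): 0.4}

p_alpha = p[(1, 1)] + p[(1, 0)]
p_beta = p[(1, 1)] + p[(0, 1)]
p_alpha_and_beta = p[(1, 1)]

p_alpha_given_beta = p_alpha_and_beta / p_beta               # conditional probability
p_beta_given_alpha = p_alpha_given_beta * p_beta / p_alpha   # Bayes rule

assert abs(p_beta * p_alpha_given_beta - p_alpha_and_beta) < 1e-12   # chain rule
assert abs(p_beta_given_alpha - p_alpha_and_beta / p_alpha) < 1e-12  # sanity check
print("P(a|b) =", p_alpha_given_beta, "  P(b|a) =", p_beta_given_alpha)
```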
Random Variables & Notation


Random variable: a function from Ω to a value

Categorical / Ordinal / Continuous

Val(X): set of possible values of RV X

Upper case letters denote RVs (e.g., X, Y, Z)

Upper case bold letters denote sets of RVs (e.g., X, Y)

Lower case letters denote RV values (e.g., x, y, z)

Lower case bold letters denote RV set values (e.g., x)

Values for categorical RVs with |Val(X)|=k: x1, x2, …, xk

Marginal distribution over X: P(X)

Conditional independence: X is independent of Y given Z if:
P(X=x | Y=y, Z=z) = P(X=x | Z=z) for all x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z) in P

Expectation


Discrete RVs: E_P[X] = Σ_x x P(x)

Continuous RVs: E_P[X] = ∫ x p(x) dx

Linearity of expectation:
E_P[X + Y] = Σ_{x,y} (x + y) P(x, y)
           = Σ_{x,y} x P(x, y) + Σ_{x,y} y P(x, y)
           = Σ_x x P(x) + Σ_y y P(y)
           = E_P[X] + E_P[Y]

Expectation of products (when X ⊥ Y in P):
E_P[XY] = Σ_{x,y} x y P(x, y)
        = Σ_x x P(x) Σ_y y P(y)      (independence assumption: P(x, y) = P(x) P(y))
        = E_P[X] E_P[Y]
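A minimal numeric check (not from the slides) of the two expectation identities above, using a small made-up pair of independent RVs.

```python
# A minimal sketch (not from the slides): numeric check of linearity of
# expectation and of E[XY] = E[X]E[Y] when the joint is built under the
# independence assumption P(x, y) = P(x) P(y).
px = {0: 0.3, 1: 0.7}
py = {1: 0.4, 2: 0.6}
pxy = {(x, y): px[x] * py[y] for x in px for y in py}

expect = lambda dist: sum(v * p for v, p in dist.items())
e_sum = sum((x + y) * p for (x, y), p in pxy.items())
e_prod = sum(x * y * p for (x, y), p in pxy.items())

assert abs(e_sum - (expect(px) + expect(py))) < 1e-12    # E[X+Y] = E[X] + E[Y]
assert abs(e_prod - expect(px) * expect(py)) < 1e-12     # needs X and Y independent
```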

Variance


Variance of RV:
Var_P[X] = E_P[(X − E_P[X])²]
         = E_P[X² − 2·X·E_P[X] + E_P[X]²]
         = E_P[X²] − 2·E_P[X]·E_P[X] + E_P[X]²
         = E_P[X²] − E_P[X]²

If X and Y are independent: Var[X+Y] = Var[X] + Var[Y]

Var[aX+b] = a² Var[X]
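A minimal numeric check (not from the slides) of the variance identities above on a small made-up distribution.

```python
# A minimal sketch (not from the slides): checking Var[X] = E[X^2] - E[X]^2 and
# Var[aX + b] = a^2 Var[X] on a small discrete distribution.
px = {0: 0.2, 1: 0.5, 3: 0.3}

def expect(f):
    return sum(f(x) * p for x, p in px.items())

mean = expect(lambda x: x)
var = expect(lambda x: (x - mean) ** 2)                      # definition of variance
assert abs(var - (expect(lambda x: x * x) - mean ** 2)) < 1e-12

a, b = 2.0, 5.0
mean_ab = expect(lambda x: a * x + b)
var_ab = expect(lambda x: (a * x + b - mean_ab) ** 2)
assert abs(var_ab - a * a * var) < 1e-12
```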













Information Theory


Entropy: H_P(X) = − Σ_x P(x) log P(x)

We use log base 2 to interpret entropy as bits of information

Entropy of X is a lower bound on avg. # of bits to encode values of X

0 ≤ H_P(X) ≤ log |Val(X)| for any distribution P(X)


Conditional entropy: H_P(X | Y) = − Σ_{x,y} P(x, y) log P(x | y) = H_P(X, Y) − H_P(Y)

Information only helps: H_P(X | Y) ≤ H_P(X)

Mutual information: I_P(X;Y) = H_P(X) − H_P(X | Y) = Σ_{x,y} P(x, y) log [ P(x | y) / P(x) ]

0 ≤ I_P(X;Y) ≤ H_P(X)

Symmetry: I_P(X;Y) = I_P(Y;X)

I_P(X;Y) = 0 iff X and Y are independent

Chain rule of entropies: H_P(X1,…,Xn) = H_P(X1) + H_P(X2 | X1) + … + H_P(Xn | X1,…,Xn−1)
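A minimal sketch (not from the slides) computing these quantities in bits for a small made-up joint P(X, Y).

```python
# A minimal sketch (not from the slides): entropy, conditional entropy and
# mutual information (in bits) for a small illustrative joint P(X, Y).
from math import log2

pxy = {("x0", "y0"): 0.3, ("x0", "y1"): 0.2, ("x1", "y0"): 0.1, ("x1", "y1"): 0.4}
px, py = {}, {}
for (x, y), p in pxy.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

H = lambda dist: -sum(p * log2(p) for p in dist.values() if p > 0)

H_x_given_y = H(pxy) - H(py)           # H(X|Y) = H(X,Y) - H(Y)
I_xy = H(px) - H_x_given_y             # I(X;Y) = H(X) - H(X|Y)

assert -1e-12 <= I_xy <= H(px) + 1e-12          # 0 <= I(X;Y) <= H(X)
print("H(X) =", H(px), " H(X|Y) =", H_x_given_y, " I(X;Y) =", I_xy)
```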
Distances Between Distributions


Relative Entropy: D(P‖Q) = Σ_x P(x) log [ P(x) / Q(x) ]

D(P‖Q) ≥ 0

D(P‖Q) = 0 iff P = Q

Not a distance metric (no symmetry, no triangle inequality)


L1 distance: ‖P − Q‖₁ = Σ_x |P(x) − Q(x)|

L2 distance: ‖P − Q‖₂ = ( Σ_x (P(x) − Q(x))² )^½

L∞ distance: ‖P − Q‖∞ = max_x |P(x) − Q(x)|
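A minimal sketch (not from the slides) computing these four distances for two small made-up distributions over the same values.

```python
# A minimal sketch (not from the slides): relative entropy (KL divergence) and
# the L1, L2 and L-infinity distances between two distributions over the same values.
from math import log2, sqrt

P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"a": 0.4, "b": 0.4, "c": 0.2}

kl = sum(P[x] * log2(P[x] / Q[x]) for x in P)          # D(P||Q); 0 iff P == Q
l1 = sum(abs(P[x] - Q[x]) for x in P)
l2 = sqrt(sum((P[x] - Q[x]) ** 2 for x in P))
linf = max(abs(P[x] - Q[x]) for x in P)

print(f"D(P||Q)={kl:.4f}  L1={l1:.4f}  L2={l2:.4f}  Linf={linf:.4f}")
# Note the asymmetry: D(Q||P) generally differs from D(P||Q).
print("D(Q||P) =", sum(Q[x] * log2(Q[x] / P[x]) for x in Q))
```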





Optimization Theory


Find values θ1,…,θn such that f(θ1,…,θn) = max_{θ'1,…,θ'n} f(θ'1,…,θ'n)


Optimization strategies

Solve gradient analytically and verify local maximum

Gradient search: guess initial values, and improve iteratively

Gradient ascent

Line search

Conjugate gradient

Lagrange multipliers

Solve maximization problem with constraints C: c_j(θ) = 0, j = 1,…,n

Maximize L(θ, λ) = f(θ) + Σ_j λ_j c_j(θ)
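A minimal sketch (not from the slides) illustrating two of these strategies on toy problems; the objective functions and constants are my own choices.

```python
# A minimal sketch (not from the slides): two of the strategies above on toy
# problems.  (1) gradient ascent on a concave f; (2) Lagrange multipliers for
# maximizing sum_i c_i*log(theta_i) subject to sum_i theta_i = 1, whose
# closed-form solution theta_i = c_i / sum_j c_j follows from setting the
# Lagrangian's gradient to zero.
from math import isclose

# (1) Gradient ascent on f(t) = -(t - 3)^2, with gradient f'(t) = -2(t - 3)
t, lr = 0.0, 0.1
for _ in range(200):
    t += lr * (-2 * (t - 3))       # guess an initial value, improve iteratively
assert isclose(t, 3.0, abs_tol=1e-6)

# (2) Lagrange-multiplier solution of the constrained problem
c = [5.0, 2.0, 3.0]
theta = [ci / sum(c) for ci in c]
assert isclose(sum(theta), 1.0)
print("argmax f(t) ~", t, "  constrained optimum theta =", theta)
```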
Graph Theory


Undirected graph


Directed graph


Complete graph (every two nodes connected)


Acyclic graph


Partially directed acyclic graph (PDAG)


Induced graph


Sub-graph


Graph algorithms


Shortest path from node X1 to all other nodes (BFS)
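A minimal sketch (not from the slides) of BFS shortest-path distances on a small made-up graph.

```python
# A minimal sketch (not from the slides): BFS shortest-path distances from a
# source node to all other nodes of an unweighted graph.
from collections import deque

graph = {                      # adjacency lists for a small example graph
    "X1": ["X2", "X3"],
    "X2": ["X1", "X4"],
    "X3": ["X1", "X4"],
    "X4": ["X2", "X3", "X5"],
    "X5": ["X4"],
}

def bfs_distances(g, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in g[u]:
            if v not in dist:          # first visit = shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

print(bfs_distances(graph, "X1"))      # {'X1': 0, 'X2': 1, 'X3': 1, 'X4': 2, 'X5': 3}
```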


Representing Joint Distributions


Random variables: X1,…,Xn

P is a joint distribution over X1,…,Xn

If X1,…,Xn are binary, need 2^n parameters to describe P

Can we represent P more compactly?


Key: Exploit independence properties

Independent Random Variables


Two variables X and Y are independent if


P(X=x|Y=y) = P(X=x) for all values x,y


Equivalently, knowing Y does not change predictions of X



If X and Y are independent then:


P(X, Y) = P(X|Y)P(Y) = P(X)P(Y)



If X1,…,Xn are independent then:

P(X1,…,Xn) = P(X1) ⋯ P(Xn)

O(n) parameters

All 2^n probabilities are implicitly defined


Cannot represent many types of distributions

Conditional Independence


X and Y are conditionally independent given Z if


P(X=x|Y=y, Z=z) = P(X=x|Z=z) for all values x, y, z


Equivalently, if we know Z, then knowing Y does not change
predictions of X


Notation: Ind(X;Y | Z) or (X ⊥ Y | Z)


Conditional Parameterization


S = Score on test, Val(S) = {s0, s1}

I = Intelligence, Val(I) = {i0, i1}


Joint parameterization P(I,S): 3 parameters

I   S   P(I,S)
i0  s0  0.665
i0  s1  0.035
i1  s0  0.06
i1  s1  0.24

Conditional parameterization P(I) and P(S|I): 3 parameters

P(I):
i0   i1
0.7  0.3

P(S|I):
I   | s0    s1
i0  | 0.95  0.05
i1  | 0.2   0.8
Alternative parameterization: P(S) and P(I|S)
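A minimal sketch (not from the slides) verifying on the numbers above that the conditional parameterization reproduces the joint.

```python
# A minimal sketch (not from the slides): verifying that the conditional
# parameterization P(I) P(S|I) reproduces the joint P(I,S) given above.
p_i = {"i0": 0.7, "i1": 0.3}
p_s_given_i = {"i0": {"s0": 0.95, "s1": 0.05},
               "i1": {"s0": 0.2,  "s1": 0.8}}
p_is = {("i0", "s0"): 0.665, ("i0", "s1"): 0.035,
        ("i1", "s0"): 0.06,  ("i1", "s1"): 0.24}

for (i, s), p in p_is.items():
    assert abs(p - p_i[i] * p_s_given_i[i][s]) < 1e-12
print("P(I,S) = P(I) P(S|I) for all entries; 3 parameters either way")
```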

Conditional Parameterization


S = Score on test, Val(S) = {s0, s1}

I = Intelligence, Val(I) = {i0, i1}

G = Grade, Val(G) = {g0, g1, g2}

Assume that G and S are independent given I


Joint parameterization

2 × 2 × 3 − 1 = 11 independent parameters


Conditional parameterization has

P(I,S,G) = P(I) P(S|I) P(G|I,S) = P(I) P(S|I) P(G|I)

P(I) - 1 independent parameter

P(S|I) - 2 × 1 independent parameters

P(G|I) - 2 × 2 independent parameters

7 independent parameters

Naïve Bayes Model


Class variable C, Val(C) = {c1,…,ck}

Evidence variables X1,…,Xn

Naïve Bayes assumption: evidence variables are conditionally independent given C

P(C, X1,…,Xn) = P(C) ∏_{i=1..n} P(Xi | C)

Applications in medical diagnosis, text classification

Used as a classifier:

P(C = c1 | x1,…,xn) / P(C = c2 | x1,…,xn) = [ P(C = c1) / P(C = c2) ] ∏_{i=1..n} [ P(xi | C = c1) / P(xi | C = c2) ]

Problem: Double counting correlated evidence
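A minimal sketch (not from the slides) of naïve Bayes used as a classifier; the CPD numbers are made up for illustration.

```python
# A minimal sketch (not from the slides): a tiny naive Bayes classifier.
# The CPD numbers below are made-up illustrative values, not from the course.
p_c = {"c1": 0.6, "c2": 0.4}
# P(X_i = 1 | C) for two binary evidence variables
p_x_given_c = {"c1": [0.8, 0.3], "c2": [0.2, 0.7]}

def posterior(x):
    """P(C | x1,...,xn) up to normalization, using the naive Bayes factorization."""
    scores = {}
    for c in p_c:
        score = p_c[c]
        for i, xi in enumerate(x):
            pi = p_x_given_c[c][i]
            score *= pi if xi else 1 - pi      # P(x_i | c)
        scores[c] = score
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(posterior([1, 0]))   # classify by picking the argmax class
```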
Bayesian Network (Informal)


Directed acyclic graph G


Nodes represent random variables


Edges represent direct influences between random variables


Local probability models


[Figures: Example 1: a network over I and S. Example 2: a network over I, S and G. Naïve Bayes: C → X1, X2, …, Xn]

Bayesian Network (Informal)


Represent a joint distribution


Specifies the probability for P(X = x)

Specifies the probability for P(X = x | E = e)


Allows for reasoning patterns

Prediction (e.g., intelligent → high scores)

Explanation (e.g., low score → not intelligent)

Explaining away (different causes for same effect interact)

[Figure: Example 2 network over I, S, G]

Bayesian Network Structure


Directed acyclic graph G


Nodes X1,…,Xn represent random variables

G encodes local Markov assumptions

Xi is independent of its non-descendants given its parents

Formally: (Xi ⊥ NonDesc(Xi) | Pa(Xi))

[Figure: example DAG over nodes A, B, C, D, E, F, G]

Example: (E ⊥ {A, C, D, F} | B)


Independency Mappings (I-Maps)

Let P be a distribution over X

Let I(P) be the independencies (X ⊥ Y | Z) in P

A Bayesian network structure G is an I-map (independency mapping) of P if I(G) ⊆ I(P)

Example: two candidate graphs over I and S: G1 with no edge between I and S (I(G1) = {I ⊥ S}), and G2: I → S (I(G2) = ∅)

P1(I,S):
I   S   Prob.
i0  s0  0.25
i0  s1  0.25
i1  s0  0.25
i1  s1  0.25

I(P1) = {I ⊥ S}, so both G1 and G2 are I-maps of P1

P2(I,S):
I   S   Prob.
i0  s0  0.4
i0  s1  0.3
i1  s0  0.2
i1  s1  0.1

I(P2) = ∅, so only G2 is an I-map of P2
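A minimal sketch (not from the slides) checking numerically which of the two example distributions satisfies (I ⊥ S).

```python
# A minimal sketch (not from the slides): checking whether (I ⊥ S) is in I(P)
# for the two example distributions above.
def independent(p_is, tol=1e-9):
    vals_i = {i for i, _ in p_is}
    vals_s = {s for _, s in p_is}
    p_i = {i: sum(p for (ii, _), p in p_is.items() if ii == i) for i in vals_i}
    p_s = {s: sum(p for (_, ss), p in p_is.items() if ss == s) for s in vals_s}
    return all(abs(p_is[(i, s)] - p_i[i] * p_s[s]) < tol
               for i in vals_i for s in vals_s)

P1 = {("i0", "s0"): 0.25, ("i0", "s1"): 0.25, ("i1", "s0"): 0.25, ("i1", "s1"): 0.25}
P2 = {("i0", "s0"): 0.4,  ("i0", "s1"): 0.3,  ("i1", "s0"): 0.2,  ("i1", "s1"): 0.1}

print("I _|_ S in P1:", independent(P1))   # True  -> I(P1) = {I ⊥ S}
print("I _|_ S in P2:", independent(P2))   # False -> I(P2) = ∅
```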




Factorization Theorem


If G is an I-Map of P, then P(X1,…,Xn) = ∏_{i=1..n} P(Xi | Pa(Xi))

Proof:

wlog. X1,…,Xn is an ordering consistent with G

By chain rule: P(X1,…,Xn) = ∏_{i=1..n} P(Xi | X1,…,Xi−1)

From assumption: Pa(Xi) ⊆ {X1,…,Xi−1} and {X1,…,Xi−1} − Pa(Xi) ⊆ NonDesc(Xi)

Since G is an I-Map: (Xi ; NonDesc(Xi) | Pa(Xi)) ∈ I(P)

Hence: P(Xi | X1,…,Xi−1) = P(Xi | Pa(Xi))

Factorization Implies I-Map

If P(X1,…,Xn) = ∏_{i=1..n} P(Xi | Pa(Xi)), then G is an I-Map of P

Proof:

Need to show (Xi ; NonDesc(Xi) | Pa(Xi)) ∈ I(P), or that P(Xi | NonDesc(Xi)) = P(Xi | Pa(Xi))

wlog. X1,…,Xn is an ordering consistent with G

P(Xi | NonDesc(Xi)) = P(Xi, NonDesc(Xi)) / P(NonDesc(Xi)) = ∏_{k=1..i} P(Xk | Pa(Xk)) / ∏_{k=1..i−1} P(Xk | Pa(Xk)) = P(Xi | Pa(Xi))
Bayesian Network Definition


A Bayesian network is a pair (G,P)


P factorizes over G


P is specified as a set of CPDs associated with G’s nodes




Parameters


Joint distribution: 2^n

Bayesian network (bounded in-degree k): n·2^k
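A minimal sketch (not from the slides) of this parameter-count comparison for a few values of n, assuming binary variables and an in-degree bound k.

```python
# A minimal sketch (not from the slides): comparing parameter counts for n
# binary variables: the full joint versus a Bayesian network in which every
# node has at most k parents (each CPD then needs at most 2^k numbers).
def full_joint_params(n):
    return 2 ** n - 1          # independent entries of the full joint

def bn_params(n, k):
    return n * 2 ** k          # the slide's bound for bounded in-degree k

for n in (10, 20, 37):
    print(f"n={n}: full joint {full_joint_params(n)}  vs  BN bound {bn_params(n, 4)} (k=4)")
```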

Bayesian Network Design


Variable considerations


Clarity test: can an omniscient being determine its value?


Hidden variables?


Irrelevant variables


Structure considerations


Causal order of variables


Which independencies (approximately) hold?


Probability considerations


Zero probabilities


Orders of magnitude


Relative values