# Introduction to Probabilistic Graphical Models

Eran Segal

Weizmann Institute

Logistics

Staff:

Instructor: Eran Segal (eran.segal@weizmann.ac.il, room 149)

Teaching Assistants (room 125)

Course information:

WWW: http://www.weizmann.ac.il/math/pgm

Course book:

“Bayesian Networks and Beyond”, Daphne Koller (Stanford) & Nir Friedman (Hebrew U.)

Course structure

One weekly meeting

Sun: 9am-11am

Homework assignments

2 weeks to complete each

40% of final grade

Final exam

3 hour class exam, date will be announced

60% of final grade

Probabilistic Graphical Models

Tool for representing complex systems and
performing sophisticated reasoning tasks

Fundamental notion:

Modularity

Complex systems are built by combining simpler parts

Why have a model?

Compact and modular representation of complex systems

Ability to execute complex reasoning patterns

Make predictions

Generalize from particular problem

Probabilistic Graphical Models

Increasingly important in Machine Learning

Many classical probabilistic problems in statistics,
information theory, pattern recognition, and statistical
mechanics are special cases of the formalism

Graphical models provide a common framework

Advantage: specialized techniques developed in one field
can be transferred between research communities

Representation: Graphs

Intuitive data structure for modeling highly-interacting sets of variables

Explicit model for modularity

Data structure that allows for design of efficient general-purpose algorithms

Reasoning: Probability Theory

Well understood framework for
modeling uncertainty

Partial knowledge of the state of the world

Noisy observations

Phenomenon not covered by our model

Inherent stochasticity

Clear semantics

Can be learned from data

Probabilistic Reasoning

In this course we will learn:

Semantics of probabilistic graphical models (PGMs)

Bayesian networks

Markov networks

Answering queries in a PGM (“inference”)

Learning PGMs from data (“learning”)

Modeling temporal processes with PGMs

Hidden Markov Models (HMMs) as a special case

Course Outline

| Week | Topic | Chapters |
|------|-------|----------|
| 1 | Introduction, Bayesian network representation | 1-3 |
| 2 | Bayesian network representation cont. | 1-3 |
| 3 | Local probability models | 5 |
| 4 | Undirected graphical models | 4 |
| 5 | Exact inference | 9, 10 |
| 6 | Exact inference cont. | 9, 10 |
| 7 | Approximate inference | 12 |
| 8 | Approximate inference cont. | 12 |
| 9 | Learning: Parameters | 16, 17 |
| 10 | Learning: Parameters cont. | 16, 17 |
| 11 | Learning: Structure | 18 |
| 12 | Partially observed data | 19 |
| 13 | Learning undirected graphical models | 20 |
| 14 | Template models | 6 |
| 15 | Dynamic Bayesian networks | 15 |

A Simple Example

We want to model whether our neighbor will inform
us of the alarm being set off

The alarm can go off if

There is a burglary

There is an earthquake

Whether our neighbor calls depends on whether the
alarm is set off

A Simple Example

Variables: Earthquake (E), Burglary (B), Alarm (A), NeighborCalls (N)

| E | B | A | N | Prob. |
|---|---|---|---|-------|
| F | F | F | F | 0.01 |
| F | F | F | T | 0.04 |
| F | F | T | F | 0.05 |
| F | F | T | T | 0.01 |
| F | T | F | F | 0.02 |
| F | T | F | T | 0.07 |
| F | T | T | F | 0.2 |
| F | T | T | T | 0.1 |
| T | F | F | F | 0.01 |
| T | F | F | T | 0.07 |
| T | F | T | F | 0.13 |
| T | F | T | T | 0.04 |
| T | T | F | F | 0.06 |
| T | T | F | T | 0.05 |
| T | T | T | F | 0.1 |
| T | T | T | T | 0.05 |

$2^4 - 1 = 15$ independent parameters

A Simple Example

Structure: Earthquake (E) and Burglary (B) are parents of Alarm (A); Alarm is the parent of NeighborCalls (N).

P(A | E, B):

| E | B | A=F | A=T |
|---|---|-----|-----|
| F | F | 0.99 | 0.01 |
| F | T | 0.1 | 0.9 |
| T | F | 0.3 | 0.7 |
| T | T | 0.01 | 0.99 |

P(N | A):

| A | N=F | N=T |
|---|-----|-----|
| F | 0.9 | 0.1 |
| T | 0.2 | 0.8 |

P(E):

| E=F | E=T |
|-----|-----|
| 0.9 | 0.1 |

P(B):

| B=F | B=T |
|-----|-----|
| 0.7 | 0.3 |

8 independent parameters
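To make the factored representation concrete, here is a minimal Python sketch (my own illustration; the query choice and variable encoding are not from the slides) that rebuilds the full joint from these four CPDs by enumeration and answers P(Burglary = T | NeighborCalls = T):

```python
from itertools import product

# CPDs from the tables above (True = T, False = F).
P_E = {True: 0.1, False: 0.9}                       # P(E = T) = 0.1
P_B = {True: 0.3, False: 0.7}                       # P(B = T) = 0.3
P_A = {(False, False): 0.01, (False, True): 0.9,    # P(A = T | E, B)
       (True, False): 0.7,   (True, True): 0.99}
P_N = {False: 0.1, True: 0.8}                       # P(N = T | A)

def joint(e, b, a, n):
    """P(E=e, B=b, A=a, N=n) from the factorization P(E)P(B)P(A|E,B)P(N|A)."""
    pa = P_A[(e, b)] if a else 1 - P_A[(e, b)]
    pn = P_N[a] if n else 1 - P_N[a]
    return P_E[e] * P_B[b] * pa * pn

# Example query by brute-force enumeration: P(B = T | N = T).
num = sum(joint(e, True, a, True) for e, a in product([False, True], repeat=2))
den = sum(joint(e, b, a, True) for e, b, a in product([False, True], repeat=3))
print("P(B=T | N=T) =", num / den)
```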

Example Bayesian Network

The “Alarm” network for monitoring intensive care patients

37 variables, 509 parameters (full joint: 2^37 entries)

[Network figure; variables include: PCWP, CO, HRBP, HREKG, HRSAT, ERRCAUTER, HR, HISTORY, CATECHOL, SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, VENITUBE, DISCONNECT, MINVOLSET, VENTMACH, KINKEDTUBE, INTUBATION, PULMEMBOLUS, PAP, SHUNT, ANAPHYLAXIS, MINOVL, PVSAT, FIO2, PRESS, INSUFFANESTH, TPR, LVFAILURE, ERRBLOWOUTPUT, STROEVOLUME, LVEDVOLUME, HYPOVOLEMIA, CVP, BP]

Application: Clustering Users

Input: TV shows that each user watches

Output: TV show “clusters”

Assumption: shows watched by same users are similar

Class 1

Power rangers

Animaniacs

X-men

Tazmania

Spider man

Class 4

60 minutes

NBC nightly news

CBS eve news

Murder she wrote

Matlock

Class 2

Young and restless

Bold and the beautiful

As the world turns

Price is right

CBS eve news

Class 3

Tonight show

Conan O’Brien

NBC nightly news

Later with Kinnear

Seinfeld

Class 5

Seinfeld

Friends

ER

Frasier

App.: Recommendation Systems

Given user preferences, suggest recommendations

Example: Amazon.com

Input: movie preferences of many users

Solution: model correlations between movie features

Users that like comedy often also like drama

Users that like action often do not like cartoons

Users that like Robert De Niro films often like Al Pacino films

Given user preferences, can predict probability that new
movies match preferences

Diagnostic Systems

Diagnostic indexing for the home health site at Microsoft

Enter symptoms → receive recommended multimedia content

Online TroubleShooters

App.: Finding Regulatory Networks

Expression level in each module is a function of the expression of regulators

[Figure: module network relating Experiments, Genes, Expression levels, Modules, and Regulators 1-3, with P(Level | Module, Regulators); annotated questions include "What module does gene g belong to?" and "Expression level of Regulator 1 in an experiment".]

App.: Finding Regulatory Networks

[Figure: inferred regulatory network. Regulators (signaling molecules and transcription factors such as Hap4, Xbp1, Yap6, Gat1, Msn4, Gac1) connect to numbered modules grouped by function (amino acid metabolism, energy and cAMP signaling, DNA and RNA processing). The legend distinguishes regulation supported in the literature, inferred regulation, experimentally tested regulators, and enriched cis-regulatory motifs.]

Prerequisites

Probability theory

Conditional probabilities

Joint distribution

Random variables

Information theory

Function optimization

Graph theory

Computational complexity

Probability Theory

Probability distribution P over (Ω, S) is a mapping from events in S such that:

$P(\alpha) \geq 0$ for all $\alpha \in S$

$P(\Omega) = 1$

If $\alpha, \beta \in S$ and $\alpha \cap \beta = \emptyset$, then $P(\alpha \cup \beta) = P(\alpha) + P(\beta)$

Conditional Probability: $P(\alpha \mid \beta) = \dfrac{P(\alpha \cap \beta)}{P(\beta)}$

Chain Rule: $P(\alpha \cap \beta) = P(\alpha)\,P(\beta \mid \alpha)$

Bayes Rule: $P(\alpha \mid \beta) = \dfrac{P(\beta \mid \alpha)\,P(\alpha)}{P(\beta)}$

Conditional Independence: $\alpha$ and $\beta$ are conditionally independent given $\gamma$ if $P(\alpha \mid \beta \cap \gamma) = P(\alpha \mid \gamma)$

Random Variables & Notation

Random variable: a function from Ω to a value

Categorical / Ordinal / Continuous

Val(X): set of possible values of RV X

Upper case letters denote RVs (e.g., X, Y, Z)

Upper case bold letters denote sets of RVs (e.g., X, Y)

Lower case letters denote RV values (e.g., x, y, z)

Lower case bold letters denote RV set values (e.g., x)

Values for categorical RVs with |Val(X)| = k: $x^1, x^2, \ldots, x^k$

Marginal distribution over X: P(X)

Conditional independence: X is independent of Y given Z in P if

$P(X = x \mid Y = y, Z = z) = P(X = x \mid Z = z)$ for all $x \in Val(X)$, $y \in Val(Y)$, $z \in Val(Z)$

Expectation

Discrete RVs: $E_P[X] = \sum_{x} x\, P(x)$

Continuous RVs: $E_P[X] = \int x\, p(x)\, dx$

Linearity of expectation:

$E_P[X + Y] = \sum_{x,y} (x + y)\, P(x,y) = \sum_{x,y} x\, P(x,y) + \sum_{x,y} y\, P(x,y) = \sum_{x} x\, P(x) + \sum_{y} y\, P(y) = E_P[X] + E_P[Y]$

Expectation of products (when $X \perp Y$ in P):

$E_P[XY] = \sum_{x,y} xy\, P(x,y) = \sum_{x,y} xy\, P(x)\, P(y) = \Big(\sum_{x} x\, P(x)\Big)\Big(\sum_{y} y\, P(y)\Big) = E_P[X]\, E_P[Y]$

(the second equality uses the independence assumption)

Variance

Variance of RV:

$Var_P[X] = E_P[(X - E_P[X])^2] = E_P[X^2 - 2 X\, E_P[X] + E_P[X]^2] = E_P[X^2] - 2 E_P[X]\, E_P[X] + E_P[X]^2 = E_P[X^2] - E_P[X]^2$

If X and Y are independent: $Var[X+Y] = Var[X] + Var[Y]$

$Var[aX + b] = a^2\, Var[X]$
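A quick Monte Carlo sanity check of these identities (my own illustration; the equalities hold exactly for true expectations, while sample averages agree only up to sampling noise):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100_000)   # independent samples of X
y = rng.integers(0, 4, size=100_000)   # independent samples of Y

print(np.mean(x * y), np.mean(x) * np.mean(y))   # E[XY] vs E[X]E[Y]
print(np.var(x + y), np.var(x) + np.var(y))      # Var[X+Y] vs Var[X]+Var[Y]
print(np.var(3 * x + 2), 9 * np.var(x))          # Var[aX+b] vs a^2 Var[X]
```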

Information Theory

Entropy: $H_P(X) = -\sum_{x} P(x) \log P(x)$

We use log base 2 to interpret entropy as bits of information

Entropy of X is a lower bound on the avg. # of bits needed to encode values of X

$0 \leq H_P(X) \leq \log |Val(X)|$ for any distribution P(X)

Conditional entropy: $H_P(X \mid Y) = -\sum_{x,y} P(x,y) \log P(x \mid y) = H_P(X,Y) - H_P(Y)$

Information only helps: $H_P(X \mid Y) \leq H_P(X)$

Mutual information: $I_P(X;Y) = H_P(X) - H_P(X \mid Y) = \sum_{x,y} P(x,y) \log \dfrac{P(x \mid y)}{P(x)}$

$0 \leq I_P(X;Y) \leq H_P(X)$

Symmetry: $I_P(X;Y) = I_P(Y;X)$

$I_P(X;Y) = 0$ iff X and Y are independent

Chain rule of entropies: $H_P(X_1, \ldots, X_n) = H_P(X_1) + H_P(X_2 \mid X_1) + \cdots + H_P(X_n \mid X_1, \ldots, X_{n-1})$
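To connect these definitions to computation, here is a minimal NumPy sketch (my own illustration with a made-up 2x2 joint distribution) that evaluates H(X), H(X|Y), and I(X;Y) in bits:

```python
import numpy as np

# Joint distribution P(X, Y) as a 2x2 table (rows: x, columns: y); numbers are made up.
P = np.array([[0.4, 0.1],
              [0.2, 0.3]])

Px = P.sum(axis=1)   # marginal P(X)
Py = P.sum(axis=0)   # marginal P(Y)

def H(p):
    """Entropy in bits, ignoring zero-probability entries."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

H_X = H(Px)
H_XY = H(P.ravel())
H_X_given_Y = H_XY - H(Py)   # H(X|Y) = H(X,Y) - H(Y)
I_XY = H_X - H_X_given_Y     # I(X;Y) = H(X) - H(X|Y)

print(f"H(X) = {H_X:.3f} bits, H(X|Y) = {H_X_given_Y:.3f} bits, I(X;Y) = {I_XY:.3f} bits")
```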
Distances Between Distributions

Relative Entropy: $D(P \,\|\, Q) = \sum_{x} P(x) \log \dfrac{P(x)}{Q(x)}$

$D(P \,\|\, Q) \geq 0$

$D(P \,\|\, Q) = 0$ iff $P = Q$

Not a distance metric (no symmetry and no triangle inequality)

$L_1$ distance: $\|P - Q\|_1 = \sum_{x} |P(x) - Q(x)|$

$L_2$ distance: $\|P - Q\|_2 = \Big(\sum_{x} (P(x) - Q(x))^2\Big)^{1/2}$

$L_\infty$ distance: $\|P - Q\|_\infty = \max_{x} |P(x) - Q(x)|$
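A small companion sketch (again with made-up distributions, purely for illustration) evaluating the four quantities above:

```python
import numpy as np

# Two distributions over the same three values (numbers are illustrative).
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

kl = np.sum(P * np.log2(P / Q))      # D(P || Q), in bits
l1 = np.sum(np.abs(P - Q))           # L1 distance
l2 = np.sqrt(np.sum((P - Q) ** 2))   # L2 distance
linf = np.max(np.abs(P - Q))         # L-infinity distance

print(f"D(P||Q)={kl:.4f}  L1={l1:.3f}  L2={l2:.3f}  Linf={linf:.3f}")
```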

Optimization Theory

Find values $\theta_1, \ldots, \theta_n$ such that $f(\theta_1, \ldots, \theta_n) = \max_{\theta'_1, \ldots, \theta'_n} f(\theta'_1, \ldots, \theta'_n)$

Optimization strategies

Solve gradient analytically and verify local maximum

Gradient search: guess initial values, and improve iteratively

Line search

Lagrange multipliers

Solve maximization problem with constraints $C: c_j(\theta) = 0$, $j = 1, \ldots, n$

Maximize $L(\theta, \lambda) = f(\theta) + \sum_{j=1}^{n} \lambda_j\, c_j(\theta)$
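As a worked illustration of the Lagrange-multiplier recipe (my own example, not from the slides): maximize the entropy $H(p) = -\sum_i p_i \log p_i$ over a distribution on k values subject to the single constraint $\sum_i p_i - 1 = 0$.

$L(p, \lambda) = -\sum_{i=1}^{k} p_i \log p_i + \lambda \Big(\sum_{i=1}^{k} p_i - 1\Big), \qquad \frac{\partial L}{\partial p_i} = -\log p_i - 1 + \lambda = 0 \;\Rightarrow\; p_i = e^{\lambda - 1}$

All $p_i$ are equal, so the constraint forces $p_i = 1/k$ and the maximum entropy is $\log k$, consistent with the bound $H_P(X) \leq \log |Val(X)|$ above.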
Graph Theory

Undirected graph

Directed graph

Complete graph (every two nodes connected)

Acyclic graph

Partially directed acyclic graph (PDAG)

Induced graph

Sub-graph

Graph algorithms

Shortest path from node $X_1$ to all other nodes (BFS)
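A minimal BFS sketch (my own illustration) that computes shortest path lengths, in number of edges, from a source node of an unweighted graph given as an adjacency list:

```python
from collections import deque

def bfs_shortest_paths(adj, source):
    """Shortest path length (number of edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:          # first visit = shortest distance in an unweighted graph
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Example: a small undirected graph over nodes X1..X5.
adj = {"X1": ["X2", "X3"], "X2": ["X1", "X4"], "X3": ["X1", "X4"],
       "X4": ["X2", "X3", "X5"], "X5": ["X4"]}
print(bfs_shortest_paths(adj, "X1"))   # {'X1': 0, 'X2': 1, 'X3': 1, 'X4': 2, 'X5': 3}
```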

Representing Joint Distributions

Random variables: $X_1, \ldots, X_n$

P is a joint distribution over $X_1, \ldots, X_n$

If $X_1, \ldots, X_n$ are binary, we need $2^n$ parameters to describe P

Can we represent P more compactly?

Key: Exploit independence properties

Independent Random Variables

Two variables X and Y are independent if

P(X=x|Y=y) = P(X=x) for all values x,y

Equivalently, knowing Y does not change predictions of X

If X and Y are independent then:

P(X, Y) = P(X|Y)P(Y) = P(X)P(Y)

If $X_1, \ldots, X_n$ are independent then:

$P(X_1, \ldots, X_n) = P(X_1) \cdots P(X_n)$

O(n) parameters

All $2^n$ probabilities are implicitly defined

Cannot represent many types of distributions

Conditional Independence

X and Y are conditionally independent given Z if

P(X=x|Y=y, Z=z) = P(X=x|Z=z) for all values x, y, z

Equivalently, if we know Z, then knowing Y does not change
predictions of X

Notation: Ind(X; Y | Z) or $(X \perp Y \mid Z)$

Conditional Parameterization

S = Score on test, Val(S) = {$s^0$, $s^1$}

I = Intelligence, Val(I) = {$i^0$, $i^1$}

Joint parameterization P(I,S) (3 parameters):

| I | S | P(I,S) |
|---|---|--------|
| $i^0$ | $s^0$ | 0.665 |
| $i^0$ | $s^1$ | 0.035 |
| $i^1$ | $s^0$ | 0.06 |
| $i^1$ | $s^1$ | 0.24 |

Conditional parameterization P(I) and P(S|I) (3 parameters):

| I | P(I) |
|---|------|
| $i^0$ | 0.7 |
| $i^1$ | 0.3 |

| I | S=$s^0$ | S=$s^1$ |
|---|---------|---------|
| $i^0$ | 0.95 | 0.05 |
| $i^1$ | 0.2 | 0.8 |

Alternative parameterization: P(S) and P(I|S)
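A two-line check (my own sketch, not from the slides) that the conditional parameterization reproduces the joint table above:

```python
# Check that P(I) * P(S|I) reproduces the joint P(I,S) from the slide.
P_I = {"i0": 0.7, "i1": 0.3}
P_S_given_I = {("i0", "s0"): 0.95, ("i0", "s1"): 0.05,
               ("i1", "s0"): 0.2,  ("i1", "s1"): 0.8}

for i in P_I:
    for s in ("s0", "s1"):
        print(i, s, round(P_I[i] * P_S_given_I[(i, s)], 3))
# Prints 0.665, 0.035, 0.06, 0.24 -- the joint table above.
```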

Conditional Parameterization

S = Score on test, Val(S) = {$s^0$, $s^1$}

I = Intelligence, Val(I) = {$i^0$, $i^1$}

G = Grade, Val(G) = {$g^0$, $g^1$, $g^2$}

Assume that G and S are independent given I

Joint parameterization: $2 \cdot 2 \cdot 3 - 1 = 11$ independent parameters

Conditional parameterization has

P(I,S,G) = P(I)P(S|I)P(G|I,S) = P(I)P(S|I)P(G|I)

P(I): 1 independent parameter

P(S|I): $2 \cdot 1$ independent parameters

P(G|I): $2 \cdot 2$ independent parameters

7 independent parameters in total

Naïve Bayes Model

Class variable C, Val(C) = {$c^1, \ldots, c^k$}

Evidence variables $X_1, \ldots, X_n$

Naïve Bayes assumption: evidence variables are conditionally independent given C

$P(C, X_1, \ldots, X_n) = P(C) \prod_{i=1}^{n} P(X_i \mid C)$

Applications in medical diagnosis, text classification

Used as a classifier:

$\dfrac{P(C = c^1 \mid x_1, \ldots, x_n)}{P(C = c^2 \mid x_1, \ldots, x_n)} = \dfrac{P(C = c^1)}{P(C = c^2)} \prod_{i=1}^{n} \dfrac{P(x_i \mid C = c^1)}{P(x_i \mid C = c^2)}$

Problem: double counting of correlated evidence
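A minimal classifier sketch along these lines (my own illustration; all CPD numbers are made up):

```python
import numpy as np

# Naive Bayes with two classes and three binary evidence variables.
prior = {"c1": 0.6, "c2": 0.4}
# P(X_i = 1 | C) for each evidence variable i.
likelihood = {"c1": np.array([0.8, 0.3, 0.9]),
              "c2": np.array([0.2, 0.6, 0.5])}

def posterior(x):
    """P(C | x1,...,xn): multiply the prior by each conditional, then normalize."""
    x = np.asarray(x)
    scores = {}
    for c in prior:
        p_xi = np.where(x == 1, likelihood[c], 1 - likelihood[c])
        scores[c] = prior[c] * np.prod(p_xi)
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(posterior([1, 0, 1]))   # roughly {'c1': 0.95, 'c2': 0.05}
```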
Bayesian Network (Informal)

Directed acyclic graph G

Nodes represent random variables

Edges represent direct influences between random variables

Local probability models

[Figures: Example 1 is I → S; Example 2 is I → S and I → G; the Naïve Bayes model is C → $X_1, X_2, \ldots, X_n$]

Bayesian Network (Informal)

Represent a joint distribution

Specifies the probability P(X = x)

Specifies the probability P(X = x | E = e)

Allows for reasoning patterns

Prediction (e.g., intelligent → high scores)

Explanation (e.g., low score → not intelligent)

Explaining away (different causes for the same effect interact)

[Figure: Example 2, I → S and I → G]

Bayesian Network Structure

Directed acyclic graph G

Nodes $X_1, \ldots, X_n$ represent random variables

G encodes local Markov assumptions:

$X_i$ is independent of its non-descendants given its parents

Formally: $(X_i \perp \text{NonDesc}(X_i) \mid \text{Pa}(X_i))$

[Example graph over nodes A, B, C, D, E, F, G in which $(E \perp \{A, C, D, F\} \mid B)$]

Independency Mappings (I-Maps)

Let P be a distribution over X

Let I(P) be the set of independencies $(X \perp Y \mid Z)$ that hold in P

A Bayesian network structure G is an I-map (independency mapping) of P if $I(G) \subseteq I(P)$

For the two candidate structures over I and S: the disconnected graph has $I(G) = \{I \perp S\}$, while the graph I → S has $I(G) = \emptyset$.

Example distribution 1, with $I(P) = \{I \perp S\}$:

| I | S | P(I,S) |
|---|---|--------|
| $i^0$ | $s^0$ | 0.25 |
| $i^0$ | $s^1$ | 0.25 |
| $i^1$ | $s^0$ | 0.25 |
| $i^1$ | $s^1$ | 0.25 |

Example distribution 2, with $I(P) = \emptyset$:

| I | S | P(I,S) |
|---|---|--------|
| $i^0$ | $s^0$ | 0.4 |
| $i^0$ | $s^1$ | 0.3 |
| $i^1$ | $s^0$ | 0.2 |
| $i^1$ | $s^1$ | 0.1 |
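A quick sketch (my own, not from the slides) that tests whether $I \perp S$ holds in each table by comparing the joint with the product of its marginals:

```python
import numpy as np

def is_independent(joint, tol=1e-9):
    """True if the 2x2 joint table factorizes as the product of its marginals."""
    pi = joint.sum(axis=1, keepdims=True)   # P(I)
    ps = joint.sum(axis=0, keepdims=True)   # P(S)
    return np.allclose(joint, pi * ps, atol=tol)

P1 = np.array([[0.25, 0.25], [0.25, 0.25]])    # example distribution 1
P2 = np.array([[0.4, 0.3], [0.2, 0.1]])        # example distribution 2
print(is_independent(P1), is_independent(P2))  # True False
```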

Factorization Theorem

If G is an I-map of P, then $P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Pa}(X_i))$

Proof:

wlog. $X_1, \ldots, X_n$ is an ordering consistent with G

By the chain rule: $P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$

From the ordering assumption: $\text{Pa}(X_i) \subseteq \{X_1, \ldots, X_{i-1}\}$ and $\{X_1, \ldots, X_{i-1}\} \subseteq \text{Pa}(X_i) \cup \text{NonDesc}(X_i)$

Since G is an I-map: $(X_i \perp \text{NonDesc}(X_i) \mid \text{Pa}(X_i)) \in I(P)$

Hence $P(X_i \mid X_1, \ldots, X_{i-1}) = P(X_i \mid \text{Pa}(X_i))$, which gives the factorization

Factorization Implies I-Map

If P factorizes over G, i.e. $P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Pa}(X_i))$, then G is an I-map of P

Proof:

Need to show $(X_i \perp \text{NonDesc}(X_i) \mid \text{Pa}(X_i)) \in I(P)$, or that $P(X_i \mid \text{NonDesc}(X_i)) = P(X_i \mid \text{Pa}(X_i))$

wlog. $X_1, \ldots, X_n$ is an ordering consistent with G in which all non-descendants of $X_i$ precede it, so that $\text{NonDesc}(X_i) = \{X_1, \ldots, X_{i-1}\}$

$P(X_i \mid \text{NonDesc}(X_i)) = \dfrac{P(X_i, \text{NonDesc}(X_i))}{P(\text{NonDesc}(X_i))} = \dfrac{\prod_{k \leq i} P(X_k \mid \text{Pa}(X_k))}{\prod_{k < i} P(X_k \mid \text{Pa}(X_k))} = P(X_i \mid \text{Pa}(X_i))$

where the second equality follows by summing the factorization over $X_{i+1}, \ldots, X_n$, whose factors sum to 1

Bayesian Network Definition

A Bayesian network is a pair (G,P)

P factorizes over G

P is specified as a set of CPDs associated with G’s nodes

Parameters (for n binary variables)

Joint distribution: $2^n$

Bayesian network with bounded in-degree k: $n \cdot 2^k$
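For a rough sense of scale, a hypothetical calculation assuming binary variables (this is my own illustration; it is not the ALARM network's actual parameterization, whose variables are not all binary):

```python
n, k = 37, 4          # 37 binary variables, in-degree at most 4 (illustrative numbers)
print(2 ** n)         # 137438953472 entries in the full joint
print(n * 2 ** k)     # 592 parameters for a bounded in-degree Bayesian network
```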

Bayesian Network Design

Variable considerations

Clarity test: can an omniscient being determine its value?

Hidden variables?

Irrelevant variables

Structure considerations

Causal order of variables

Which independencies (approximately) hold?

Probability considerations

Zero probabilities

Orders of magnitude

Relative values