INTRODUCTION TO

Machine Learning

ETHEM ALPAYDIN

alpaydin@boun.edu.tr

http://www.cmpe.boun.edu.tr/~ethem/i2ml

Lecture Slides for

CHAPTER 3:

Bayesian Decision
Theory

Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)


Probability and Inference

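Result of tossing a coin is $\in \{\text{Heads}, \text{Tails}\}$

Random var $X \in \{1, 0\}$

Bernoulli: $P\{X\} = p_o^{X} (1 - p_o)^{(1 - X)}$, so $P\{X = 1\} = p_o$

Sample: $\mathcal{X} = \{x^t\}_{t=1}^{N}$

Estimation: $p_o = \sum_t x^t / N$ (the fraction of heads in the sample)

Prediction of next toss: Heads if $p_o > 1/2$, Tails otherwise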
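The estimation and prediction steps on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `estimate_p` and `predict_next` and the sample `tosses` are assumptions made for the example, with 1 standing for Heads and 0 for Tails.

```python
def estimate_p(sample):
    """Estimate p_o as the fraction of 1s (heads) in the sample: sum_t x^t / N."""
    return sum(sample) / len(sample)


def predict_next(p_hat):
    """Predict Heads if the estimate exceeds 1/2, Tails otherwise."""
    return "Heads" if p_hat > 0.5 else "Tails"


# Hypothetical sample of N = 10 tosses (1 = Heads, 0 = Tails).
tosses = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
p_hat = estimate_p(tosses)
print(p_hat, predict_next(p_hat))
```

With seven heads in ten tosses the estimate is 0.7, so the sketch predicts Heads.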


Classification

Credit scoring: Inputs are income and savings.

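Output is low-risk vs high-risk

Input: $\mathbf{x} = [x_1, x_2]^T$, Output: $C \in \{0, 1\}$

Prediction:

$$\text{choose } \begin{cases} C = 1 & \text{if } P(C = 1 \mid x_1, x_2) > 0.5 \\ C = 0 & \text{otherwise} \end{cases}$$

or equivalently

$$\text{choose } \begin{cases} C = 1 & \text{if } P(C = 1 \mid x_1, x_2) > P(C = 0 \mid x_1, x_2) \\ C = 0 & \text{otherwise} \end{cases}$$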

Bayes’ Rule

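$$P(C \mid \mathbf{x}) = \frac{P(C)\, p(\mathbf{x} \mid C)}{p(\mathbf{x})}$$

$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$

$$P(C = 0) + P(C = 1) = 1$$

$$p(\mathbf{x}) = p(\mathbf{x} \mid C = 1)\, P(C = 1) + p(\mathbf{x} \mid C = 0)\, P(C = 0)$$

$$P(C = 0 \mid \mathbf{x}) + P(C = 1 \mid \mathbf{x}) = 1$$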
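As a rough illustration of the two-class Bayes' rule, the sketch below computes the posterior of class 1 from a prior and two likelihood values. The function name `posterior_c1` and all numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def posterior_c1(prior_c1, lik_c1, lik_c0):
    """P(C=1 | x) = p(x | C=1) P(C=1) / p(x) for a two-class problem."""
    prior_c0 = 1.0 - prior_c1                           # P(C=0) + P(C=1) = 1
    evidence = lik_c1 * prior_c1 + lik_c0 * prior_c0    # p(x)
    return lik_c1 * prior_c1 / evidence


# Hypothetical values: P(C=1) = 0.2, p(x|C=1) = 0.6, p(x|C=0) = 0.1.
p1 = posterior_c1(prior_c1=0.2, lik_c1=0.6, lik_c0=0.1)
choice = 1 if p1 > 0.5 else 0
print(p1, choice)
```

Here the evidence is 0.12 + 0.08 = 0.2, so the posterior is 0.6 and class 1 is chosen even though its prior is small.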

Bayes’ Rule: K>2 Classes

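$$P(C_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_i)\, P(C_i)}{p(\mathbf{x})} = \frac{p(\mathbf{x} \mid C_i)\, P(C_i)}{\sum_{k=1}^{K} p(\mathbf{x} \mid C_k)\, P(C_k)}$$

$$P(C_i) \ge 0 \text{ and } \sum_{i=1}^{K} P(C_i) = 1$$

$$\text{choose } C_i \text{ if } P(C_i \mid \mathbf{x}) = \max_k P(C_k \mid \mathbf{x})$$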
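The K-class rule can be sketched the same way: multiply each likelihood by its prior, normalize by the sum, and pick the largest posterior. The helper name `posteriors` and the three-class numbers below are assumptions for illustration only.

```python
def posteriors(priors, likelihoods):
    """Return P(C_i | x) for every class, normalizing by the evidence p(x)."""
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x) = sum_k p(x | C_k) P(C_k)
    return [j / evidence for j in joint]


# Hypothetical three-class problem.
priors = [0.5, 0.3, 0.2]        # P(C_i), non-negative and summing to 1
likelihoods = [0.1, 0.4, 0.3]   # p(x | C_i) at one particular x
post = posteriors(priors, likelihoods)
best = max(range(len(post)), key=post.__getitem__)  # index of the max posterior
print(post, best)
```

In this example the second class wins although the first has the largest prior, because its likelihood at this x is much higher.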


Losses and Risks

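Actions: $\alpha_i$

Loss of $\alpha_i$ when the state is $C_k$: $\lambda_{ik}$

Expected risk (Duda and Hart, 1973)

$$R(\alpha_i \mid \mathbf{x}) = \sum_{k=1}^{K} \lambda_{ik}\, P(C_k \mid \mathbf{x})$$

$$\text{choose } \alpha_i \text{ if } R(\alpha_i \mid \mathbf{x}) = \min_k R(\alpha_k \mid \mathbf{x})$$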
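A small sketch may help show why minimizing expected risk can differ from picking the most probable class. The loss matrix and posteriors below are hypothetical, and `expected_risks` is a name introduced only for this example.

```python
def expected_risks(loss, post):
    """R(alpha_i | x) = sum_k lambda_ik * P(C_k | x), one value per action."""
    return [sum(l_ik * p_k for l_ik, p_k in zip(row, post)) for row in loss]


# Hypothetical asymmetric losses: row i is action alpha_i, column k is true class C_k.
loss = [
    [0.0, 10.0],  # alpha_1: costly if the true class is C_2
    [1.0, 0.0],   # alpha_2: mildly costly if the true class is C_1
]
post = [0.8, 0.2]  # hypothetical posteriors P(C_1 | x), P(C_2 | x)

risks = expected_risks(loss, post)
best_action = min(range(len(risks)), key=risks.__getitem__)
print(risks, best_action)
```

With these numbers the risks are 2.0 and 0.8, so the second action is chosen even though the first class is four times more probable.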


Losses and Risks: 0/1 Loss

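$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ 1 & \text{if } i \ne k \end{cases}$$

$$R(\alpha_i \mid \mathbf{x}) = \sum_{k=1}^{K} \lambda_{ik}\, P(C_k \mid \mathbf{x}) = \sum_{k \ne i} P(C_k \mid \mathbf{x}) = 1 - P(C_i \mid \mathbf{x})$$

For minimum risk, choose the most probable class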


Losses and Risks: Reject

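$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ \lambda & \text{if } i = K + 1 \\ 1 & \text{otherwise} \end{cases}, \quad 0 < \lambda < 1$$

$$R(\alpha_{K+1} \mid \mathbf{x}) = \sum_{k=1}^{K} \lambda\, P(C_k \mid \mathbf{x}) = \lambda$$

$$R(\alpha_i \mid \mathbf{x}) = \sum_{k \ne i} P(C_k \mid \mathbf{x}) = 1 - P(C_i \mid \mathbf{x})$$

$$\text{choose } C_i \text{ if } P(C_i \mid \mathbf{x}) > P(C_k \mid \mathbf{x})\ \ \forall k \ne i \text{ and } P(C_i \mid \mathbf{x}) > 1 - \lambda; \quad \text{reject otherwise}$$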
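The reject rule amounts to a threshold on the largest posterior. The sketch below is one possible reading of it: `decide_with_reject` is a hypothetical helper that returns the index of the chosen class, or `None` for reject, and ties between classes are broken by taking the first maximum (an assumption, since the slide requires a strict maximum).

```python
def decide_with_reject(post, lam):
    """Choose the most probable class if its posterior exceeds 1 - lam, else reject.

    Under 0/1 loss with reject cost lam (0 < lam < 1), the risk of choosing
    class i is 1 - P(C_i | x) and the risk of rejecting is lam.
    """
    best = max(range(len(post)), key=post.__getitem__)
    return best if post[best] > 1.0 - lam else None


# Hypothetical posteriors with reject cost lam = 0.3, so the threshold is 0.7.
print(decide_with_reject([0.5, 0.3, 0.2], lam=0.3))    # not confident enough
print(decide_with_reject([0.9, 0.05, 0.05], lam=0.3))  # confident in class 0
```

A smaller reject cost raises the threshold 1 - lam, so more inputs are rejected; as lam approaches 1 the rule never rejects a clear winner.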

Discriminant Functions

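$$\text{choose } C_i \text{ if } g_i(\mathbf{x}) = \max_k g_k(\mathbf{x}), \qquad g_i(\mathbf{x}),\ i = 1, \ldots, K$$

$$g_i(\mathbf{x}) = \begin{cases} -R(\alpha_i \mid \mathbf{x}) \\ P(C_i \mid \mathbf{x}) \\ p(\mathbf{x} \mid C_i)\, P(C_i) \end{cases}$$

$K$ decision regions $\mathcal{R}_1, \ldots, \mathcal{R}_K$

$$\mathcal{R}_i = \{\mathbf{x} \mid g_i(\mathbf{x}) = \max_k g_k(\mathbf{x})\}$$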


K=2 Classes

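Dichotomizer ($K = 2$) vs Polychotomizer ($K > 2$)

$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$$

$$\text{choose } \begin{cases} C_1 & \text{if } g(\mathbf{x}) > 0 \\ C_2 & \text{otherwise} \end{cases}$$

Log odds:

$$\log \frac{P(C_1 \mid \mathbf{x})}{P(C_2 \mid \mathbf{x})}$$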
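For two classes, the log odds can serve as the single discriminant g(x): it is positive exactly when the first class has posterior above 1/2. The following is a minimal sketch with hypothetical helper names `log_odds` and `choose`.

```python
import math


def log_odds(p_c1):
    """log P(C1 | x) / P(C2 | x), using P(C2 | x) = 1 - P(C1 | x)."""
    return math.log(p_c1 / (1.0 - p_c1))


def choose(p_c1):
    """Choose C1 if the discriminant g(x) is positive, C2 otherwise."""
    return "C1" if log_odds(p_c1) > 0 else "C2"


for p in (0.8, 0.5, 0.3):  # hypothetical posteriors P(C1 | x)
    print(p, log_odds(p), choose(p))
```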

Utility Theory

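Prob of state $k$ given evidence $\mathbf{x}$: $P(S_k \mid \mathbf{x})$

Utility of $\alpha_i$ when state is $k$: $U_{ik}$

Expected utility:

$$EU(\alpha_i \mid \mathbf{x}) = \sum_k U_{ik}\, P(S_k \mid \mathbf{x})$$

$$\text{Choose } \alpha_i \text{ if } EU(\alpha_i \mid \mathbf{x}) = \max_j EU(\alpha_j \mid \mathbf{x})$$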


Value of Information

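Expected utility using $\mathbf{x}$ only

$$EU(\mathbf{x}) = \max_i \sum_k U_{ik}\, P(S_k \mid \mathbf{x})$$

Expected utility using $\mathbf{x}$ and new feature $z$

$$EU(\mathbf{x}, z) = \max_i \sum_k U_{ik}\, P(S_k \mid \mathbf{x}, z)$$

$z$ is useful if $EU(\mathbf{x}, z) > EU(\mathbf{x})$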
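The comparison between EU(x) and EU(x, z) can be sketched directly from the two formulas. Everything below is hypothetical: the utility table, the two sets of state probabilities, and the helper name `best_expected_utility`. The sketch evaluates EU(x, z) for one observed value of z, as the slide does, rather than averaging over possible values of z.

```python
def best_expected_utility(utility, state_probs):
    """max_i sum_k U_ik * P(S_k | evidence), for the evidence behind state_probs."""
    return max(
        sum(u_ik * p_k for u_ik, p_k in zip(row, state_probs))
        for row in utility
    )


# Hypothetical utilities: row i is action alpha_i, column k is state S_k.
utility = [
    [100.0, -50.0],  # act: pays off in state 1, costs in state 2
    [0.0, 0.0],      # do nothing
]
p_given_x = [0.4, 0.6]    # hypothetical P(S_k | x)
p_given_xz = [0.9, 0.1]   # hypothetical P(S_k | x, z) after observing z

eu_x = best_expected_utility(utility, p_given_x)
eu_xz = best_expected_utility(utility, p_given_xz)
z_is_useful = eu_xz > eu_x
print(eu_x, eu_xz, z_is_useful)
```

With these numbers the best expected utility rises from about 10 to about 85 once z is observed, so z counts as useful under the slide's criterion.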

Bayesian Networks

Aka graphical models, probabilistic networks

Nodes are hypotheses (random vars) and the prob corresponds to our belief in the truth of the hypothesis

Arcs are direct influences between hypotheses

The structure is represented as a directed acyclic graph (DAG)

The parameters are the conditional probs on the arcs

(Pearl, 1988, 2000; Jensen, 1996; Lauritzen, 1996)


Causes and Bayes’ Rule

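Diagnostic inference: Knowing that the grass is wet, what is the probability that rain is the cause?

[Figure: network Rain → Wet grass, with the two directions of inference labeled causal and diagnostic]

$$P(R \mid W) = \frac{P(W \mid R)\, P(R)}{P(W)} = \frac{P(W \mid R)\, P(R)}{P(W \mid R)\, P(R) + P(W \mid {\sim}R)\, P({\sim}R)} = \frac{0.9 \times 0.4}{0.9 \times 0.4 + 0.2 \times 0.6} = 0.75$$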
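The diagnostic computation on this slide can be checked numerically. The sketch below uses the slide's values P(R) = 0.4, P(W|R) = 0.9, and P(W|~R) = 0.2; the variable names are chosen for the example.

```python
# Values from the slide.
p_r = 0.4                # P(R): prior probability of rain
p_w_given_r = 0.9        # P(W | R)
p_w_given_not_r = 0.2    # P(W | ~R)

# Evidence: P(W) = P(W|R) P(R) + P(W|~R) P(~R)
p_w = p_w_given_r * p_r + p_w_given_not_r * (1.0 - p_r)

# Bayes' rule: P(R | W) = P(W|R) P(R) / P(W)
p_r_given_w = p_w_given_r * p_r / p_w
print(round(p_r_given_w, 2))
```

The numerator is 0.36 and the evidence is 0.48, which gives the 0.75 shown on the slide.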


Causal vs Diagnostic Inference

Causal inference:

If the sprinkler is on, what is the probability that the grass is wet?

$$\begin{aligned} P(W \mid S) &= P(W \mid R, S)\, P(R \mid S) + P(W \mid {\sim}R, S)\, P({\sim}R \mid S) \\ &= P(W \mid R, S)\, P(R) + P(W \mid {\sim}R, S)\, P({\sim}R) \\ &= 0.95 \times 0.4 + 0.9 \times 0.6 = 0.92 \end{aligned}$$

Diagnostic inference: If the grass is wet, what is the probability that the sprinkler is on?

$$P(S \mid W) = 0.35 > 0.2 = P(S)$$

$$P(S \mid R, W) = 0.21$$

Explaining away: Knowing that it has rained decreases the probability that the sprinkler is on.
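The numbers on this slide can be reproduced by brute-force enumeration over the three variables. The slide gives P(R) = 0.4, P(S) = 0.2, P(W|R,S) = 0.95, and P(W|~R,S) = 0.9. The two remaining table entries, P(W|R,~S) = 0.9 and P(W|~R,~S) = 0.1, are not stated here and are assumed in this sketch; they are values that reproduce the quoted 0.35 and 0.21. The helper names `bern`, `joint`, and `prob` are also assumptions.

```python
from itertools import product

P_R = 0.4  # P(R), from the slide
P_S = 0.2  # P(S), from the slide
# P(W | R, S), keyed by (R, S). The two S=False entries are ASSUMED values.
P_W_GIVEN_RS = {
    (True, True): 0.95,    # from the slide
    (False, True): 0.90,   # from the slide
    (True, False): 0.90,   # assumption
    (False, False): 0.10,  # assumption
}


def bern(p, value):
    """Probability that a binary variable with P(True) = p takes `value`."""
    return p if value else 1.0 - p


def joint(r, s, w):
    """P(R, S, W) = P(R) P(S) P(W | R, S); R and S are independent a priori."""
    return bern(P_R, r) * bern(P_S, s) * bern(P_W_GIVEN_RS[(r, s)], w)


def prob(query, evidence):
    """P(query | evidence) by summing the joint over all consistent worlds."""
    num = den = 0.0
    for r, s, w in product([True, False], repeat=3):
        world = {"R": r, "S": s, "W": w}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(r, s, w)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den


p_w_given_s = prob({"W": True}, {"S": True})
p_s_given_w = prob({"S": True}, {"W": True})
p_s_given_rw = prob({"S": True}, {"R": True, "W": True})
print(round(p_w_given_s, 2), round(p_s_given_w, 2), round(p_s_given_rw, 2))
```

Under these assumptions the sketch gives 0.92, about 0.35, and about 0.21, which shows the explaining-away effect: learning that it rained lowers the sprinkler's posterior from 0.35 back to roughly its prior.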


Bayesian Networks: Causes

Causal inference:

$$\begin{aligned} P(W \mid C) ={} & P(W \mid R, S)\, P(R, S \mid C) \\ & + P(W \mid {\sim}R, S)\, P({\sim}R, S \mid C) \\ & + P(W \mid R, {\sim}S)\, P(R, {\sim}S \mid C) \\ & + P(W \mid {\sim}R, {\sim}S)\, P({\sim}R, {\sim}S \mid C) \end{aligned}$$

and use the fact that $P(R, S \mid C) = P(R \mid C)\, P(S \mid C)$

Diagnostic: $P(C \mid W) = ?$


Bayesian Nets: Local structure

$$P(C, S, R, W, F) = P(C)\, P(S \mid C)\, P(R \mid C)\, P(W \mid R, S)\, P(F \mid R)$$

$P(F \mid C) = ?$

$$P(X_1, \ldots, X_d) = \prod_{i=1}^{d} P(X_i \mid \text{parents}(X_i))$$


Bayesian Networks: Inference

$$P(C, S, R, W, F) = P(C)\, P(S \mid C)\, P(R \mid C)\, P(W \mid R, S)\, P(F \mid R)$$

$$P(C, F) = \sum_S \sum_R \sum_W P(C, S, R, W, F)$$

$$P(F \mid C) = P(C, F) / P(C)$$

Not efficient!

Belief propagation (Pearl, 1988)

Junction trees (Lauritzen and Spiegelhalter, 1988)
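To see why summing the full joint is wasteful, the sketch below computes P(F|C) twice: once by enumerating S, R, and W in the factorized joint, and once using only the local structure, P(F|C) = sum over R of P(F|R) P(R|C). All conditional probability values here are hypothetical placeholders, not taken from the slides, and the helper names are assumptions.

```python
from itertools import product

# HYPOTHETICAL conditional probability tables (P of True given the parents).
P_C = 0.5
P_S_GIVEN_C = {True: 0.1, False: 0.5}
P_R_GIVEN_C = {True: 0.8, False: 0.1}
P_W_GIVEN_RS = {(True, True): 0.95, (False, True): 0.90,
                (True, False): 0.90, (False, False): 0.10}
P_F_GIVEN_R = {True: 0.7, False: 0.1}


def bern(p, value):
    """Probability that a binary variable with P(True) = p takes `value`."""
    return p if value else 1.0 - p


def joint(c, s, r, w, f):
    """P(C,S,R,W,F) = P(C) P(S|C) P(R|C) P(W|R,S) P(F|R)."""
    return (bern(P_C, c) * bern(P_S_GIVEN_C[c], s) * bern(P_R_GIVEN_C[c], r)
            * bern(P_W_GIVEN_RS[(r, s)], w) * bern(P_F_GIVEN_R[r], f))


# Brute force: P(C, F) = sum over S, R, W of the joint, then divide by P(C).
p_cf = sum(joint(True, s, r, w, True)
           for s, r, w in product([True, False], repeat=3))
p_f_given_c = p_cf / P_C

# Using local structure: only R lies on the path from C to F.
shortcut = sum(P_F_GIVEN_R[r] * bern(P_R_GIVEN_C[True], r)
               for r in (True, False))
print(p_f_given_c, shortcut)
```

Both routes agree, but the enumeration touches eight terms for this tiny network and grows exponentially with the number of variables, which is the motivation for belief propagation and junction trees.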


Bayesian Networks:
Classification

diagnostic: $P(C \mid \mathbf{x})$

Bayes’ rule inverts the arc:

$$P(C \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C)\, P(C)}{p(\mathbf{x})}$$


Naive Bayes’ Classifier

Given $C$, $x_j$ are independent:

$$p(\mathbf{x} \mid C) = p(x_1 \mid C)\, p(x_2 \mid C) \cdots p(x_d \mid C)$$
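The factorization can be sketched for binary features, where each p(x_j|C) is a Bernoulli probability. The per-class parameters, the prior, and the helper names `naive_likelihood` and `posterior_c1` below are all hypothetical, and the sketch assumes two classes.

```python
import math


def naive_likelihood(x, theta):
    """p(x | C) = prod_j p(x_j | C) for binary x_j with P(x_j = 1 | C) = theta[j]."""
    return math.prod(t if x_j else 1.0 - t for x_j, t in zip(x, theta))


def posterior_c1(x, theta_c1, theta_c0, prior_c1):
    """P(C=1 | x) by Bayes' rule with the naive Bayes likelihoods."""
    joint_c1 = naive_likelihood(x, theta_c1) * prior_c1
    joint_c0 = naive_likelihood(x, theta_c0) * (1.0 - prior_c1)
    return joint_c1 / (joint_c1 + joint_c0)


# Hypothetical parameters for d = 2 binary features.
theta_c1 = [0.8, 0.6]  # P(x_j = 1 | C = 1)
theta_c0 = [0.3, 0.2]  # P(x_j = 1 | C = 0)

p = posterior_c1([1, 1], theta_c1, theta_c0, prior_c1=0.5)
print(p)
```

For x = [1, 1] the class likelihoods are 0.48 and 0.06, so with equal priors the posterior of class 1 is 0.48 / 0.54, roughly 0.89.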


Influence Diagrams

[Figure: influence diagram containing chance nodes, decision nodes, and utility nodes]


Association Rules

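Association rule: $X \rightarrow Y$

Support ($X \rightarrow Y$):

$$P(X, Y) = \frac{\#\{\text{customers who bought } X \text{ and } Y\}}{\#\{\text{customers}\}}$$

Confidence ($X \rightarrow Y$):

$$P(Y \mid X) = \frac{P(X, Y)}{P(X)} = \frac{\#\{\text{customers who bought } X \text{ and } Y\}}{\#\{\text{customers who bought } X\}}$$

Apriori algorithm (Agrawal et al., 1996)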
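Support and confidence are simple counts over transactions, as the sketch below illustrates. It is not the Apriori algorithm, only the two measures for a single rule; the basket data and the function names `support` and `confidence` are made up for the example.

```python
def support(baskets, x, y):
    """P(X, Y): fraction of customers who bought both x and y."""
    both = sum(1 for b in baskets if x in b and y in b)
    return both / len(baskets)


def confidence(baskets, x, y):
    """P(Y | X): among customers who bought x, the fraction who also bought y."""
    bought_x = sum(1 for b in baskets if x in b)
    both = sum(1 for b in baskets if x in b and y in b)
    return both / bought_x


# Hypothetical transactions, one set of items per customer.
baskets = [
    {"milk", "bread"},
    {"milk"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

print(support(baskets, "milk", "bread"))     # 2 of 4 customers
print(confidence(baskets, "milk", "bread"))  # 2 of the 3 milk buyers
```

Note that `confidence` divides by the number of customers who bought x, so it is undefined when nobody did; a fuller version would need to handle that case.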