Bayesian Learning
• Provides practical learning algorithms
  – Naïve Bayes learning
  – Bayesian belief network learning
  – Combining prior knowledge (prior probabilities) with observed data
• Provides foundations for machine learning
  – Evaluating learning algorithms
  – Guiding the design of new algorithms
  – Learning from models: meta-learning
Bayesian Classification: Why?
• Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
• Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
• Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities.
• Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.
Basic Formulas for Probabilities
• Product Rule: probability P(A ∧ B) of a conjunction of two events A and B:
$$P(A \wedge B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$
• Sum Rule: probability of a disjunction of two events A and B:
$$P(A \vee B) = P(A) + P(B) - P(A \wedge B)$$
• Theorem of Total Probability: if events A_1, …, A_n are mutually exclusive with $\sum_{i=1}^{n} P(A_i) = 1$, then
$$P(B) = \sum_{i=1}^{n} P(B \mid A_i)\,P(A_i)$$
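The three rules can be checked numerically on a small joint distribution. The sketch below is only an illustration: the events (weather and a late train) and all of the numbers are hypothetical and do not come from the slides.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(A_i, B): A_i is the weather, B is "train is late".
# The numbers are illustrative only; they sum to 1.
joint = {
    ("sun", True): F(1, 10), ("sun", False): F(4, 10),
    ("rain", True): F(2, 10), ("rain", False): F(1, 10),
    ("snow", True): F(3, 20), ("snow", False): F(1, 20),
}

def p_a(a):
    """Marginal P(A = a)."""
    return sum(p for (ai, _), p in joint.items() if ai == a)

def p_b(b=True):
    """Marginal P(B = b)."""
    return sum(p for (_, bi), p in joint.items() if bi == b)

def p_b_given_a(a, b=True):
    """Conditional P(B = b | A = a)."""
    return joint[(a, b)] / p_a(a)

# Product rule: P(A and B) = P(B | A) P(A)
assert joint[("rain", True)] == p_b_given_a("rain") * p_a("rain")

# Sum rule: P(A or B) = P(A) + P(B) - P(A and B)
p_rain_or_late = sum(p for (a, b), p in joint.items() if a == "rain" or b)
assert p_rain_or_late == p_a("rain") + p_b() - joint[("rain", True)]

# Total probability: P(B) = sum_i P(B | A_i) P(A_i)
total = sum(p_b_given_a(a) * p_a(a) for a in ("sun", "rain", "snow"))
assert total == p_b()
print(p_b(), p_rain_or_late)
```

Exact fractions are used so that the identities hold with equality rather than up to rounding.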
Basic Approach
Bayes Rule:
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h | D) = probability of h given D (posterior probability)
• P(D | h) = probability of D given h (likelihood of D given h)
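The Goal of Bayesian Learning: find the most probable hypothesis given the training data (the Maximum A Posteriori hypothesis $h_{MAP}$):
$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$
P(D) is dropped in the last step because it does not depend on h.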
An Example
Does the patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test
returns a correct positive result in only 98% of the cases in which the
disease is actually present, and a correct negative result in only 97% of
the cases in which the disease is not present. Furthermore, .008 of the
entire population have this cancer.
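$$P(\text{cancer}) = .008, \qquad P(\neg\text{cancer}) = .992$$
$$P(+ \mid \text{cancer}) = .98, \qquad P(- \mid \text{cancer}) = .02$$
$$P(+ \mid \neg\text{cancer}) = .03, \qquad P(- \mid \neg\text{cancer}) = .97$$
$$P(\text{cancer} \mid +) = \frac{P(+ \mid \text{cancer})\,P(\text{cancer})}{P(+)}, \qquad P(\neg\text{cancer} \mid +) = \frac{P(+ \mid \neg\text{cancer})\,P(\neg\text{cancer})}{P(+)}$$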
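The posterior for this example can be worked out directly from the stated numbers. The following is a minimal sketch; the variable names are chosen for illustration, and P(+) is obtained from the theorem of total probability.

```python
# Probabilities stated in the example.
p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_no_cancer = 0.03

# Unnormalised posteriors P(+ | h) P(h) for the two hypotheses.
score_cancer = p_pos_given_cancer * p_cancer
score_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)

# P(+) by total probability, then Bayes rule.
p_pos = score_cancer + score_no_cancer
posterior_cancer = score_cancer / p_pos
posterior_no_cancer = score_no_cancer / p_pos

# The MAP hypothesis is the one with the larger posterior.
h_map = "cancer" if posterior_cancer > posterior_no_cancer else "no cancer"
print(f"P(cancer | +) = {posterior_cancer:.3f}, MAP hypothesis: {h_map}")
```

Even after a positive test the posterior probability of cancer stays at roughly 0.21, because the prior of 0.008 is so small; the MAP hypothesis is therefore "no cancer".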
MAP Learner
For each hypothesis h in H, calculate the posterior probability
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
Output the hypothesis $h_{MAP}$ with the highest posterior probability
$$h_{MAP} = \arg\max_{h \in H} P(h \mid D)$$
Comments:
• Computationally intensive
• Provides a standard for judging the performance of learning algorithms
• Choosing P(h) and P(D | h) reflects our prior knowledge about the learning task
Bayes Optimal Classifier
• Question: given a new instance x, what is its most probable classification?
• h_MAP(x) is not necessarily the most probable classification!
Example: let P(h1 | D) = .4, P(h2 | D) = .3, P(h3 | D) = .3.
Given new data x, we have h1(x) = +, h2(x) = −, h3(x) = −.
What is the most probable classification of x?
Bayes optimal classification:
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$
Example:
P(h1 | D) = .4,  P(− | h1) = 0,  P(+ | h1) = 1
P(h2 | D) = .3,  P(− | h2) = 1,  P(+ | h2) = 0
P(h3 | D) = .3,  P(− | h3) = 1,  P(+ | h3) = 0
$$\sum_{h_i \in H} P(+ \mid h_i)\,P(h_i \mid D) = .4$$
$$\sum_{h_i \in H} P(- \mid h_i)\,P(h_i \mid D) = .6$$
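The example can be reproduced in a few lines. This is a sketch under the assumption that each hypothesis predicts a single label with certainty, as in the numbers above; the function name `bayes_optimal` is illustrative.

```python
# Posterior P(h_i | D) for each hypothesis, as given in the example.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# P(v | h_i): each hypothesis predicts one label with certainty.
prediction = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(posterior, prediction, labels=("+", "-")):
    """Return the label maximising sum_i P(v | h_i) P(h_i | D), plus all scores."""
    scores = {
        v: sum(prediction[h][v] * posterior[h] for h in posterior)
        for v in labels
    }
    return max(scores, key=scores.get), scores

label, scores = bayes_optimal(posterior, prediction)
h_map = max(posterior, key=posterior.get)  # single most probable hypothesis

print("MAP hypothesis:", h_map, "predicts", max(prediction[h_map], key=prediction[h_map].get))
print("Bayes optimal label:", label, scores)
```

The MAP hypothesis h1 predicts +, while the weighted vote over all hypotheses favours − with total weight .6.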
Naïve Bayes Learner
Assume a target function f: X → V, where each instance x is described by attributes <a1, a2, …, an>. The most probable value of f(x) is:
$$v = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)$$
Naïve Bayes assumption:
$$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_{i} P(a_i \mid v_j)$$
(attributes are conditionally independent)
Bayesian classification
• The classification problem may be formalized using a-posteriori probabilities:
• P(C | X) = probability that the sample tuple X = <x_1, …, x_k> is of class C.
• E.g. P(class = N | outlook = sunny, windy = true, …)
• Idea: assign to sample X the class label C such that P(C | X) is maximal
Estimating a-posteriori probabilities
• Bayes theorem: P(C | X) = P(X | C)·P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative frequency of class C samples
• C such that P(C | X) is maximum = C such that P(X | C)·P(C) is maximum
• Problem: computing P(X | C) directly is infeasible!
Naïve Bayesian Classification
• Naïve assumption: attribute independence
  P(x_1, …, x_k | C) = P(x_1 | C)·…·P(x_k | C)
• If the i-th attribute is categorical: P(x_i | C) is estimated as the relative frequency of samples having value x_i as the i-th attribute in class C
• If the i-th attribute is continuous: P(x_i | C) is estimated through a Gaussian density function
• Computationally easy in both cases
Naive Bayesian Classifier (II)
• Given a training set, we can compute the probabilities

Outlook      P    N      Humidity  P    N
sunny        2/9  3/5    high      3/9  4/5
overcast     4/9  0      normal    6/9  1/5
rain         3/9  2/5

Temperature  P    N      Windy     P    N
hot          2/9  2/5    true      3/9  3/5
mild         4/9  2/5    false     6/9  2/5
cool         3/9  1/5
Play-tennis example: estimating P(x_i | C)
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
outlook
P(sunny | p) = 2/9       P(sunny | n) = 3/5
P(overcast | p) = 4/9    P(overcast | n) = 0
P(rain | p) = 3/9        P(rain | n) = 2/5
temperature
P(hot | p) = 2/9         P(hot | n) = 2/5
P(mild | p) = 4/9        P(mild | n) = 2/5
P(cool | p) = 3/9        P(cool | n) = 1/5
humidity
P(high | p) = 3/9        P(high | n) = 4/5
P(normal | p) = 6/9      P(normal | n) = 1/5
windy
P(true | p) = 3/9        P(true | n) = 3/5
P(false | p) = 6/9       P(false | n) = 2/5
P(p) = 9/14
P(n) = 5/14
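These relative frequencies can be recomputed from the 14 training samples. The sketch below assumes the play-tennis rows exactly as listed in the training table; the helper names `prior` and `conditional` are illustrative.

```python
from collections import Counter
from fractions import Fraction

ATTRIBUTES = ("outlook", "temperature", "humidity", "windy")

# The 14 play-tennis samples; the last field of each row is the class (P or N).
DATA = [line.split() for line in """\
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
""".splitlines()]

class_counts = Counter(row[-1] for row in DATA)
value_counts = Counter(
    (attr, row[i], row[-1]) for row in DATA for i, attr in enumerate(ATTRIBUTES)
)

def prior(c):
    """P(C = c) as the relative frequency of class c."""
    return Fraction(class_counts[c], len(DATA))

def conditional(attr, value, c):
    """P(attr = value | C = c) as a relative frequency within class c."""
    return Fraction(value_counts[(attr, value, c)], class_counts[c])

print(prior("P"), conditional("outlook", "sunny", "P"), conditional("humidity", "normal", "N"))
```

Note that `Fraction` reduces its results, so 3/9 is reported as 1/3.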
Example : Naïve Bayes
Predict whether tennis is played on a day with the conditions <sunny, cool, high, strong>, i.e. find P(v | o = sunny, t = cool, h = high, w = strong), using the following training data:
Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
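we have:
$$p(y)\,p(\text{sun} \mid y)\,p(\text{cool} \mid y)\,p(\text{high} \mid y)\,p(\text{strong} \mid y) = .005$$
$$p(n)\,p(\text{sun} \mid n)\,p(\text{cool} \mid n)\,p(\text{high} \mid n)\,p(\text{strong} \mid n) = .021$$
where, for example,
$$p(\text{strong} \mid y) = \frac{\#\,\text{days of playing tennis with strong wind}}{\#\,\text{days of playing tennis}}$$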
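The two products can be verified with exact fractions. In this sketch the relative frequencies are read off the 14-day training table; the final normalisation step is an addition for illustration and is not shown on the slide.

```python
from fractions import Fraction as F
from math import prod

# For each target value: the prior, then P(sunny | v), P(cool | v),
# P(high | v), P(strong | v), counted from the 14-day training table.
factors = {
    "yes": [F(9, 14), F(2, 9), F(3, 9), F(3, 9), F(3, 9)],
    "no":  [F(5, 14), F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
}

# Naive Bayes score: prior times the product of per-attribute conditionals.
scores = {v: prod(fs) for v, fs in factors.items()}
prediction = max(scores, key=scores.get)

# Normalising the two scores gives the conditional probability of "no"
# under the naive Bayes model.
p_no = scores["no"] / sum(scores.values())

print({v: round(float(s), 4) for v, s in scores.items()})
print(prediction, round(float(p_no), 3))
```

The scores round to .005 and .021, so the predicted value is "no", with a normalised probability of about 0.795.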
The independence hypothesis…
• … makes computation possible
• … yields optimal classifiers when satisfied
• … but is seldom satisfied in practice, as attributes (variables) are often correlated.
• Attempts to overcome this limitation:
  – Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
  – Decision trees, which reason on one attribute at a time, considering the most important attributes first
Naïve Bayes Algorithm
Naïve_Bayes_Learn(examples)
  for each target value v_j
    estimate P(v_j)
    for each attribute value a_i of each attribute a
      estimate P(a_i | v_j)

Classify_New_Instance(x)
$$v = \arg\max_{v_j \in V} P(v_j) \prod_{a_i \in x} P(a_i \mid v_j)$$
Typical estimation of P(a_i | v_j) (the m-estimate):
$$P(a_i \mid v_j) = \frac{n_c + m\,p}{n + m}$$
where
• n: number of examples with v = v_j
• n_c: number of examples with v = v_j and a = a_i
• p: prior estimate for P(a_i | v_j)
• m: weight given to the prior
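The learner, the classifier and the m-estimate fit together as in the sketch below. Several choices here are assumptions made for illustration, not part of the slides: the prior estimate p is taken to be uniform over the attribute values seen in training, m defaults to 1, and the six training examples are hypothetical.

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples, m=1.0):
    """examples: list of (attribute_tuple, target). Returns (classify, p_attr)."""
    target_counts = Counter(v for _, v in examples)   # n for each v_j
    pair_counts = Counter()                           # (position, value, target) -> n_c
    values = defaultdict(set)                         # position -> observed values
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            pair_counts[(i, a, v)] += 1
            values[i].add(a)

    def p_attr(i, a, v):
        """m-estimate (n_c + m p) / (n + m) of P(a_i | v_j)."""
        p = 1.0 / len(values[i])   # assumption: uniform prior over observed values
        return (pair_counts[(i, a, v)] + m * p) / (target_counts[v] + m)

    def classify(x):
        def score(v):
            s = target_counts[v] / len(examples)      # P(v_j)
            for i, a in enumerate(x):
                s *= p_attr(i, a, v)                  # times each P(a_i | v_j)
            return s
        return max(target_counts, key=score)

    return classify, p_attr

# Hypothetical training data: (outlook, wind) -> play.
examples = [
    (("sunny", "strong"), "no"),
    (("sunny", "weak"), "no"),
    (("overcast", "weak"), "yes"),
    (("rain", "weak"), "yes"),
    (("rain", "strong"), "no"),
    (("overcast", "strong"), "yes"),
]
classify, p_attr = naive_bayes_learn(examples, m=3.0)
print(classify(("overcast", "weak")))
print(p_attr(0, "overcast", "no"))  # non-zero although the count n_c is 0
```

The point of the m-estimate shows in the last line: a value never observed with a class still gets a non-zero probability, so a single zero count cannot wipe out the whole product.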
Bayesian Belief Networks
• The Naïve Bayes assumption of conditional independence is too restrictive
• But the problem is intractable without some such assumptions
• A Bayesian belief network (Bayesian net) describes conditional independence among subsets of variables (attributes), combining prior knowledge about dependencies among variables with observed training data.
• Bayesian net
  – Node = variable
  – Arc = dependency
  – DAG, with the direction of an arc representing causality
Bayesian Networks: Multi-variables with Dependency
• Bayesian net
  – Node = variable; each variable has a finite set of mutually exclusive states
  – Arc = dependency
  – DAG, with the direction of an arc representing causality
  – To each variable A with parents B1, …, Bn there is attached a conditional probability table P(A | B1, …, Bn)
Bayesian Belief Networks
• Age, Occupation and Income determine whether the customer will buy this product.
• Given that the customer buys the product, whether there is interest in insurance is independent of Age, Occupation and Income.
• P(Age, Occ, Inc, Buy, Ins) = P(Age) P(Occ) P(Inc) P(Buy | Age, Occ, Inc) P(Ins | Buy)
Current state of the art: given structure and probabilities, existing algorithms can handle inference with categorical values and a limited representation of numerical values.
[Figure: network with nodes Age, Occ and Income pointing to Buy X, which points to Interested in Insurance]
General Product Rule
$$P(x_1, \ldots, x_n \mid M) = \prod_{i=1}^{n} P(x_i \mid Pa_i, M), \qquad Pa_i = parent(x_i)$$
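Applied to the Age / Occ / Income / Buy / Insurance network from the previous slide, the product rule gives the joint probability as one factor per node. The sketch below treats every variable as binary and uses made-up conditional probability tables; none of the numbers come from the slides.

```python
from itertools import product

# Hypothetical CPTs over binary variables (illustrative numbers only).
p_age = {True: 0.3, False: 0.7}
p_occ = {True: 0.6, False: 0.4}
p_inc = {True: 0.5, False: 0.5}
# P(Buy = True | Age, Occ, Inc), keyed by the states of the three parents.
p_buy_true = {pa: 0.1 + 0.2 * sum(pa) for pa in product([True, False], repeat=3)}
# P(Ins = True | Buy), keyed by the state of the single parent.
p_ins_true = {True: 0.8, False: 0.1}

def bern(p_true, value):
    """Probability of a binary value given P(value = True)."""
    return p_true if value else 1.0 - p_true

def joint(age, occ, inc, buy, ins):
    """General product rule: one factor P(x_i | parents(x_i)) per node."""
    return (p_age[age] * p_occ[occ] * p_inc[inc]
            * bern(p_buy_true[(age, occ, inc)], buy)
            * bern(p_ins_true[buy], ins))

states = list(product([True, False], repeat=5))
total = sum(joint(*s) for s in states)

def p_ins_given(buy, age):
    """P(Ins = True | Buy = buy, Age = age), by summing the joint."""
    num = sum(joint(a, o, i, b, n) for a, o, i, b, n in states
              if n and b == buy and a == age)
    den = sum(joint(a, o, i, b, n) for a, o, i, b, n in states
              if b == buy and a == age)
    return num / den

print(round(total, 10))
print(round(p_ins_given(True, True), 10), round(p_ins_given(True, False), 10))
```

The joint sums to 1, and once Buy is known the probability of interest in insurance no longer changes with Age, which is the conditional independence the network structure encodes.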
Nodes as Functions
• input: the state values of the parents
• output: a distribution over the node's own value

P(X | A, B), where X takes the values l, m, h:

X    ab    ~ab   a~b   ~a~b
l    0.1   0.7   0.4   0.2
m    0.3   0.2   0.4   0.5
h    0.6   0.1   0.2   0.3

For example, P(X | A=a, B=b) = (l: 0.1, m: 0.3, h: 0.6).
A node in a BN is a conditional distribution function.
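Seen as code, such a node is just a lookup from parent states to a distribution. This minimal sketch uses the values from the slide's table; the pairing of each column of numbers with a parent configuration is an assumption based on the order in which they appear.

```python
# Conditional probability table for node X with parents A and B.
# Assumption: the columns appear in the order ab, ~ab, a~b, ~a~b.
CPT_X = {
    ("a", "b"):   {"l": 0.1, "m": 0.3, "h": 0.6},
    ("~a", "b"):  {"l": 0.7, "m": 0.2, "h": 0.1},
    ("a", "~b"):  {"l": 0.4, "m": 0.4, "h": 0.2},
    ("~a", "~b"): {"l": 0.2, "m": 0.5, "h": 0.3},
}

def node_x(a_state, b_state):
    """Input: the parents' states. Output: a distribution over X's values."""
    return CPT_X[(a_state, b_state)]

print(node_x("a", "b"))
# Every output is a proper distribution: its entries sum to 1.
print(all(abs(sum(d.values()) - 1.0) < 1e-9 for d in CPT_X.values()))
```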
Special Case : Naïve Bayes
[Figure: node h with an arc to each of e1, e2, …, en]
P(e1, e2, …, en, h) = P(h) P(e1 | h) … P(en | h)
Inference in Bayesian Networks
[Figure: network over the nodes Age, Income, House Owner, EU, Voting Pattern, Newspaper Preference, Living Location]
How likely are elderly rich people to buy the Sun?
P(paper = Sun | Age > 60, Income > 60k)
How likely are elderly rich people who voted Labour to buy the Daily Mail?
P(paper = DM | Age > 60, Income > 60k, v = labour)
Bayesian Learning
Data cases over the variables B (Burglary), E (Earthquake), A (Alarm), C (Call), N (Newscast):

B   E   A   C   N
~b  e   a   c   n
b   ~e  ~a  ~c  n
…

Input: fully or partially observable data cases
Output: parameters AND also structure
Learning methods:
• EM (Expectation Maximisation)
  – use the current approximation of the parameters to estimate the filled-in data
  – use the filled-in data to update the parameters (ML)
• Gradient ascent training
• Gibbs sampling (MCMC)