Bayesian classifiers
Bayesian Classification: Why?
Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities
Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayes' Theorem
Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem
MAP (maximum a posteriori) hypothesis
Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost
P(h|D) = P(D|h) · P(h) / P(D)

h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) · P(h)
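The MAP rule above picks the hypothesis maximizing P(D|h)·P(h); P(D) cancels in the argmax. A minimal sketch (the priors and likelihoods here are made-up illustrative numbers, not from the slides):

```python
# Hypothetical priors P(h) and likelihoods P(D|h) for two hypotheses.
priors = {"h1": 0.3, "h2": 0.7}
likelihoods = {"h1": 0.9, "h2": 0.4}

# Unnormalized posteriors P(D|h) * P(h); the evidence P(D) cancels in argmax.
posteriors = {h: likelihoods[h] * priors[h] for h in priors}
h_map = max(posteriors, key=posteriors.get)  # the MAP hypothesis
```

Here h2 wins (0.7·0.4 = 0.28 > 0.3·0.9 = 0.27) even though h1 fits the data better, because the prior favors h2.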
Naïve Bayesian Classification
If the i-th attribute is categorical:
P(d_i|C) is estimated as the relative frequency of samples having value d_i as the i-th attribute in class C
If the i-th attribute is continuous:
P(d_i|C) is estimated through a Gaussian density function
Computationally easy in both cases
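Both estimators can be sketched in a few lines (function names are my own, for illustration; the Gaussian estimator assumes the class column has nonzero variance):

```python
import math
from collections import Counter

def p_categorical(value, column):
    """P(d_i | C): relative frequency of `value` in the class's attribute column."""
    return Counter(column)[value] / len(column)

def p_gaussian(value, column):
    """P(d_i | C): Gaussian density with mean/variance fit to the class's column."""
    mu = sum(column) / len(column)
    var = sum((x - mu) ** 2 for x in column) / len(column)  # assumes var > 0
    return math.exp(-(value - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```

For example, p_categorical("high", ["high", "high", "normal"]) gives 2/3, the relative-frequency estimate described above.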
Play-tennis example: estimating P(x_i|C)

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
P(p) = 9/14, P(n) = 5/14

outlook:
P(sunny|p) = 2/9     P(sunny|n) = 3/5
P(overcast|p) = 4/9  P(overcast|n) = 0
P(rain|p) = 3/9      P(rain|n) = 2/5

temperature:
P(hot|p) = 2/9   P(hot|n) = 2/5
P(mild|p) = 4/9  P(mild|n) = 2/5
P(cool|p) = 3/9  P(cool|n) = 1/5

humidity:
P(high|p) = 3/9    P(high|n) = 4/5
P(normal|p) = 6/9  P(normal|n) = 2/5

windy:
P(true|p) = 3/9   P(true|n) = 3/5
P(false|p) = 6/9  P(false|n) = 2/5
Play-tennis example: classifying X

An unseen sample X = <rain, hot, high, false>

P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
            = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582

P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
            = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286

Sample X is classified in class n (don't play)
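The arithmetic of this classification can be reproduced directly:

```python
# Naive Bayes scores for X = <rain, hot, high, false> from the play-tennis table.
p_x_p = (3/9) * (2/9) * (3/9) * (6/9) * (9/14)  # P(X|p) * P(p)
p_x_n = (2/5) * (2/5) * (4/5) * (2/5) * (5/14)  # P(X|n) * P(n)

# The class with the larger unnormalized posterior wins.
label = "p" if p_x_p > p_x_n else "n"  # "n": don't play
```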
The independence hypothesis…
… makes computation possible
… yields optimal classifiers when satisfied
… but is seldom satisfied in practice, as attributes (variables) are often correlated.
Attempts to overcome this limitation:
Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
Bayesian Belief Networks (I)

[Figure: a belief network over the variables Age, FamilyH, Mass, Glucose, Insulin, and Diabetes; the table below shows that Mass (M) has parents FamilyH (FH) and Age (A).]

The conditional probability table for the variable Mass:

      (FH, A)  (FH, ~A)  (~FH, A)  (~FH, ~A)
M     0.8      0.5       0.7       0.1
~M    0.2      0.5       0.3       0.9

From Jiawei Han's slides
Applying Bayesian nets
When all but one variable is known:
P(D|A,F,M,G,I)
From Jiawei Han's slides
Bayesian belief network
Find joint probability over set of variables
making use of conditional independence
whenever known
[Figure: a small belief network over variables a, b, c, d, e with example conditional probability tables. Variable e is independent of d given b.]
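The payoff of such conditional independence is that the joint probability factorizes along the network's edges. A sketch with a hypothetical three-node chain a → b → e (so e is independent of a given b; all numbers are made up for illustration):

```python
# Hypothetical chain a -> b -> e: the joint factorizes as P(a) * P(b|a) * P(e|b).
p_a = 0.6                                # P(a = true)
p_b_given_a = {True: 0.7, False: 0.2}    # P(b = true | a)
p_e_given_b = {True: 0.9, False: 0.1}    # P(e = true | b)

# P(a=T, b=T, e=T): three small tables instead of one 2^3-entry joint table.
joint = p_a * p_b_given_a[True] * p_e_given_b[True]  # 0.6 * 0.7 * 0.9
```

The factored form needs only the per-node conditional tables, which is what makes inference in belief networks tractable when independences hold.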
Bayesian Belief Networks (II)
A Bayesian belief network allows a subset of the variables to be conditionally independent
A graphical model of causal relationships
Several cases of learning Bayesian belief networks:
Given both the network structure and all the variables: easy
Given the network structure but only some variables: use gradient descent / EM algorithms
When the network structure is not known in advance: learning the structure of the network is harder
From Jiawei Han's slides
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space.
The nearest neighbors are defined in terms of Euclidean distance.
The target function could be discrete- or real-valued.
For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to x_q.
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
From Jiawei Han's slides
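The discrete-valued case above can be sketched as a short function (Euclidean distance, majority vote; the training data below is illustrative):

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Return the most common label among the k Euclidean-nearest neighbors.
    `train` is a list of (point, label) pairs; points are numeric tuples."""
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    nearest = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Illustrative data: two "-" points near the origin, three "+" points near (5, 5).
train = [((0, 0), "-"), ((0, 1), "-"), ((5, 5), "+"), ((6, 5), "+"), ((5, 6), "+")]
label = knn_classify(train, (5, 4), k=3)  # all 3 nearest neighbors are "+"
```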
Discussion on the k-NN Algorithm
The k-NN algorithm for continuous-valued target functions:
Calculate the mean value of the k nearest neighbors
Distance-weighted nearest neighbor algorithm:
Weight the contribution of each of the k neighbors according to its distance to the query point x_q, giving greater weight to closer neighbors
Similarly for real-valued target functions
Robust to noisy data by averaging the k nearest neighbors
Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes.
To overcome it, stretch the axes or eliminate the least relevant attributes.
w ≡ 1 / d(x_q, x_i)²
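A sketch of distance-weighted voting with the weight w = 1/d(x_q, x_i)² (function and variable names are my own):

```python
from collections import defaultdict

def weighted_knn(train, query, k):
    """Classify by votes weighted with w = 1 / d(x_q, x_i)^2, so closer
    neighbors count more. `train` is a list of (point, label) pairs."""
    d2 = lambda p: sum((a - b) ** 2 for a, b in zip(p, query))
    nearest = sorted(train, key=lambda pl: d2(pl[0]))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        dist_sq = d2(point)
        if dist_sq == 0:
            return label  # query coincides with a training point
        votes[label] += 1.0 / dist_sq
    return max(votes, key=votes.get)

# Illustrative data: one close "a" neighbor outweighs two distant "b" neighbors.
demo = [((0, 0), "a"), ((4, 0), "b"), ((5, 0), "b")]
result = weighted_knn(demo, (1, 0), k=3)
```

With query (1, 0), the "a" point contributes weight 1/1 = 1, while the two "b" points contribute only 1/9 + 1/16 ≈ 0.17 combined, so "a" wins despite being outnumbered.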