# Bayesian classifiers

AI and Robotics

Nov 7, 2013


## Bayesian Classification: Why?

- **Probabilistic learning**: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
- **Incremental**: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- **Probabilistic prediction**: predicts multiple hypotheses, weighted by their probabilities.
- **Standard**: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.

## Bayes Theorem

Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:

P(h|D) = P(D|h) · P(h) / P(D)

The MAP (maximum a posteriori) hypothesis:

h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) · P(h)

Practical difficulty: requires initial knowledge of many probabilities, and incurs significant computational cost.
## Naïve Bayesian Classification

- If the i-th attribute is **categorical**: P(d_i|C) is estimated as the relative frequency of samples having value d_i for the i-th attribute in class C.
- If the i-th attribute is **continuous**: P(d_i|C) is estimated through a Gaussian density function.
- Computationally easy in both cases.
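Both estimators can be sketched in a few lines; function names here are illustrative, not from the slides:

```python
import math

# Categorical: P(d_i|C) as the relative frequency of value d_i
# among the samples of class C.
def categorical_prob(values_in_class, value):
    return values_in_class.count(value) / len(values_in_class)

# Continuous: P(d_i|C) via a Gaussian density fitted to the class samples
# (mean and variance estimated from the data of class C).
def gaussian_prob(values_in_class, value):
    n = len(values_in_class)
    mean = sum(values_in_class) / n
    var = sum((v - mean) ** 2 for v in values_in_class) / n
    return math.exp(-(value - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(categorical_prob(["high", "high", "normal"], "high"))  # 2/3
```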

## Play-tennis example: estimating P(x_i|C)

| Outlook  | Temperature | Humidity | Windy | Class |
|----------|-------------|----------|-------|-------|
| sunny    | hot         | high     | false | N     |
| sunny    | hot         | high     | true  | N     |
| overcast | hot         | high     | false | P     |
| rain     | mild        | high     | false | P     |
| rain     | cool        | normal   | false | P     |
| rain     | cool        | normal   | true  | N     |
| overcast | cool        | normal   | true  | P     |
| sunny    | mild        | high     | false | N     |
| sunny    | cool        | normal   | false | P     |
| rain     | mild        | normal   | false | P     |
| sunny    | mild        | normal   | true  | P     |
| overcast | mild        | high     | true  | P     |
| overcast | hot         | normal   | false | P     |
| rain     | mild        | high     | true  | N     |
Class priors: P(p) = 9/14, P(n) = 5/14

**outlook**
- P(sunny|p) = 2/9, P(sunny|n) = 3/5
- P(overcast|p) = 4/9, P(overcast|n) = 0
- P(rain|p) = 3/9, P(rain|n) = 2/5

**temperature**
- P(hot|p) = 2/9, P(hot|n) = 2/5
- P(mild|p) = 4/9, P(mild|n) = 2/5
- P(cool|p) = 3/9, P(cool|n) = 1/5

**humidity**
- P(high|p) = 3/9, P(high|n) = 4/5
- P(normal|p) = 6/9, P(normal|n) = 2/5

**windy**
- P(true|p) = 3/9, P(true|n) = 3/5
- P(false|p) = 6/9, P(false|n) = 2/5

## Play-tennis example: classifying X

An unseen sample X = <rain, hot, high, false>.

P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582

P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286

Sample X is classified in class n (don't play).
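The arithmetic above can be reproduced directly from the estimated probabilities:

```python
# Classify X = <rain, hot, high, false> with the naive Bayes rule,
# using the conditional probabilities estimated from the play-tennis table.
cond = {
    "p": {"rain": 3/9, "hot": 2/9, "high": 3/9, "false": 6/9},
    "n": {"rain": 2/5, "hot": 2/5, "high": 4/5, "false": 2/5},
}
prior = {"p": 9/14, "n": 5/14}

x = ["rain", "hot", "high", "false"]
scores = {}
for c in ("p", "n"):
    score = prior[c]
    for value in x:
        score *= cond[c][value]  # naive independence assumption
    scores[c] = score

print(round(scores["p"], 6), round(scores["n"], 6))  # 0.010582 0.018286
predicted = max(scores, key=scores.get)  # "n": don't play
```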

## The independence hypothesis…

- … makes computation possible
- … yields optimal classifiers when satisfied
- … but is seldom satisfied in practice, as attributes (variables) are often correlated.

Attempts to overcome this limitation: **Bayesian networks**, which combine Bayesian reasoning with causal relationships between attributes.

## Bayesian Belief Networks (I)

[Figure: a belief network over the variables FamilyH, Age, Diabetes, Mass, Insulin, and Glucose.]

The conditional probability table for the variable Mass:

|    | (FH, A) | (FH, ~A) | (~FH, A) | (~FH, ~A) |
|----|---------|----------|----------|-----------|
| M  | 0.8     | 0.5      | 0.7      | 0.1       |
| ~M | 0.2     | 0.5      | 0.3      | 0.9       |

From Jiawei Han's slides
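A minimal sketch of this CPT as a lookup structure (the boolean encoding of the parents FH and A is an assumption for illustration):

```python
# CPT for Mass: P(M | FamilyH, Age), indexed by the parents' truth values.
# Keys are (family_history, age); values are P(M = true | parents).
cpt_mass = {
    (True, True): 0.8,    # (FH, A)
    (True, False): 0.5,   # (FH, ~A)
    (False, True): 0.7,   # (~FH, A)
    (False, False): 0.1,  # (~FH, ~A)
}

def p_mass(m, family_history, age):
    """Return P(Mass = m | FamilyH = family_history, Age = age)."""
    p_true = cpt_mass[(family_history, age)]
    return p_true if m else 1.0 - p_true

print(p_mass(False, True, False))  # 0.5, since each column sums to 1
```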

## Applying Bayesian nets

When all but one of the variables are known:

P(D | A, F, M, G, I)

From Jiawei Han's slides

## Bayesian belief network

Find the joint probability over a set of variables, making use of conditional independence whenever it is known.

[Figure: a belief network over variables a, b, c, d, e with a conditional probability table; variable e is independent of d given b.]

## Bayesian Belief Networks (II)

- A Bayesian belief network allows a **subset** of the variables to be conditionally independent.
- It is a graphical model of causal relationships.
- Several cases of learning Bayesian belief networks:
  - Given both the network structure and all the variables: easy.
  - Given the network structure but only some of the variables: use gradient descent / EM algorithms.
  - When the network structure is not known in advance: learning the structure of the network is harder.

From Jiawei Han's slides

## The k-Nearest Neighbor Algorithm

- All instances correspond to points in the n-D space.
- The nearest neighbors are defined in terms of Euclidean distance.
- The target function can be discrete- or real-valued.
- For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to x_q.
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

[Figure: a query point x_q among positive (+) and negative (−) training examples, with the decision surface induced by 1-NN.]

From Jiawei Han's slides
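The discrete-valued k-NN rule can be sketched as follows (the function name and toy data are illustrative):

```python
import math
from collections import Counter

# Discrete-valued k-NN: return the most common class among the k
# training examples nearest (in Euclidean distance) to the query x_q.
def knn_classify(training, x_q, k):
    """training: list of (point, label) pairs; point is a tuple of floats."""
    by_distance = sorted(training, key=lambda pl: math.dist(pl[0], x_q))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "-"), ((1, 0), "-"), ((5, 5), "+"), ((6, 5), "+"), ((5, 6), "+")]
print(knn_classify(train, (4, 4), 3))  # "+": the 3 nearest points are all "+"
```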

## Discussion on the k-NN Algorithm

- The k-NN algorithm for continuous-valued target functions: calculate the mean value of the k nearest neighbors.
- Distance-weighted nearest neighbor algorithm: weight the contribution of each of the k neighbors according to their distance to the query point x_q, giving greater weight to closer neighbors:

  w ≡ 1 / d(x_q, x_i)²

- Similarly for real-valued target functions.
- Robust to noisy data by averaging the k nearest neighbors.
- Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes. To overcome it, stretch the axes or eliminate the least relevant attributes.
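The distance-weighted variant for a real-valued target, using the weight w = 1/d(x_q, x_i)², can be sketched as (names and data are illustrative):

```python
import math

# Distance-weighted k-NN for a real-valued target: each of the k nearest
# neighbors contributes its value with weight w = 1 / d(x_q, x_i)^2.
def weighted_knn_predict(training, x_q, k):
    """training: list of (point, value) pairs; returns a weighted mean."""
    nearest = sorted(training, key=lambda pv: math.dist(pv[0], x_q))[:k]
    num = den = 0.0
    for point, value in nearest:
        d = math.dist(point, x_q)
        if d == 0:            # exact match: its weight would be infinite
            return value
        w = 1.0 / d ** 2
        num += w * value
        den += w
    return num / den

train = [((0.0,), 1.0), ((2.0,), 3.0), ((10.0,), 9.0)]
print(weighted_knn_predict(train, (1.0,), 2))  # 2.0: both neighbors at d = 1
```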