Bayesian classifiers


Bayesian Classification: Why?


Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.

Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.

Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities.

Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.





Bayesian Theorem


Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:

P(h|D) = P(D|h) P(h) / P(D)

MAP (maximum a posteriori) hypothesis:

h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h)

Practical difficulty: requires initial knowledge of many probabilities and significant computational cost.
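As a minimal illustration of the MAP rule (the two hypotheses and their numbers below are made up, not from the slides), one multiplies each hypothesis's likelihood by its prior and takes the argmax; P(D) can be dropped because it is the same for every hypothesis:

# Hypothetical priors P(h) and likelihoods P(D|h) for two hypotheses
priors = {"h1": 0.7, "h2": 0.3}
likelihoods = {"h1": 0.2, "h2": 0.9}

# h_MAP = argmax_h P(D|h) * P(h); the common factor P(D) is dropped
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])
print(h_map)  # -> "h2", since 0.9 * 0.3 = 0.27 > 0.2 * 0.7 = 0.14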




Naïve Bayesian Classification


If the i-th attribute is categorical:

P(d_i|C) is estimated as the relative frequency of samples having value d_i for the i-th attribute in class C.

If the i-th attribute is continuous:

P(d_i|C) is estimated through a Gaussian density function.

Computationally easy in both cases.
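A minimal Python sketch of the two estimators (function names and inputs are illustrative assumptions, not from the slides):

import math
from collections import Counter

def categorical_estimate(class_values, d_i):
    # P(d_i|C): relative frequency of value d_i among the samples of class C
    return Counter(class_values)[d_i] / len(class_values)

def gaussian_estimate(class_values, d_i):
    # P(d_i|C): Gaussian density fitted to class C's values (assumes nonzero variance)
    mu = sum(class_values) / len(class_values)
    var = sum((v - mu) ** 2 for v in class_values) / len(class_values)
    return math.exp(-((d_i - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)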

Play-tennis example: estimating P(x_i|C)

Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N
outlook:      P(sunny|n) = 3/5      P(sunny|p) = 2/9
              P(overcast|n) = 0     P(overcast|p) = 4/9
              P(rain|n) = 2/5       P(rain|p) = 3/9

temperature:  P(hot|n) = 2/5        P(hot|p) = 2/9
              P(mild|n) = 2/5       P(mild|p) = 4/9
              P(cool|n) = 1/5       P(cool|p) = 3/9

humidity:     P(high|n) = 4/5       P(high|p) = 3/9
              P(normal|n) = 2/5     P(normal|p) = 6/9

windy:        P(true|n) = 3/5       P(true|p) = 3/9
              P(false|n) = 2/5      P(false|p) = 6/9

Class priors: P(n) = 5/14           P(p) = 9/14
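The estimates above can be reproduced from the table by counting relative frequencies per class; a short Python sketch (the tuple encoding of the table is an assumption for illustration):

# (outlook, temperature, humidity, windy, class) rows of the play-tennis table
data = [
    ("sunny", "hot", "high", "false", "N"),      ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"),   ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"),    ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"),   ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"),    ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

def p(value, attr_index, cls):
    # Relative frequency of `value` for the given attribute within class `cls`
    rows = [r for r in data if r[4] == cls]
    return sum(1 for r in rows if r[attr_index] == value) / len(rows)

print(p("sunny", 0, "N"))  # P(sunny|n) = 3/5
print(p("high", 2, "P"))   # P(high|p) = 3/9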

Play-tennis example: classifying X

An unseen sample X = <rain, hot, high, false>

P(X|p)∙P(p) = P(rain|p)∙P(hot|p)∙P(high|p)∙P(false|p)∙P(p) = 3/9∙2/9∙3/9∙6/9∙9/14 = 0.010582

P(X|n)∙P(n) = P(rain|n)∙P(hot|n)∙P(high|n)∙P(false|n)∙P(n) = 2/5∙2/5∙4/5∙2/5∙5/14 = 0.018286

Sample X is classified in class n (don't play).
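The same arithmetic checks out in a couple of lines (a sketch using the estimates above):

# Unnormalized scores for X = <rain, hot, high, false>
score_p = (3/9) * (2/9) * (3/9) * (6/9) * (9/14)  # about 0.010582
score_n = (2/5) * (2/5) * (4/5) * (2/5) * (5/14)  # about 0.018286
print("n" if score_n > score_p else "p")          # -> "n" (don't play)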


The independence hypothesis...

... makes computation possible

... yields optimal classifiers when satisfied

... but is seldom satisfied in practice, as attributes (variables) are often correlated.

Attempts to overcome this limitation:

Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes

Bayesian Belief Networks (I)

[Figure: a Bayesian belief network over the variables FamilyH, Age, Diabetes, Insulin, Mass, and Glucose; Mass has parents FamilyH and Age]

The conditional probability table for the variable Mass:

        (FH, A)   (FH, ~A)   (~FH, A)   (~FH, ~A)
M         0.8        0.5        0.7        0.1
~M        0.2        0.5        0.3        0.9

From Jiawei Han's slides
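One plain way to hold such a table in code (a sketch; the dictionary encoding is an assumption, not from the slides):

# P(Mass | FamilyH, Age); keys are (family_history, age) truth values
cpt_mass = {
    (True, True): 0.8,    # P(M | FH, A)
    (True, False): 0.5,   # P(M | FH, ~A)
    (False, True): 0.7,   # P(M | ~FH, A)
    (False, False): 0.1,  # P(M | ~FH, ~A)
}

def p_mass(mass, family_history, age):
    # Look up P(M | parents) and complement it for ~M
    p = cpt_mass[(family_history, age)]
    return p if mass else 1.0 - p

print(p_mass(False, True, False))  # P(~M | FH, ~A) = 0.5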

Applying Bayesian nets


When all but one variable is known, the remaining one can be inferred, e.g.:

P(D | A, F, M, G, I)

From Jiawei Han's slides

Bayesian belief network


Find the joint probability over a set of variables, making use of conditional independence whenever it is known (see the sketch after the figure below).






[Figure: an example Bayesian belief network over variables a, b, C, d, e, with a conditional probability table indexed by the combinations of a and d (entries 0.1 0.2 0.3 0.4 and 0.3 0.2 0.1 0.5)]

Variable e is independent of d given b.
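Concretely, the joint probability factorizes as P(x1, ..., xn) = ∏ P(xi | Parents(xi)); a generic Python sketch of this product (the data structures are illustrative assumptions):

def joint_probability(assignment, parents, cpts):
    # assignment: variable -> value; parents: variable -> tuple of parent names
    # cpts: variable -> function(value, parent_values) -> P(value | parent_values)
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        prob *= cpts[var](value, parent_values)
    return prob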

Bayesian Belief Networks (II)


A Bayesian belief network allows a subset of the variables to be conditionally independent.

A graphical model of causal relationships.

Several cases of learning Bayesian belief networks:

Given both the network structure and all the variables: easy.

Given the network structure but only some of the variables: use gradient descent / EM algorithms.

When the network structure is not known in advance: learning the structure of the network is harder.

From Jiawei Han's slides

The k-Nearest Neighbor Algorithm

All instances correspond to points in the n-D space.

The nearest neighbors are defined in terms of Euclidean distance.

The target function could be discrete- or real-valued.

For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to x_q.

Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

[Figure: a query point x_q among positive (+) and negative (−) training examples, and the Voronoi decision surface induced by 1-NN]

From Jiawei Han's slides
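A minimal k-NN sketch for a discrete-valued target, using Euclidean distance and a majority vote (names and the data layout are illustrative assumptions, not from the slides):

import math
from collections import Counter

def knn_classify(x_q, examples, k=3):
    # examples: list of (point, label) pairs; each point is a tuple of numbers
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(examples, key=lambda e: dist(e[0], x_q))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]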

Discussion on the k-NN Algorithm

The k-NN algorithm for continuous-valued target functions:

Calculate the mean value of the k nearest neighbors.

Distance-weighted nearest neighbor algorithm:

Weight the contribution of each of the k neighbors according to its distance to the query point x_q, giving greater weight to closer neighbors.

Similarly for real-valued target functions.

Robust to noisy data by averaging the k nearest neighbors.

Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes.

To overcome it, stretch the axes or eliminate the least relevant attributes.

w ≡ 1 / d(x_q, x_i)^2
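A sketch of the distance-weighted variant using the weight above, w = 1 / d(x_q, x_i)^2 (illustrative code, not from the slides):

import math
from collections import defaultdict

def weighted_knn_classify(x_q, examples, k=5):
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(examples, key=lambda e: dist(e[0], x_q))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        d = dist(point, x_q)
        if d == 0:                    # query coincides with a training point
            return label
        votes[label] += 1.0 / d ** 2  # w = 1 / d(x_q, x_i)^2
    return max(votes, key=votes.get)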