Robust Bayesian Classifier
Presented by
Chandrasekhar Jakkampudi
Classification
Classification consists of assigning a class label to a set
of unclassified cases.
1. Supervised Classification
The set of possible classes is known in advance.
2. Unsupervised Classification
The set of possible classes is not known in advance. After
classification we can try to assign a name to each class.
Unsupervised classification is also called clustering.
Supervised Classification
• The input data, also called the training set, consists of
multiple records, each having multiple attributes or features.
• Each record is tagged with a class label.
• The objective of classification is to analyze the input data
and to develop an accurate description or model for each
class using the features present in the data.
• This model is then used to classify test data for which the
class labels are not known. (1)
Bayesian Classifier
Assumptions:
1. The classes are mutually exclusive and exhaustive.
2. The attributes are independent given the class.
It is called the “Naïve” classifier because of these
assumptions. It has empirically proven to be useful, and it
scales very well.
Bayesian Classifier
A Bayesian classifier is defined by a set C of classes and a
set A of attributes. A generic class belonging to C is
denoted by c_j and a generic attribute belonging to A by A_i.
Consider a database D in which each case contains a set of
attribute values and the class label of the case.
The training of the Bayesian classifier consists of the
estimation of the conditional probability distribution of
each attribute, given the class.
Bayesian Classifier
Let n(a_ik, c_j) be the number of cases in which A_i appears
with value a_ik and the class is c_j.
Then p(a_ik | c_j) = n(a_ik, c_j) / n(c_j), where
n(c_j) = Σ_k n(a_ik, c_j).
Also p(c_j) = n(c_j) / n.
This is only an estimate based on frequency.
To incorporate our prior belief about p(a_ik | c_j) we add
α_j imaginary cases with class c_j, of which α_jk is the
number of imaginary cases in which A_i appears with value
a_ik and the class is c_j.
Bayesian Classifier
Thus p(a_ik | c_j) = (α_jk + n(a_ik, c_j)) / (α_j + n(c_j)).
Also p(c_j) = (α_j + n(c_j)) / (α + n), where α is the prior
global precision.
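The estimation step above can be sketched in code. This is a minimal illustration, not the authors' implementation: the function and variable names are my own, and spreading α_j uniformly over an attribute's values to get α_jk is an assumption made here for concreteness.

```python
from collections import Counter

def train(records, labels, alpha=1.0):
    """Estimate p(a_ik | c_j) and p(c_j), adding alpha imaginary cases per class.

    records: list of dicts mapping attribute name -> value
    labels:  list of class labels (same length as records)
    """
    n = len(records)
    n_c = Counter(labels)                     # n(c_j)
    n_ac = Counter()                          # n(a_ik, c_j)
    values = {}                               # observed values of each attribute
    for rec, c in zip(records, labels):
        for attr, val in rec.items():
            n_ac[(attr, val, c)] += 1
            values.setdefault(attr, set()).add(val)

    classes = sorted(n_c)
    alpha_total = alpha * len(classes)        # global precision α = Σ_j α_j
    # p(c_j) = (α_j + n(c_j)) / (α + n)
    p_c = {c: (alpha + n_c[c]) / (alpha_total + n) for c in classes}
    # p(a_ik | c_j) = (α_jk + n(a_ik, c_j)) / (α_j + n(c_j));
    # here α_jk spreads α_j uniformly over the attribute's values (an assumption)
    p_a_given_c = {}
    for attr, vals in values.items():
        for v in vals:
            for c in classes:
                a_jk = alpha / len(vals)
                p_a_given_c[(attr, v, c)] = (a_jk + n_ac[(attr, v, c)]) / (alpha + n_c[c])
    return p_c, p_a_given_c
```

Note that for each class the smoothed conditionals still sum to one over an attribute's values, as the formula requires.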
Once the training (estimation of the conditional probability
distribution of each attribute, given the class) is complete
we can classify new cases.
To find p(c_j | e_k) we begin by calculating
p(c_j | a_1k) = p(a_1k | c_j) p(c_j) / Σ_h p(a_1k | c_h) p(c_h)
p(c_j | a_1k, a_2k) = p(a_2k | c_j) p(c_j | a_1k) / Σ_h p(a_2k | c_h) p(c_h | a_1k)
and so on.
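This one-attribute-at-a-time update can be written directly. A minimal sketch, assuming the prior and the conditional tables are given as plain dicts (the names are illustrative):

```python
def posterior(evidence, classes, p_c, p_a_given_c):
    """Compute p(c_j | e_k) by folding in one attribute value at a time.

    evidence:    list of (attribute, value) pairs a_1k, a_2k, ...
    p_c:         dict class -> prior p(c_j)
    p_a_given_c: dict (attribute, value, class) -> p(a_ik | c_j)
    """
    post = dict(p_c)  # start from the prior p(c_j)
    for attr, val in evidence:
        # p(c_j | ..., a_ik) = p(a_ik | c_j) p(c_j | ...) / Σ_h p(a_ik | c_h) p(c_h | ...)
        unnorm = {c: p_a_given_c[(attr, val, c)] * post[c] for c in classes}
        z = sum(unnorm.values())
        post = {c: u / z for c, u in unnorm.items()}
    return post
```

Because the attributes are assumed independent given the class, normalizing after each step gives the same result as applying Bayes' theorem to the full evidence at once.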
Bayesian Classifier
• Works well with complete databases.
• Methods exist to handle incomplete databases.
• Examples include the EM algorithm, Gibbs sampling, Bound
and Collapse (BC), and the Robust Bayesian Classifier.
Robust Bayesian Classifier
Incomplete databases seriously compromise the computational
efficiency of Bayesian classifiers.
One approach is to throw away all the incomplete entries.
Another approach is to try to complete the database by
asking the user to specify the pattern of the missing data.
The Robust Bayesian Classifier makes no assumption about the
nature of the missing data. It provides probability intervals
that contain the estimates learned from all possible
completions of the database.
Training
We need to estimate the conditional probability p(a_ik | c_j).
We have three types of incomplete cases:
1. A_i is missing.
2. The class C is missing.
3. Both are missing.
Consider the case where the value of A_i is not known.
Fill in all the missing values of A_i with a_ik and calculate
p_max(a_ik | c_j).
Fill in none of the missing values of A_i with a_ik and
calculate p_min(a_ik | c_j).
The actual value of p(a_ik | c_j) lies somewhere between
these two extremes.
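For the case where only A_i is missing, the two extreme completions above reduce to simple count arithmetic. A sketch in the notation of the training section, with the smoothing terms omitted for clarity and the function name my own:

```python
def interval_estimate(n_aik_cj, n_cj, m_missing):
    """Bounds on p(a_ik | c_j) when m_missing cases of class c_j lack a value for A_i.

    n_aik_cj: n(a_ik, c_j), cases where A_i = a_ik and the class is c_j
    n_cj:     n(c_j), all cases of class c_j (including the incomplete ones)
    p_max: complete every missing entry with a_ik.
    p_min: complete none of them with a_ik.
    """
    p_max = (n_aik_cj + m_missing) / n_cj
    p_min = n_aik_cj / n_cj
    return p_min, p_max
```

Every possible completion of the database yields an estimate inside this interval, which is the sense in which the classifier is "robust" to the missing-data mechanism.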
Prediction
Prediction involves computing p(c_j | e_k). Since we now
have an interval for p(a_ik | c_j), we now calculate
p_max(c_j | e_k) and p_min(c_j | e_k).
To make the actual prediction of the class, the authors
introduce two criteria.
1. Stochastic dominance: assign class label c_j if
p_min(c_j | e_k) is greater than p_max(c_h | e_k) for all
h ≠ j.
2. Weak dominance: arrive at a single probability for
p(c_j | e_k) by assigning a score that falls in the interval
between p_min(c_j | e_k) and p_max(c_j | e_k).
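The stochastic dominance rule, given the posterior intervals, is a small comparison (a sketch; the dict layout is an assumption of this example):

```python
def stochastic_dominance(p_min, p_max):
    """Return the class whose lower bound beats every other class's upper
    bound, or None when the intervals overlap (no prediction is made,
    which is what reduces coverage).

    p_min, p_max: dicts class -> p_min(c_j | e_k), p_max(c_j | e_k)
    """
    for j in p_min:
        if all(p_min[j] > p_max[h] for h in p_max if h != j):
            return j
    return None
```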
Prediction
The stochastic dominance criterion reduces coverage because
the probability intervals may overlap. It is the more
conservative and safe method.
The weak dominance criterion improves coverage. Classification
depends on the score used to arrive at a single probability
for p(c_j | e).
Score used by the authors:
p_min(c_j | e) (c − 1)/c + p_max(c_j | e)/c
where c is the total number of classes.
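The score above collapses each interval to a single number, weighting the pessimistic bound more heavily as the number of classes grows. A sketch (function names are my own):

```python
def weak_dominance_score(p_min_j, p_max_j, c):
    """Single score inside [p_min, p_max] used by the weak dominance criterion:
    p_min(c_j | e) * (c - 1) / c  +  p_max(c_j | e) / c,
    where c is the total number of classes."""
    return p_min_j * (c - 1) / c + p_max_j / c

def weak_dominance(p_min, p_max):
    """Pick the class with the highest score; a class is always returned,
    which is why this criterion achieves full coverage."""
    c = len(p_min)
    return max(p_min, key=lambda j: weak_dominance_score(p_min[j], p_max[j], c))
```

Because the score is a convex combination of the two bounds, it always lies inside the interval, as the criterion requires.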
Results
The Robust Bayesian Classifier was tested on the Adult
database, which consists of 14 attributes over 48841 cases
from the 1994 US Census. 7% of the database is incomplete.
The database is divided into two classes: people who earn
more than $50000 a year and people who don’t.
The Bayesian classifier gave an accuracy of 81.74% with a
coverage of 93%.
The Robust Bayesian classifier under the stochastic dominance
criterion gave an accuracy of 86.51% with a coverage of 87%.
The Robust Bayesian classifier under the weak dominance
criterion gave an accuracy of 82.5% with 100% coverage.
Conclusion
The Robust Bayesian Classifier retains or improves upon the
accuracy of the Naïve Bayesian Classifier.
The stochastic dominance criterion should be used when
accuracy is more important than coverage.
For more general databases, the weak dominance criterion
should be used because it maintains the accuracy of the
classification while improving the coverage.
Bibliography
1. SLIQ: A Fast Scalable Classifier for Data Mining; Manish
Mehta, Rakesh Agrawal and Jorma Rissanen
2. An Introduction to the Robust Bayesian Classifier; Marco
Ramoni and Paola Sebastiani
3. A Bayesian Approach to Filtering Junk E-mail; Mehran
Sahami, Susan Dumais, David Heckerman, Eric Horvitz
4. Bayesian Networks without Tears; Eugene Charniak
5. Bayesian Networks Basics; Finn V. Jensen