Data Mining
Comp. Sc. and Inf. Mgmt.
Asian Institute of Technology
Instructor: Dr. Sumanta Guha
Slide Sources: Han & Kamber, “Data Mining: Concepts and Techniques” book; slides by Han, Han & Kamber, adapted and supplemented by Guha
Chapter 6: Classification and Prediction
Classification vs. Prediction vs. Estimation
Classification is the grouping of existing data.
E.g., grouping patients based on their medical history.
E.g., grouping students based on their test scores.
Prediction is the use of existing data values to guess a future value.
E.g., using a patient’s medical history to guess the effect of a treatment.
E.g., using a student’s past test scores to guess her chance of success in a future exam.
Estimation is the prediction of a continuous value.
E.g., predicting a patient’s cholesterol level given certain drugs.
E.g., predicting a student’s score on the GRE knowing her scores on school tests.
Prediction of discrete values is most often based on classification, e.g., a middle-aged, overweight, male smoker has a high chance of heart attack; a young, normal-weight, female non-smoker has a low chance of heart attack.
Question: Why is classification not useful in estimation?
Classification Procedure
Split known data into two sets: a training set and a testing set.
Next, a two-step classification procedure:
1. Learning step: Use the training data to develop a classification model.
2. Testing step: Use the testing data to test the accuracy of the model. If the accuracy is acceptable, the model is put to use on real data.
AusDM 2009 Competition
Australasian Data Mining Conference 2009 Competition
Sample data type:
# Rating Predictor1 Predictor2 … Predictor1000
1 4 3 4 … 4
2 3 4 2 … 2
3 1 2 1 … 2
4 4 3 4 … 5
5 5 5 5 … 4
6 2 1 2 … 4
.
10000 1 2 4 … 1
Challenge:
Combine the given predictors into an even better one!
Process (1): Model Construction
Training Data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no
Training Data → Classification Algorithms → Classifier (Model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction
The Classifier is applied to the Testing Data and then to Unseen Data.
Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Unseen Data: (Jeff, Professor, 4). Tenured?
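The two steps can be tried out on the slide's data. The following is a minimal sketch, assuming the learned model is exactly the rule shown on the model-construction slide; the function names `classify` and `accuracy` are illustrative and not part of the slides.

```python
# Testing data from the slide: (name, rank, years, tenured)
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def classify(rank, years):
    """The learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

def accuracy(rows):
    """Fraction of rows whose predicted label matches the known label."""
    correct = sum(1 for _, rank, years, label in rows
                  if classify(rank, years) == label)
    return correct / len(rows)

test_accuracy = accuracy(testing_data)      # testing step
jeff_prediction = classify("Professor", 4)  # use on unseen data
print(test_accuracy, jeff_prediction)
```

The rule misclassifies Merlisa (7 years, but not tenured), so it is right on 3 of the 4 testing tuples; whether that accuracy is acceptable is a judgment call left to the user of the model.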
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Decision Tree Induction: Training Dataset
age     income  student  credit_rating  buys_computer
young   high    no       fair           no
young   high    no       excellent      no
middle  high    no       fair           yes
senior  medium  no       fair           yes
senior  low     yes      fair           yes
senior  low     yes      excellent      no
middle  low     yes      excellent      yes
young   medium  no       fair           no
young   low     yes      fair           yes
senior  medium  yes      fair           yes
young   medium  yes      excellent      yes
middle  medium  no       excellent      yes
middle  high    yes      fair           yes
senior  medium  no       excellent      no
Output: A Decision Tree for “buys_computer”
age?
  young: student?
    no: no
    yes: yes
  middle: yes
  senior: credit rating?
    fair: yes
    excellent: no
Algorithm for building a Decision Tree from Training Tuples
Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.
Input:
Data partition D, which is a set of training tuples and their associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the splitting criterion that “best” partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split point or splitting subset.
Output: A decision tree.
Method:
(1) create a node N;
(2) if tuples in D are all of the same class C then
(3)     return N as a leaf node labeled with the class C;
(4) if attribute_list is empty then
(5)     return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
(7) label node N with splitting_criterion;
(8) if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
(9)     attribute_list ← attribute_list - splitting_attribute; // remove splitting_attribute
(10) for each outcome j of splitting_criterion // partition the tuples and grow subtrees for each partition
(11)     let D_j be the set of data tuples in D satisfying outcome j; // a partition
(12)     if D_j is empty then
(13)         attach a leaf labeled with the majority class in D to node N;
(14)     else attach the node returned by Generate_decision_tree(D_j, attribute_list) to node N;
     endfor
(15) return N;
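The method above can be sketched in Python for discrete-valued attributes with multiway splits. This is an illustrative sketch only: it assumes information gain as the Attribute_selection_method, represents a leaf by its class label and an internal node by a dict, and takes a `domains` argument (an assumption, not part of the pseudocode) listing every outcome of each attribute so that empty partitions can arise as in steps (12)-(13).

```python
import math
from collections import Counter

ATTRS = ["age", "income", "student", "credit_rating"]
ROWS = """young high no fair no
young high no excellent no
middle high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middle low yes excellent yes
young medium no fair no
young low yes fair yes
senior medium yes fair yes
young medium yes excellent yes
middle medium no excellent yes
middle high yes fair yes
senior medium no excellent no"""
# Each training tuple is (attribute dict, class label).
D = [(dict(zip(ATTRS, r.split()[:4])), r.split()[4]) for r in ROWS.splitlines()]

def info(data):
    """Entropy (in bits) of the class labels in data."""
    counts = Counter(label for _, label in data)
    return -sum(c / len(data) * math.log2(c / len(data)) for c in counts.values())

def select_attribute(data, attribute_list):
    """Assumed selection method: lowest expected information = highest gain."""
    def expected_info(a):
        sizes = Counter(x[a] for x, _ in data)
        return sum(n / len(data) * info([t for t in data if t[0][a] == v])
                   for v, n in sizes.items())
    return min(attribute_list, key=expected_info)

def majority(data):
    return Counter(label for _, label in data).most_common(1)[0][0]

def generate_decision_tree(data, attribute_list, domains):
    labels = {label for _, label in data}
    if len(labels) == 1:                                # steps (2)-(3)
        return labels.pop()
    if not attribute_list:                              # steps (4)-(5)
        return majority(data)
    a = select_attribute(data, attribute_list)          # steps (6)-(7)
    remaining = [x for x in attribute_list if x != a]   # step (9)
    node = {"attribute": a, "branches": {}}
    for v in domains[a]:                                # steps (10)-(14)
        d_j = [t for t in data if t[0][a] == v]
        node["branches"][v] = (majority(data) if not d_j else
                               generate_decision_tree(d_j, remaining, domains))
    return node                                         # step (15)

domains = {a: sorted({x[a] for x, _ in D}) for a in ATTRS}
tree = generate_decision_tree(D, ATTRS, domains)
print(tree)
```

On the buys_computer training tuples this sketch splits on age at the root, then on student for the young partition and on credit_rating for the senior partition.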
How to choose the splitting criterion? Coming up: entropy!
Entropy
Entropy measures the uncertainty associated with a random variable.
Intuitively, if we toss a coin, the entropy is maximum when the coin is fair, i.e., when prob(H) = prob(T) = 0.5.
If the coin is not fair, e.g., if prob(H) = 0.75, prob(T) = 0.25, then the entropy is less (there is less uncertainty, and so less information gained, associated with tossing the coin).
Calculating entropy:
If a random variable $X$ can take values $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$, then the entropy of $X$ is given by the formula:
$$H(X) = -\sum_{i=1}^{n} p_i \log(p_i)$$
(If $p_i = 0$ for some $i$, then $0 \log(0)$ is taken to be 0.)
Exercise: Calculate the entropy of a fair coin and of one with prob(H) = 0.75, prob(T) = 0.25.
Exercise: Calculate the entropy of a fair die.
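For checking hand calculations, the entropy formula can be written as a short function. This is a sketch that assumes base-2 logarithms (entropy in bits), which is the base used for Info(D) on the later slides; the function name `entropy` and the 0.9/0.1 coin are illustrative choices.

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum p_i log(p_i); terms with p_i = 0 contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])      # maximum uncertainty for two outcomes
biased_coin = entropy([0.9, 0.1])    # less uncertainty than the fair coin
certain = entropy([1.0, 0.0])        # no uncertainty at all
print(fair_coin, biased_coin, certain)
```

The same function can be applied to the distributions in the exercises.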
Entropy of a binary variable (coin)
[Figure: entropy (vertical axis) plotted against probability (horizontal axis)]
Attribute Selection Measure:
Information Gain (ID3)
Select the attribute with the highest information gain
Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$
Expected information (entropy) needed to classify a tuple in D:
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
Information needed (after using attribute A to split D into v partitions) to classify D:
$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$
Information gained by branching on attribute A:
$$Gain(A) = Info(D) - Info_A(D)$$
Attribute Selection: Information Gain
Class P: buys_computer = “yes”
Class N: buys_computer = “no”
$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$
age     p_i  n_i  I(p_i, n_i)
young   2    3    0.971
middle  4    0    0
senior  3    2    0.971
$$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$
$\frac{5}{14} I(2,3)$ means “young” has 5 out of 14 samples, with 2 yes’s and 3 no’s. Hence
$$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$
Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
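The gains for all four attributes can be recomputed from the 14 training tuples. The sketch below is illustrative (the names `info` and `gain` are not from the slides) and works at full floating-point precision, so its results agree with the slide's figures only to within rounding in the third decimal place.

```python
import math
from collections import Counter

ATTRS = ["age", "income", "student", "credit_rating"]
ROWS = """young high no fair no
young high no excellent no
middle high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middle low yes excellent yes
young medium no fair no
young low yes fair yes
senior medium yes fair yes
young medium yes excellent yes
middle medium no excellent yes
middle high yes fair yes
senior medium no excellent no"""
DATA = [line.split() for line in ROWS.splitlines()]  # last column = class

def info(labels):
    """Info(D): entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(attr):
    """Gain(A) = Info(D) - Info_A(D) for the attribute named attr."""
    k = ATTRS.index(attr)
    info_a = 0.0
    for value in {row[k] for row in DATA}:
        part = [row[-1] for row in DATA if row[k] == value]
        info_a += len(part) / len(DATA) * info(part)
    return info([row[-1] for row in DATA]) - info_a

gains = {a: gain(a) for a in ATTRS}
best = max(gains, key=gains.get)
print(gains, best)
```

age has the highest gain, so it is the attribute chosen for the root split.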
Terminating a Decision Tree
The decision tree based on the training set above terminates “perfectly” in that every leaf is “pure”: e.g., in the training set, all young non-students don’t buy computers, all young students buy computers, etc. This may not always be the case.
Exercise: Determine a decision tree for buys_computer based only on the first two attributes, age and income, of the training set above (i.e., imagine student and credit-rating data are not available).
Computing Information-Gain for Continuous-Value Attributes
Let attribute A be a continuous-valued attribute
Must determine the best split point for A
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent values is considered as a possible split point
$(a_i + a_{i+1})/2$ is the midpoint between the values of $a_i$ and $a_{i+1}$
Split:
D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
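The midpoint search can be sketched as follows. The selection rule used here, picking the midpoint with the lowest expected information (equivalently the highest gain), is an assumption consistent with the information-gain measure above; the ages and labels are made-up illustration data, and `best_split_point` is a hypothetical name.

```python
import math
from collections import Counter

def info(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Try the midpoint of each pair of adjacent distinct values of A."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = None
    for a, b in zip(distinct, distinct[1:]):
        split = (a + b) / 2
        d1 = [label for v, label in pairs if v <= split]  # A <= split-point
        d2 = [label for v, label in pairs if v > split]   # A > split-point
        expected = (len(d1) * info(d1) + len(d2) * info(d2)) / len(pairs)
        if best is None or expected < best[1]:
            best = (split, expected)
    return best

# Hypothetical continuous attribute (age in years) with class labels.
ages = [22, 25, 30, 35, 40, 52]
buys = ["no", "no", "no", "yes", "yes", "yes"]
split, expected_info = best_split_point(ages, buys)
print(split, expected_info)
```

Here the midpoint 32.5 separates the two classes perfectly, so both partitions are pure and the expected information is zero.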
Attribute Selection Measure:
Gain Ratio (C4.5)
Information gain measure is biased towards attributes with a large number of values
E.g., if the customer table in the previous slides had a first column of customer IDs 1-14, and we used this as the splitting attribute, every partition would be pure, so the gain would be the maximum possible, Gain = Info(D)! But, obviously, this split is useless.
C4.5 (a successor of ID3) uses SplitInfo to overcome the problem
$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$
C4.5 normalizes the info gain by SplitInfo
GainRatio(A) = Gain(A)/SplitInfo_A(D)
Ex. income splits D into partitions of 4, 6 and 4 tuples:
$$SplitInfo_{income}(D) = -\frac{4}{14}\log_2\left(\frac{4}{14}\right) - \frac{6}{14}\log_2\left(\frac{6}{14}\right) - \frac{4}{14}\log_2\left(\frac{4}{14}\right) = 1.557$$
gain_ratio(income) = 0.029/1.557 = 0.019
The attribute with the maximum gain ratio is selected as the splitting attribute
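Evaluating the SplitInfo formula directly is a useful check on the arithmetic. In this sketch (the name `split_info` is illustrative) the income partition sizes 4, 6, 4 are counted from the buys_computer training table and Gain(income) = 0.029 is taken from the information-gain slide; the formula then gives a SplitInfo of about 1.557 and a gain ratio of about 0.019.

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum |D_j|/|D| * log2(|D_j|/|D|)."""
    total = sum(partition_sizes)
    return -sum(n / total * math.log2(n / total)
                for n in partition_sizes if n > 0)

# income: 4 tuples with low, 6 with medium, 4 with high.
income_split_info = split_info([4, 6, 4])
gain_income = 0.029
gain_ratio_income = gain_income / income_split_info

# A customer-ID attribute puts each of the 14 tuples in its own partition,
# so its SplitInfo is log2(14), the largest possible for 14 tuples.
id_split_info = split_info([1] * 14)
print(income_split_info, gain_ratio_income, id_split_info)
```

The large SplitInfo of an ID-like attribute is what penalizes it: its maximal gain is divided by log2(14), roughly 3.8.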
Gini index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the gini index, gini(D), is defined as
$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$
where $p_j$ is the relative frequency of class j in D
Gini measures the relative mean difference of values in the population, i.e., the average of the differences divided by the average pop. value. It is very important in economics to measure income distribution.
To motivate the above definition, consider n = 2, and a pop. whose values are only 0 and 1; in particular, a fraction p of the pop. is 0 and (1 - p) is 1. Then the possible differences between pop. values are 0 and 1 with probs.
Diff.  Prob.
0      $p^2 + (1-p)^2$
1      $2p(1-p)$
Therefore, $gini = 2p(1-p) = 1 - p^2 - (1-p)^2 = 1 - \sum_{j=1,2} p_j^2$
Gini index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the gini index, gini(D), is defined as
$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$
where $p_j$ is the relative frequency of class j in D
Always consider a binary split of each attribute A. If a data set D is split on A into two subsets $D_1$ and $D_2$, the gini index $gini_A(D)$ is defined as
$$gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)$$
Reduction in Impurity:
$$\Delta gini(A) = gini(D) - gini_A(D)$$
The attribute giving the largest reduction in impurity is chosen to split the node (need to enumerate all the possible binary splitting points for each attribute)
Gini index (CART, IBM IntelligentMiner)
Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
$$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$
Suppose the attribute income partitions D into 10 in $D_1$: {low, medium} and 4 in $D_2$: {high}
$$gini_{income \in \{low, medium\}}(D) = \frac{10}{14} gini(D_1) + \frac{4}{14} gini(D_2) = 0.443$$
Similarly, $gini_{\{low, high\}}$ is 0.458 and $gini_{\{medium, high\}}$ is 0.450; thus the split on {low, medium} (and {high}) is the best since its gini index is the lowest
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes
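The three possible binary splits of income can be compared directly. The sketch below counts (yes, no) class frequencies per income value from the buys_computer training table; the helper names `gini`, `gini_split` and `merge` are illustrative. Computed this way, the {low, medium} / {high} split has the lowest gini index, about 0.443, against about 0.450 and 0.458 for the other two.

```python
def gini(counts):
    """gini(D) = 1 - sum p_j^2, from per-class tuple counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(d1_counts, d2_counts):
    """gini_A(D) = |D1|/|D| gini(D1) + |D2|/|D| gini(D2)."""
    n1, n2 = sum(d1_counts), sum(d2_counts)
    return (n1 * gini(d1_counts) + n2 * gini(d2_counts)) / (n1 + n2)

# (yes, no) counts of buys_computer for each income value in the table.
INCOME = {"low": (3, 1), "medium": (4, 2), "high": (2, 2)}

def merge(*values):
    """Add up the (yes, no) counts of several income values."""
    return tuple(sum(c) for c in zip(*(INCOME[v] for v in values)))

gini_d = gini((9, 5))
splits = {
    "{low,medium} | {high}": gini_split(merge("low", "medium"), INCOME["high"]),
    "{low,high} | {medium}": gini_split(merge("low", "high"), INCOME["medium"]),
    "{medium,high} | {low}": gini_split(merge("medium", "high"), INCOME["low"]),
}
best_split = min(splits, key=splits.get)
reduction = gini_d - splits[best_split]   # reduction in impurity
print(gini_d, splits, best_split, reduction)
```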
Comparing Attribute Selection Measures
The three measures, in general, return good results but
Information gain: biased towards multivalued attributes
Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
Gini index:
biased to multivalued attributes
has difficulty when # of classes is large
tends to favor tests that result in equal-sized partitions and purity in both partitions
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, the naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem: Basics
Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X) (posteriori probability of H conditioned on X), the probability that the hypothesis holds given the observed data sample X
P(H) (prior probability), the initial probability
E.g., H is “will buy computer” regardless of age, income, …
P(X): prior probability that the evidence is observed
E.g., X is “age in 31..40, medium income”
P(X|H) (posteriori probability of X conditioned on H), the probability of observing the evidence X, given that the hypothesis holds
E.g., given H that a customer will buy computer, P(X|H) is the prob. of the evidence X that the customer’s age is 31..40, medium income
Bayesian Theorem
Given training evidence X, the posteriori probability of a hypothesis H, P(H|X), follows the Bayes theorem
$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$
Informally, this can be written as
posteriori = likelihood x prior/evidence
Predicts X belongs to $C_i$ iff the probability $P(C_i|X)$ is the highest among all the $P(C_k|X)$ for all the k classes
Practical difficulty: requires initial knowledge of many probabilities, significant computational cost
Towards Naïve Bayesian Classifier
Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-dim. attribute vector $X = (x_1, x_2, \ldots, x_n)$
Suppose there are m classes $C_1, C_2, \ldots, C_m$.
Classification is to derive the maximum posteriori, i.e., the maximal $P(C_i|X)$
This can be derived from Bayes’ theorem
$$P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}$$
Since prior prob. P(X) is constant for all classes, maximizing $P(C_i|X)$ is equivalent to maximizing $P(X|C_i)\,P(C_i)$
Derivation of Naïve Bayes Classifier
A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes), in particular:
$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$$
This greatly reduces the computation cost: Only counts the class distribution
If $A_k$ is categorical, $P(x_k|C_i)$ is the # of tuples in $C_i$ having value $x_k$ for $A_k$ divided by $|C_{i,D}|$ (# of tuples of $C_i$ in D)
Naïve Bayesian Classifier: Training Dataset
Class:
C_1: buys_computer = ‘yes’
C_2: buys_computer = ‘no’
Data sample X = (age = youth, income = medium, student = yes, credit_rating = fair)
age     income  student  credit_rating  buys_computer
youth   high    no       fair           no
youth   high    no       excellent      no
middle  high    no       fair           yes
senior  medium  no       fair           yes
senior  low     yes      fair           yes
senior  low     yes      excellent      no
middle  low     yes      excellent      yes
youth   medium  no       fair           no
youth   low     yes      fair           yes
senior  medium  yes      fair           yes
youth   medium  yes      excellent      yes
middle  medium  no       excellent      yes
middle  high    yes      fair           yes
senior  medium  no       excellent      no
Naïve Bayesian Classifier: An Example
P(C_i):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
Compute P(X|C_i) for each class
P(age = “youth” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “youth” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age = youth, income = medium, student = yes, credit_rating = fair)
P(X|C_i):
P(X | buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|C_i) * P(C_i):
P(X | buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X | buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
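The worked example can be reproduced by counting in the training table. This is a minimal sketch for categorical attributes only, without any smoothing; the function name `score` is illustrative and the code computes P(X|C_i) * P(C_i) exactly as in the steps above.

```python
ROWS = """youth high no fair no
youth high no excellent no
middle high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middle low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middle medium no excellent yes
middle high yes fair yes
senior medium no excellent no"""
DATA = [line.split() for line in ROWS.splitlines()]  # last column = class

def score(x, cls):
    """P(X|C_i) * P(C_i) under class-conditional independence."""
    rows = [r for r in DATA if r[-1] == cls]
    result = len(rows) / len(DATA)                 # P(C_i)
    for k, value in enumerate(x):                  # product of P(x_k|C_i)
        result *= sum(1 for r in rows if r[k] == value) / len(rows)
    return result

x = ("youth", "medium", "yes", "fair")
scores = {cls: score(x, cls) for cls in ("yes", "no")}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The two scores come out at about 0.028 and 0.007, so the sample is assigned to buys_computer = yes.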
Avoiding the 0-Probability Problem
Naïve Bayesian prediction requires each conditional prob. be non-zero. Otherwise, the predicted prob. will be zero
$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$$
Ex. Suppose a dataset with 1000 tuples, income = low (0), income = medium (990), and income = high (10)
Use Laplacian correction (or Laplacian estimator)
Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
The “corrected” prob. estimates are close to their “uncorrected” counterparts
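The correction amounts to adding 1 to every count and adding the number of distinct values to the denominator. A small sketch, with `laplace_estimates` as an illustrative name:

```python
def laplace_estimates(counts):
    """Add 1 to each count; the denominator grows by the number of values."""
    total = sum(counts.values()) + len(counts)
    return {value: (c + 1) / total for value, c in counts.items()}

income_counts = {"low": 0, "medium": 990, "high": 10}
probs = laplace_estimates(income_counts)
print(probs)
```

No estimate is zero any more, and the estimates still sum to 1.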
Naïve Bayesian Classifier: Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore
loss of accuracy
Practically, dependencies exist among variables
E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
Dependencies among these cannot be modeled by Naïve
Bayesian Classifier
How to deal with these dependencies?
Bayesian Belief Networks
Bayesian Belief Networks
A Bayesian belief network allows a subset of the variables to be conditionally independent
A graphical model of causal relationships
Represents dependency among the variables
Gives a specification of joint probability distribution
[Figure: network with nodes X, Y, Z, P; links X → Z, Y → Z, Y → P]
Nodes: random variables
Links: dependency
X and Y are the parents of Z, and Y is the parent of P
No dependency between Z and P
Has no loops or cycles
Bayesian Belief Network: An Example
[Figure: network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]
      (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC    0.8      0.5       0.7       0.1
~LC   0.2      0.5       0.3       0.9
Bayesian Belief Networks
The conditional probability table (CPT) for variable LungCancer:
CPT shows the conditional probability for each possible combination of its parents
Derivation of the probability of a particular combination of values of X, from CPT:
$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(Y_i))$$
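The product formula can be illustrated on the three variables FamilyHistory (FH), Smoker (S) and LungCancer (LC). The CPT values for LC are the ones given on the example slide; the priors for FH and S are hypothetical numbers chosen only for illustration, and treating FH and S as nodes without parents is an assumption of this sketch.

```python
# P(LC = true | FH, S), from the LungCancer CPT on the slide.
P_LC = {(True, True): 0.8, (True, False): 0.5,
        (False, True): 0.7, (False, False): 0.1}
# Hypothetical priors (not from the slides); FH and S assumed parentless.
P_FH = 0.3
P_S = 0.4

def bernoulli(p_true, value):
    """Probability of a boolean value given the probability of True."""
    return p_true if value else 1 - p_true

def joint(fh, s, lc):
    """P(fh, s, lc) = P(fh) * P(s) * P(lc | fh, s)."""
    return (bernoulli(P_FH, fh) * bernoulli(P_S, s)
            * bernoulli(P_LC[(fh, s)], lc))

p_all_true = joint(True, True, True)
total = sum(joint(fh, s, lc)
            for fh in (True, False)
            for s in (True, False)
            for lc in (True, False))
print(p_all_true, total)
```

Summing the joint probability over all eight combinations gives 1, as it should for a properly specified distribution.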
Training Bayesian Networks
Several scenarios:
Given both the network structure and all variables observable: learn only the CPTs
Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
Network structure unknown, all variables observable: search through the model space to reconstruct network topology
Unknown structure, all hidden variables: No good algorithms known for this purpose
Ref. D. Heckerman: Bayesian networks for data mining
The k-Nearest Neighbor (k-NN) Algorithm
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance, dist($X_1$, $X_2$)
Target function could be discrete- or real-valued
For discrete-valued, k-NN returns the most common value (= majority vote) among the k training examples nearest to $x_q$
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure: training examples labeled + and − around the query point $x_q$]
Discussion on the k-NN Algorithm
k-NN for real-valued prediction for a given unknown tuple
Returns the mean values of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
Weight the contribution of each of the k neighbors according to their distance to the query $x_q$
Give greater weight to closer neighbors
$$w = \frac{1}{d(x_q, x_i)^2}$$
Robust to noisy data by averaging k-nearest neighbors
Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes
To overcome it, stretch the axes or eliminate the least relevant attributes
Categorical (non-numerical) attributes, e.g., color, are difficult to handle
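The difference between the plain majority vote and the distance-weighted vote shows up on a small example. The sketch below uses made-up 2-D points and an illustrative function name `knn_classify`; it relies on `math.dist`, available from Python 3.8, for the Euclidean distance.

```python
import math
from collections import defaultdict

def knn_classify(training, query, k=3, weighted=False):
    """Vote among the k nearest training points; optionally weight by 1/d^2."""
    neighbors = sorted(training, key=lambda t: math.dist(t[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = math.dist(point, query)
        if weighted:
            if d == 0:            # query coincides with a training point
                return label
            votes[label] += 1 / d ** 2
        else:
            votes[label] += 1
    return max(votes, key=votes.get)

# Hypothetical training examples: (point, class label).
training = [((0.5, 0.0), "+"), ((2.0, 0.0), "-"),
            ((0.0, 2.0), "-"), ((5.0, 5.0), "+")]
query = (0.0, 0.0)
plain = knn_classify(training, query, k=3)
by_distance = knn_classify(training, query, k=3, weighted=True)
print(plain, by_distance)
```

With k = 3 the plain vote is two "-" against one "+", but the single "+" neighbor is much closer to the query, so the distance-weighted vote goes the other way.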
Ensemble Methods: Increasing the Accuracy
Ensemble methods
Use a combination of models to increase accuracy
Combine a series of k learned models, $M_1, M_2, \ldots, M_k$, with the aim of creating an improved model M*
Popular ensemble methods
Bagging: averaging the prediction over a collection of classifiers
Boosting: weighted vote with a collection of classifiers
Bagging: Bootstrap Aggregation
Analogy: Diagnosis based on multiple doctors’ majority vote
Training
Given a set D of d tuples, at each iteration i, a training set $D_i$ of d tuples is sampled with replacement from D (i.e., bootstrap)
A classifier model $M_i$ is learned for each training set $D_i$
Classification: classify an unknown sample X
Each classifier $M_i$ returns its class prediction
The bagged classifier M* counts the votes and assigns the class with the most votes to X
Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
Accuracy
Often significantly better than a single classifier derived from D
For noisy data: not considerably worse, more robust
Proved improved accuracy in prediction
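The training and voting steps can be sketched as follows. The base learner here, a 1-nearest-neighbor classifier on a single numeric attribute, is a hypothetical stand-in chosen for brevity, as are the data and the names `bootstrap`, `learn_1nn` and `bagging`; any classifier-learning procedure could be passed in instead.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Sample len(data) tuples from data with replacement."""
    return [rng.choice(data) for _ in data]

def learn_1nn(sample):
    """Hypothetical base learner: 1-nearest-neighbor on one numeric attribute."""
    return lambda x: min(sample, key=lambda t: abs(t[0] - x))[1]

def bagging(data, k, learn=learn_1nn, seed=0):
    """Learn k models on bootstrap samples; M* returns the majority vote."""
    rng = random.Random(seed)
    models = [learn(bootstrap(data, rng)) for _ in range(k)]

    def m_star(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return m_star

# Made-up training tuples: (attribute value, class label).
D = [(1, "no"), (2, "no"), (3, "no"), (7, "yes"), (8, "yes"), (9, "yes")]
m_star = bagging(D, k=25)
print(m_star(0), m_star(10))
```

For continuous prediction, the vote in `m_star` would be replaced by the average of the individual predictions.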
Boosting
Analogy: Consult several doctors, based on a combination of weighted diagnoses; weight assigned based on the previous diagnosis accuracy
How does boosting work?
Weights are assigned to each training tuple
A series of k classifiers is iteratively learned
After a classifier $M_i$ is learned, the weights are updated to allow the subsequent classifier, $M_{i+1}$, to pay more attention to the training tuples that were misclassified by $M_i$
The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
The boosting algorithm can be extended for the prediction of continuous values
Comparing with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
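One concrete instance of this scheme is discrete AdaBoost with decision stumps, sketched below. The specific formulas used (vote weight alpha = 0.5 ln((1 - err)/err) and multiplicative reweighting by exp(-alpha y h(x))) are the standard textbook form of AdaBoost rather than something stated on these slides, and the data set, labels in {+1, -1}, and function names are illustrative assumptions.

```python
import math

def stump(threshold, sign):
    """Weak classifier: predict sign if x < threshold, else -sign."""
    return lambda x: sign if x < threshold else -sign

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1 / n] * n                      # one weight per training tuple
    candidates = [stump(x + 0.5, s) for x in xs for s in (1, -1)]
    ensemble = []
    for _ in range(rounds):
        def weighted_error(h):
            return sum(w for w, x, y in zip(weights, xs, ys) if h(x) != y)
        h = min(candidates, key=weighted_error)
        err = min(max(weighted_error(h), 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)    # vote weight of h
        ensemble.append((alpha, h))
        # Misclassified tuples gain weight, correctly classified ones lose it.
        weights = [w * math.exp(-alpha * y * h(x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]

    def m_star(x):
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

    return m_star

# Made-up 1-D data that no single stump classifies perfectly.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
single = adaboost(xs, ys, rounds=1)
boosted = adaboost(xs, ys, rounds=3)
single_correct = sum(single(x) == y for x, y in zip(xs, ys))
boosted_correct = sum(boosted(x) == y for x, y in zip(xs, ys))
print(single_correct, boosted_correct)
```

On this data the best single stump gets 7 of the 10 tuples right, while the weighted vote of three stumps classifies all 10 correctly.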
AdaBoost (Freund and Schapire, 1997)
AdaBoost Lecture Slides by Šochman and Matas