# Slides - Asian Institute of Technology

Artificial Intelligence and Robotics

7 Nov 2013

Data Mining

Comp. Sc. and Inf. Mgmt.

Asian Institute of Technology

Instructor: Dr. Sumanta Guha

Slide Sources: Han & Kamber, “Data Mining: Concepts and Techniques” book; slides by Han, Han & Kamber, and supplemented by Guha

Chapter 6: Classification and Prediction

Classification vs. Prediction vs. Estimation

Classification is the grouping of existing data.

E.g., grouping patients based on their medical history.

E.g., grouping students based on their test scores.

Prediction is the use of existing data values to guess a future value.

E.g., using a patient’s medical history to guess the effect of a treatment.

E.g., using a student’s past test scores to guess the chance of success in a future exam.

Estimation is the prediction of a continuous value.

E.g., predicting a patient’s cholesterol level given certain drugs.

E.g., predicting a student’s score on the GRE knowing her scores on school tests.

Prediction of discrete values is most often based on classification, e.g., a middle-aged, overweight, male smoker has a high chance of heart attack; a young, normal-weight, female non-smoker has a low chance of heart attack.

Question: Why is classification not useful in estimation?

Classification Procedure

Split known data into two sets: a training set and a testing set.

Next, a two-step classification procedure:

1. Learning step: Use the training data to develop a classification model.

2. Testing step: Use the testing data to test the accuracy of the model. If the accuracy is acceptable, the model is put to use on real data.

AusDM 2009 Competition

Australasian Data Mining Conference 2009 Competition

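Sample data type:

| # | Rating | Predictor1 | Predictor2 | … | Predictor1000 |
|---|---|---|---|---|---|
| 1 | 4 | 3 | 4 | … | 4 |
| 2 | 3 | 4 | 2 | … | 2 |
| 3 | 1 | 2 | 1 | … | 2 |
| 4 | 4 | 3 | 4 | … | 5 |
| 5 | 5 | 5 | 5 | … | 4 |
| 6 | 2 | 1 | 2 | … | 4 |
| … | … | … | … | … | … |
| 10000 | 1 | 2 | 4 | … | 1 |

Challenge: Combine the given predictors into an even better one!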

Process (1): Model Construction

Training Data

| NAME | RANK | YEARS | TENURED |
|---|---|---|---|
| Mike | Assistant Prof | 3 | no |
| Mary | Assistant Prof | 7 | yes |
| Bill | Professor | 2 | yes |
| Jim | Associate Prof | 7 | yes |
| Dave | Assistant Prof | 6 | no |
| Anne | Associate Prof | 3 | no |

Classification Algorithms

Classifier (Model):

IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Process (2): Using the Model in Prediction

Classifier

Testing Data

| NAME | RANK | YEARS | TENURED |
|---|---|---|---|
| Tom | Assistant Prof | 2 | no |
| Merlisa | Associate Prof | 7 | no |
| George | Professor | 5 | yes |
| Joseph | Assistant Prof | 7 | yes |

Unseen Data: (Jeff, Professor, 4)

Tenured?
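The two-step procedure can be traced on this example. The sketch below hard-codes the rule from the model-construction slide, measures its accuracy on the four testing tuples, and then applies it to the unseen tuple. The function names are illustrative and not part of the slides.

```python
# Testing data from the slide: (name, rank, years, tenured)
TESTING_DATA = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]


def classify(rank, years):
    """The learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


def accuracy(rows):
    """Fraction of rows whose predicted label matches the known label."""
    correct = sum(1 for _, rank, years, label in rows if classify(rank, years) == label)
    return correct / len(rows)


test_accuracy = accuracy(TESTING_DATA)        # testing step
jeff_prediction = classify("Professor", 4)    # unseen tuple (Jeff, Professor, 4)
print(test_accuracy, jeff_prediction)
```

On the testing data the rule misclassifies one tuple (Merlisa, who has 7 years but is not tenured). Whether that accuracy is "acceptable" is a judgment the slides leave to the user.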
Supervised vs. Unsupervised Learning

Supervised learning (classification)

Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of the training data are unknown

Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Decision Tree Induction: Training Dataset

| age | income | student | credit_rating | buys_computer |
|---|---|---|---|---|
| young | high | no | fair | no |
| young | high | no | excellent | no |
| middle | high | no | fair | yes |
| senior | medium | no | fair | yes |
| senior | low | yes | fair | yes |
| senior | low | yes | excellent | no |
| middle | low | yes | excellent | yes |
| young | medium | no | fair | no |
| young | low | yes | fair | yes |
| senior | medium | yes | fair | yes |
| young | medium | yes | excellent | yes |
| middle | medium | no | excellent | yes |
| middle | high | yes | fair | yes |
| senior | medium | no | excellent | no |
Output: A Decision Tree for “buys_computer”

age = young: student? (no → no; yes → yes)

age = middle: yes

age = senior: credit rating? (fair → yes; excellent → no)

Algorithm for building a Decision Tree from Training Tuples

Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.

Input:

Data partition D, which is a set of training tuples and their associated class labels;

attribute_list, the set of candidate attributes;

Attribute_selection_method, a procedure to determine the splitting criterion that “best” partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split point or splitting subset.

Output: A decision tree.

Method:

(1) create a node N;

(2) if tuples in D are all of the same class C then

(3) return N as a leaf node labeled with the class C;

(4) if attribute_list is empty then

(5) return N as a leaf node labeled with the majority class in D; // majority voting

(6) apply Attribute_selection_method(D, attribute_list) to find the “best” splitting_criterion;

(7) label node N with splitting_criterion;

(8) if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees

(9) attribute_list ← attribute_list − splitting_attribute; // remove splitting_attribute

(10) for each outcome j of splitting_criterion // partition the tuples and grow subtrees for each partition

(11) let D_j be the set of data tuples in D satisfying outcome j; // a partition

(12) if D_j is empty then

(13) attach a leaf labeled with the majority class in D to node N;

(14) else attach the node returned by Generate_decision_tree(D_j, attribute_list) to node N;

endfor

(15) return N;

How to choose the “best” splitting criterion? Coming up: entropy!

Entropy

Entropy measures the uncertainty associated with a random variable.

Intuitively, if we toss a coin, then the entropy is maximum if the coin is fair, i.e., if prob(H) = prob(T) = 0.5.

If the coin is not fair, e.g., if prob(H) = 0.75, prob(T) = 0.25, then the entropy is less (there is less uncertainty, and hence less information gained, associated with tossing the coin).

Calculating entropy: If a random variable X can take values $x_1, x_2, …, x_n$ with probabilities $p_1, p_2, …, p_n$, then the entropy of X is given by the formula:

$$H(X) = -\sum_{i=1}^{n} p_i \log(p_i)$$

(If $p_i = 0$ for some i, then $0 \cdot \log(0)$ is taken to be 0.)

Exercise: Calculate the entropy of a fair coin and one with prob(H) = 0.75, prob(T) = 0.25.

Exercise: Calculate the entropy of a fair die.
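The entropy formula can be evaluated directly. The sketch below assumes logarithms to base 2 (entropy measured in bits), which matches the log2 used in the information-gain formulas of this chapter; the function name and the `base` parameter are illustrative.

```python
import math


def entropy(probabilities, base=2):
    """H(X) = -sum(p_i * log(p_i)); terms with p_i = 0 are skipped (0 * log 0 = 0)."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)


fair_coin = entropy([0.5, 0.5])
biased_coin = entropy([0.75, 0.25])
fair_die = entropy([1 / 6] * 6)
print(round(fair_coin, 3), round(biased_coin, 3), round(fair_die, 3))
```

Running it lets one check hand calculations for the two exercises: the fair coin should come out higher than the biased coin, as the intuition above suggests.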

Entropy of a binary variable (coin)

[Figure: entropy of a binary variable plotted against probability; horizontal axis: Probability, vertical axis: Entropy]

Attribute Selection Measure: Information Gain (ID3)

Select the attribute with the highest information gain

Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$

Expected information (entropy) needed to classify a tuple in D:

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Information needed (after using attribute A to split D into v partitions) to classify D:

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$

Information gained by branching on attribute A:

$$Gain(A) = Info(D) - Info_A(D)$$

Attribute Selection: Information Gain

$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$

| age | $p_i$ | $n_i$ | $I(p_i, n_i)$ |
|---|---|---|---|
| young | 2 | 3 | 0.971 |
| middle | 4 | 0 | 0 |
| senior | 3 | 2 | 0.971 |

$$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

$\frac{5}{14} I(2,3)$ means “young” has 5 out of 14 samples, with 2 yes’s and 3 no’s. Hence

$$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$

Similarly,

$$Gain(income) = 0.029, \quad Gain(student) = 0.151, \quad Gain(credit\_rating) = 0.048$$
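The gains can be recomputed from the 14 training tuples. The following is a minimal sketch that applies the definitions of Info(D), Info_A(D) and Gain(A) directly; the helper names and the tuple layout (class label in the last field) are choices made here for illustration.

```python
import math
from collections import Counter

ATTRIBUTES = ["age", "income", "student", "credit_rating"]
# Training tuples from the slides; the last field is the class buys_computer.
ROWS = [
    ("young", "high", "no", "fair", "no"),
    ("young", "high", "no", "excellent", "no"),
    ("middle", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle", "low", "yes", "excellent", "yes"),
    ("young", "medium", "no", "fair", "no"),
    ("young", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("young", "medium", "yes", "excellent", "yes"),
    ("middle", "medium", "no", "excellent", "yes"),
    ("middle", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def info(rows):
    """Info(D): entropy of the class labels (last field) in rows."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def info_after_split(rows, attr_index):
    """Info_A(D): weighted entropy of the partitions induced by attribute A."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row)
    return sum(len(part) / len(rows) * info(part) for part in partitions.values())


def gain(rows, attr_index):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(rows) - info_after_split(rows, attr_index)


gains = {name: gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, value in gains.items():
    print(f"Gain({name}) = {value:.3f}")
```

At full precision Gain(age) is about 0.2467; the slide's 0.246 results from subtracting the already rounded values 0.940 and 0.694. Either way, age has the highest gain and is chosen as the root split.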
Terminating a Decision Tree

age = young: student? (no → no; yes → yes)

age = middle: yes

age = senior: credit rating? (fair → yes; excellent → no)

The decision tree based on the training set above terminates “perfectly” in that every leaf is “pure”: e.g., in the training set, all young non-students don’t buy computers, all young students buy computers, etc. This may not always be the case.

Exercise: Determine a decision tree for buys_computer based only on the first two attributes, age and income, of the training set above (i.e., imagine student and credit-rating data are not available).
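As a rough sketch of how Generate_decision_tree behaves on this training set, the code below uses information gain as the attribute selection method and multiway splits on discrete attributes. The tree representation (a leaf is a class label, an inner node is a pair of attribute name and branches) is an assumption made for illustration. Because only attribute values that actually occur in a partition are branched on, the empty-partition case of steps (12) and (13) never arises here.

```python
import math
from collections import Counter

ATTRIBUTES = ("age", "income", "student", "credit_rating")
# Training tuples from the slides; the last field is the class buys_computer.
ROWS = [
    ("young", "high", "no", "fair", "no"),
    ("young", "high", "no", "excellent", "no"),
    ("middle", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle", "low", "yes", "excellent", "yes"),
    ("young", "medium", "no", "fair", "no"),
    ("young", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("young", "medium", "yes", "excellent", "yes"),
    ("middle", "medium", "no", "excellent", "yes"),
    ("middle", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def info(rows):
    """Entropy of the class labels in rows."""
    counts = Counter(row[-1] for row in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())


def partition(rows, index):
    """Group rows by the value of the attribute at position index."""
    parts = {}
    for row in rows:
        parts.setdefault(row[index], []).append(row)
    return parts


def gain(rows, index):
    """Information gain of splitting rows on the attribute at position index."""
    parts = partition(rows, index)
    return info(rows) - sum(len(p) / len(rows) * info(p) for p in parts.values())


def majority_class(rows):
    return Counter(row[-1] for row in rows).most_common(1)[0][0]


def generate_decision_tree(rows, attribute_list):
    classes = {row[-1] for row in rows}
    if len(classes) == 1:                       # steps (2)-(3): pure partition
        return classes.pop()
    if not attribute_list:                      # steps (4)-(5): majority voting
        return majority_class(rows)
    best = max(attribute_list,                  # step (6): highest information gain
               key=lambda name: gain(rows, ATTRIBUTES.index(name)))
    remaining = [name for name in attribute_list if name != best]   # step (9)
    branches = {}
    for value, subset in partition(rows, ATTRIBUTES.index(best)).items():  # steps (10)-(14)
        branches[value] = generate_decision_tree(subset, remaining)
    return (best, branches)


tree = generate_decision_tree(ROWS, list(ATTRIBUTES))
age_income_tree = generate_decision_tree(ROWS, ["age", "income"])
print(tree)
```

With all four attributes the result matches the tree shown on the slide. Calling the function with only `["age", "income"]` exercises the majority-voting case, since some partitions remain impure after both attributes are used; ties in the vote are broken arbitrarily in this sketch.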

Computing Information-Gain for Continuous-Value Attributes

Let attribute A be a continuous-valued attribute

Must determine the best split point for A

Sort the values of A in increasing order

Typically, the midpoint between each pair of adjacent values is considered as a possible split point

$(a_i + a_{i+1})/2$ is the midpoint between the values of $a_i$ and $a_{i+1}$

Split:

D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
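A small sketch of the split-point search follows. The sample values and labels are invented for illustration, and the selection rule (pick the midpoint with the lowest expected information Info_A(D), i.e., the highest gain) is an assumption consistent with the information-gain measure, not something stated on this slide.

```python
import math
from collections import Counter

# Hypothetical (value of A, class label) pairs, invented for illustration only.
SAMPLES = [(25, "no"), (32, "yes"), (47, "yes"), (51, "no"), (28, "no"), (39, "yes")]


def info(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def candidate_split_points(values):
    """Midpoints (a_i + a_{i+1}) / 2 between adjacent distinct sorted values."""
    ordered = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]


def expected_info(samples, split_point):
    """Info_A(D) for the binary split D1: A <= split_point, D2: A > split_point."""
    d1 = [label for value, label in samples if value <= split_point]
    d2 = [label for value, label in samples if value > split_point]
    return sum(len(part) / len(samples) * info(part) for part in (d1, d2) if part)


def best_split_point(samples):
    """Assumed selection rule: the midpoint with the lowest Info_A(D)."""
    candidates = candidate_split_points([value for value, _ in samples])
    return min(candidates, key=lambda point: expected_info(samples, point))


midpoints = candidate_split_points([value for value, _ in SAMPLES])
best = best_split_point(SAMPLES)
print(midpoints, best)
```

For these made-up samples the chosen split separates the two youngest tuples (both "no") from the rest.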

Attribute Selection Measure: Gain Ratio (C4.5)

Information gain measure is biased towards attributes with a large number of values

E.g., if the customer table in the previous slides had a first column of customer IDs 1-14, and we used this as the splitting attribute, every partition would be pure and the gain would be the maximum possible (Gain = Info(D))! But, obviously, this split is useless.

C4.5 (a successor of ID3) uses SplitInfo to overcome the problem

$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$

C4.5 normalizes the info gain by SplitInfo

GainRatio(A) = Gain(A)/SplitInfo(A)

Ex. income splits D into partitions of 4, 6, and 4 tuples:

$$SplitInfo_{income}(D) = -\frac{4}{14}\log_2\left(\frac{4}{14}\right) - \frac{6}{14}\log_2\left(\frac{6}{14}\right) - \frac{4}{14}\log_2\left(\frac{4}{14}\right) = 1.557$$

gain_ratio(income) = 0.029/1.557 = 0.019

The attribute with the maximum gain ratio is selected as the splitting attribute
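The SplitInfo formula only needs the partition sizes, so it is easy to evaluate directly. The sketch below does this for income, whose values high, medium, and low cover 4, 6, and 4 of the 14 training tuples; the function names are illustrative.

```python
import math


def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum(|D_j|/|D| * log2(|D_j|/|D|)) over the partitions."""
    total = sum(partition_sizes)
    return -sum(n / total * math.log2(n / total) for n in partition_sizes if n > 0)


def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain / split_info(partition_sizes)


# income: high (4 tuples), medium (6), low (4); Gain(income) = 0.029 from the slides
income_split_info = split_info([4, 6, 4])
income_gain_ratio = gain_ratio(0.029, [4, 6, 4])
print(round(income_split_info, 3), round(income_gain_ratio, 3))
```

A direct evaluation gives a SplitInfo of about 1.557 for income and hence a gain ratio of about 0.019.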
Gini index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, gini index, gini(D) is defined as

$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$

where $p_j$ is the relative frequency of class j in D

Gini measures the relative mean difference of values in the population, i.e., the average of the differences divided by the average pop. value.

It is very important in economics to measure income distribution.

To motivate the above definition, consider n = 2, and a pop. whose values are only 0 and 1; in particular, a fraction p of the pop. is 0 and (1 − p) is 1. Then the possible differences between pop. values are 0 and 1 with probs.

| Diff. | Prob. |
|---|---|
| 0 | $p^2 + (1 - p)^2$ |
| 1 | $2p(1 - p)$ |

Therefore, $gini = 2p(1 - p) = 1 - p^2 - (1 - p)^2 = 1 - \sum_{j=1,2} p_j^2$
Gini index (CART, IBM IntelligentMiner)

Always consider a binary split of each attribute A. If a data set D is split on A into two subsets $D_1$ and $D_2$, the gini index $gini_A(D)$ is defined as

$$gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)$$

Reduction in Impurity:

$$\Delta gini(A) = gini(D) - gini_A(D)$$

The attribute giving the largest reduction in impurity is chosen to split the node (need to enumerate all the possible binary splitting points for each attribute)

Gini index (CART, IBM IntelligentMiner)

Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”

$$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$

Suppose the attribute income partitions D into 10 in $D_1$: {low, medium} and 4 in $D_2$: {high}

$$gini_{income \in \{low, medium\}}(D) = \frac{10}{14} Gini(D_1) + \frac{4}{14} Gini(D_2) = 0.443$$

For the other binary splits of income, gini for {low, high} is 0.458 and gini for {medium, high} is 0.450, so the split {low, medium} / {high} is the best since its gini index is the lowest

All attributes are assumed continuous-valued

May need other tools, e.g., clustering, to get the possible split values

Can be modified for categorical attributes
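The Gini values for every binary split of income can be checked from the training tuples. The sketch below enumerates the two-value subsets of {low, medium, high} (each also stands for the split on its one-value complement) and evaluates the gini definitions directly; names are illustrative.

```python
from collections import Counter
from itertools import combinations

# (income, buys_computer) pairs taken from the 14 training tuples in the slides.
INCOME_AND_CLASS = [
    ("high", "no"), ("high", "no"), ("high", "yes"), ("medium", "yes"),
    ("low", "yes"), ("low", "no"), ("low", "yes"), ("medium", "no"),
    ("low", "yes"), ("medium", "yes"), ("medium", "yes"), ("medium", "yes"),
    ("high", "yes"), ("medium", "no"),
]


def gini(labels):
    """gini(D) = 1 - sum(p_j^2) over the class frequencies p_j."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())


def gini_split(pairs, subset):
    """gini_A(D) for the binary split: attribute value in subset vs. not in subset."""
    d1 = [label for value, label in pairs if value in subset]
    d2 = [label for value, label in pairs if value not in subset]
    n = len(pairs)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)


gini_d = gini([label for _, label in INCOME_AND_CLASS])
values = sorted({value for value, _ in INCOME_AND_CLASS})
splits = {subset: gini_split(INCOME_AND_CLASS, subset) for subset in combinations(values, 2)}
best_subset = min(splits, key=splits.get)
for subset, value in splits.items():
    print(subset, round(value, 3))
```

On this data the lowest value, about 0.443, belongs to the split {low, medium} versus {high}; the other two splits come out at about 0.450 and 0.458.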

Comparing Attribute Selection Measures

The three measures, in general, return good results but

Information gain: biased towards multivalued attributes

Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others

Gini index: biased to multivalued attributes; has difficulty when # of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions

Bayesian Classification: Why?

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities

Foundation: Based on Bayes’ Theorem.

Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers

Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data

Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics

Let X be a data sample (“evidence”): class label is unknown

Let H be a hypothesis that X belongs to class C

Classification is to determine P(H|X) (posteriori probability of H conditioned on X), the probability that the hypothesis holds given the observed data sample X

P(H) (prior probability), the initial probability

E.g., H is “will buy computer” regardless of age, income, …

P(X): prior probability that the evidence is observed

E.g., X is “age in 31..40, medium income”

P(X|H) (posteriori probability of X conditioned on H, i.e., the likelihood), the probability of observing the evidence X, given that the hypothesis holds

E.g., given H that a customer will buy a computer, P(X|H) is the prob. of the evidence X that the customer’s age is 31..40, medium income

Bayesian Theorem

Given training evidence X, posteriori probability of a hypothesis H, P(H|X), follows the Bayes theorem

$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$

Informally, this can be written as

posteriori = likelihood x prior/evidence

Predicts X belongs to $C_i$ iff the probability $P(C_i|X)$ is the highest among all the $P(C_k|X)$ for all the k classes

Practical difficulty: require initial knowledge of many probabilities, significant computational cost

Towards Naïve Bayesian Classifier

Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-dim. attribute vector $X = (x_1, x_2, …, x_n)$

Suppose there are m classes $C_1, C_2, …, C_m$.

Classification is to derive the maximum posteriori, i.e., the maximal $P(C_i|X)$

This can be derived from Bayes’ theorem

$$P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}$$

Since prior prob. P(X) is constant for all classes, maximizing $P(C_i|X)$ is equivalent to maximizing $P(X|C_i)\,P(C_i)$
Derivation of Naïve Bayes Classifier

A simplified assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes), in particular:

$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$$

This greatly reduces the computation cost: Only counts the class distribution

If $A_k$ is categorical, $P(x_k|C_i)$ is the # of tuples in $C_i$ having value $x_k$ for $A_k$ divided by $|C_{i,D}|$ (# of tuples of $C_i$ in D)
Naïve Bayesian Classifier: Training Dataset

Class:

$C_1$: buys_computer = “yes”

$C_2$: buys_computer = “no”

Data sample X = (age = youth, income = medium, student = yes, credit_rating = fair)

| age | income | student | credit_rating | buys_computer |
|---|---|---|---|---|
| youth | high | no | fair | no |
| youth | high | no | excellent | no |
| middle | high | no | fair | yes |
| senior | medium | no | fair | yes |
| senior | low | yes | fair | yes |
| senior | low | yes | excellent | no |
| middle | low | yes | excellent | yes |
| youth | medium | no | fair | no |
| youth | low | yes | fair | yes |
| senior | medium | yes | fair | yes |
| youth | medium | yes | excellent | yes |
| middle | medium | no | excellent | yes |
| middle | high | yes | fair | yes |
| senior | medium | no | excellent | no |
Naïve Bayesian Classifier: An Example

$P(C_i)$:

P(buys_computer = “yes”) = 9/14 = 0.643

P(buys_computer = “no”) = 5/14 = 0.357

Compute $P(X|C_i)$ for each class

P(age = “youth” | buys_computer = “yes”) = 2/9 = 0.222

P(age = “youth” | buys_computer = “no”) = 3/5 = 0.6

P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444

P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4

P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667

P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2

P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667

P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

X = (age = youth, income = medium, student = yes, credit_rating = fair)

$P(X|C_i)$:

P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044

P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

$P(X|C_i) \times P(C_i)$:

P(X|buys_computer = “yes”) x P(buys_computer = “yes”) = 0.044 x 0.643 = 0.028

P(X|buys_computer = “no”) x P(buys_computer = “no”) = 0.019 x 0.357 = 0.007

Therefore, X belongs to class (“buys_computer = yes”)
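The worked example can be reproduced by counting over the training tuples. The sketch below computes P(X|C_i) × P(C_i) for each class using plain relative frequencies (no smoothing); the function name and tuple layout are illustrative choices.

```python
from collections import Counter

# Field order: age, income, student, credit_rating, buys_computer (class label).
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle", "medium", "no", "excellent", "yes"),
    ("middle", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def naive_bayes_scores(rows, sample):
    """Return {class: P(X|C_i) * P(C_i)} using relative frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, count in class_counts.items():
        score = count / len(rows)                      # P(C_i)
        for index, value in enumerate(sample):
            matches = sum(1 for row in rows if row[-1] == label and row[index] == value)
            score *= matches / count                   # P(x_k | C_i)
        scores[label] = score
    return scores


X = ("youth", "medium", "yes", "fair")
scores = naive_bayes_scores(ROWS, X)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The two scores are not probabilities that sum to 1, because the common factor P(X) is left out; only their comparison matters for the prediction.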

Avoiding the 0-Probability Problem

Naïve Bayesian prediction requires each conditional prob. be non-zero. Otherwise, the predicted prob. will be zero

$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$$

Ex. Suppose a dataset with 1000 tuples, income = low (0), income = medium (990), and income = high (10)

Use Laplacian correction (or Laplacian estimator): add 1 to each count, so the total becomes 1000 + 3 = 1003

Prob(income = low) = 1/1003

Prob(income = medium) = 991/1003

Prob(income = high) = 11/1003

The “corrected” prob. estimates are close to their “uncorrected” counterparts
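A minimal sketch of the Laplacian correction for the income example: 1 is added to the count of every attribute value, so the denominator grows by the number of distinct values. The function name is illustrative, and exact fractions are used so the results can be compared with the slide.

```python
from fractions import Fraction


def laplace_estimates(counts):
    """Add 1 to every count; the denominator grows by the number of distinct values."""
    total = sum(counts.values()) + len(counts)
    return {value: Fraction(count + 1, total) for value, count in counts.items()}


income_counts = {"low": 0, "medium": 990, "high": 10}
corrected = laplace_estimates(income_counts)
for value, probability in corrected.items():
    print(value, probability, float(probability))
```

The corrected estimates still sum to 1, and no value is left with probability zero.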

Easy to implement

Good results obtained in most of the cases

Assumption: class conditional independence, therefore loss of accuracy

Practically, dependencies exist among variables

E.g., hospitals: patients: Profile: age, family history, etc.; Symptoms: fever, cough etc.; Disease: lung cancer, diabetes, etc.

Dependencies among these cannot be modeled by Naïve Bayesian Classifier

How to deal with these dependencies? Bayesian Belief Networks

Bayesian Belief Networks

Bayesian belief network allows a subset of the variables to be conditionally independent

A graphical model of causal relationships

Represents dependency among the variables

Gives a specification of joint probability distribution

[Figure: directed graph with nodes X, Y, Z, P; edges X → Z, Y → Z, Y → P]

Nodes: random variables

X and Y are the parents of Z, and Y is the parent of P

No dependency between Z and P

Has no loops or cycles

Bayesian Belief Network: An Example

[Figure: belief network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]

The conditional probability table (CPT) for variable LungCancer:

| | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S) |
|---|---|---|---|---|
| LC | 0.8 | 0.5 | 0.7 | 0.1 |
| ~LC | 0.2 | 0.5 | 0.3 | 0.9 |

CPT shows the conditional probability for each possible combination of its parents

Derivation of the probability of a particular combination of values of X, from CPT:

$$P(x_1, …, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(Y_i))$$
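The product formula can be illustrated on the three variables covered by the LungCancer CPT. In the sketch below, the CPT values come from the slide, while the prior probabilities for FamilyHistory and Smoker are hypothetical numbers invented for illustration. Treating those two nodes as having no parents is likewise an assumption, since the slide text does not list the network's edges.

```python
# CPT for LungCancer from the slide: P(LC = True | FamilyHistory, Smoker).
P_LC_GIVEN_PARENTS = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}

# Hypothetical priors for the assumed parentless nodes; not given in the slides.
P_FAMILY_HISTORY = 0.2
P_SMOKER = 0.3


def bernoulli(p_true, value):
    """P(variable = value) for a binary variable with P(True) = p_true."""
    return p_true if value else 1 - p_true


def joint_probability(family_history, smoker, lung_cancer):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S): one factor per node given its parents."""
    return (
        bernoulli(P_FAMILY_HISTORY, family_history)
        * bernoulli(P_SMOKER, smoker)
        * bernoulli(P_LC_GIVEN_PARENTS[(family_history, smoker)], lung_cancer)
    )


p = joint_probability(family_history=False, smoker=True, lung_cancer=True)
total = sum(
    joint_probability(fh, s, lc)
    for fh in (True, False) for s in (True, False) for lc in (True, False)
)
print(p, total)
```

Summing the joint probability over all eight value combinations gives 1, which is a quick sanity check that the factors define a valid distribution.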

Training Bayesian Networks

Several scenarios:

Given both the network structure and all variables observable: learn only the CPTs

Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning

Network structure unknown, all variables observable: search through the model space to reconstruct network topology

Unknown structure, all hidden variables: No good algorithms known for this purpose

Ref. D. Heckerman: Bayesian networks for data mining

The k-Nearest Neighbor (k-NN) Algorithm

All instances correspond to points in the n-D space

The nearest neighbors are defined in terms of Euclidean distance, dist($X_1$, $X_2$)

Target function could be discrete- or real-valued

For discrete-valued, k-NN returns the most common value (= majority vote) among the k training examples nearest to $x_q$

Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples

[Figure: query point $x_q$ among training examples labeled + and −, with the corresponding Voronoi diagram]

Discussion on the k-NN Algorithm

k-NN for real-valued prediction for a given unknown tuple

Returns the mean values of the k nearest neighbors

Distance-weighted nearest neighbor algorithm

Weight the contribution of each of the k neighbors according to their distance to the query $x_q$

Give greater weight to closer neighbors

$$w \equiv \frac{1}{d(x_q, x_i)^2}$$

Robust to noisy data by averaging k-nearest neighbors

Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes

To overcome it, stretch the axes or eliminate the least relevant attributes

Categorical (non-numerical) attributes, e.g., color, are difficult to handle
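Both variants of k-NN can be sketched in a few lines. The code below shows majority-vote classification and a distance-weighted estimate using the weight 1/d(x_q, x_i)^2; the training points are invented for illustration, and `math.dist` (Python 3.8+) supplies the Euclidean distance.

```python
import math
from collections import Counter

# Hypothetical 2-D training points, invented for illustration: (point, label)
TRAINING = [
    ((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((2.0, 1.0), "+"),
    ((6.0, 6.0), "-"), ((6.5, 7.0), "-"), ((7.0, 6.0), "-"),
]


def k_nearest(training, query, k):
    """The k training examples closest to query in Euclidean distance."""
    return sorted(training, key=lambda example: math.dist(example[0], query))[:k]


def knn_classify(training, query, k):
    """Discrete-valued target: majority vote among the k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(training, query, k))
    return votes.most_common(1)[0][0]


def knn_weighted_estimate(training, query, k):
    """Real-valued target: average weighted by w = 1 / d(x_q, x_i)^2."""
    weighted_sum = weight_total = 0.0
    for point, value in k_nearest(training, query, k):
        distance = math.dist(point, query)
        if distance == 0:
            return value  # query coincides with a training point
        weight = 1 / distance ** 2
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


label = knn_classify(TRAINING, (2.0, 2.0), k=3)
# Hypothetical 1-D regression data: the neighbor at distance 1 counts four times
# as much as the neighbor at distance 2.
estimate = knn_weighted_estimate([((0.0,), 10.0), ((3.0,), 40.0)], (1.0,), k=2)
print(label, estimate)
```

Sorting all training examples for every query keeps the sketch short; it does not address the curse of dimensionality or categorical attributes mentioned above.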

Ensemble Methods: Increasing the Accuracy

Ensemble methods

Use a combination of models to increase accuracy

Combine a series of k learned models, $M_1$, $M_2$, …, $M_k$, with the aim of creating an improved model M*

Popular ensemble methods

Bagging: averaging the prediction over a collection of classifiers

Boosting: weighted vote with a collection of classifiers

Bagging: Bootstrap Aggregation

Analogy: Diagnosis based on multiple doctors’ majority vote

Training

Given a set D of d tuples, at each iteration i, a training set $D_i$ of d tuples is sampled with replacement from D (i.e., bootstrap)

A classifier model $M_i$ is learned for each training set $D_i$

Classification: classify an unknown sample X

Each classifier $M_i$ returns its class prediction

The bagged classifier M* counts the votes and assigns the class with the most votes to X

Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple

Accuracy

Often significantly better than a single classifier derived from D

For noisy data: not considerably worse, more robust

Proved improved accuracy in prediction
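The training and voting steps of bagging can be sketched as follows. The data set is invented for illustration, and the base learner (a 1-nearest-neighbor classifier) is an arbitrary choice made here; the slides do not prescribe one. A seeded random generator keeps the run reproducible.

```python
import random
from collections import Counter

# Hypothetical 1-D training set, invented for illustration: (x, label)
D = [(x, "neg") for x in range(5)] + [(x, "pos") for x in range(10, 15)]


def bootstrap_sample(data, rng):
    """Sample |data| tuples from data with replacement."""
    return [rng.choice(data) for _ in data]


def learn_1nn(sample):
    """Base learner (an assumption; any classifier would do): 1-nearest neighbor."""
    def model(x):
        return min(sample, key=lambda example: abs(example[0] - x))[1]
    return model


def bagging(data, k, seed=0):
    """Learn k models M_1..M_k, each on its own bootstrap sample D_i."""
    rng = random.Random(seed)
    return [learn_1nn(bootstrap_sample(data, rng)) for _ in range(k)]


def bagged_predict(models, x):
    """M*: each model votes; the class with the most votes is assigned to x."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


models = bagging(D, k=25)
prediction_low = bagged_predict(models, 1)
prediction_high = bagged_predict(models, 13)
print(prediction_low, prediction_high)
```

For continuous targets, the vote in `bagged_predict` would be replaced by the average of the individual predictions, as the slide notes.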

Boosting

Analogy: Consult several doctors, based on a combination of weighted diagnoses

Weight assigned based on the previous diagnosis accuracy

How does boosting work?

Weights are assigned to each training tuple

A series of k classifiers is iteratively learned

After a classifier $M_i$ is learned, the weights are updated to allow the subsequent classifier, $M_{i+1}$, to pay more attention to the training tuples that were misclassified by $M_i$

The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy

The boosting algorithm can be extended for the prediction of continuous values

Comparing with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
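The slides do not give a concrete update rule. As one possible instance, the sketch below follows the AdaBoost-style scheme described in Han & Kamber's book: the weighted error of $M_i$ is the sum of the weights of the tuples it misclassifies, the weights of correctly classified tuples are multiplied by error/(1 − error), all weights are renormalized, and the classifier's vote weight is log((1 − error)/error). The function name and the example weights are illustrative.

```python
import math


def adaboost_round(weights, misclassified):
    """One round of AdaBoost-style reweighting.

    weights: current tuple weights (summing to 1);
    misclassified: booleans, True where the current classifier M_i was wrong.
    Returns (new_weights, vote_weight) for M_i.
    """
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0 < error < 0.5:
        raise ValueError("this sketch needs a weighted error strictly between 0 and 0.5")
    # Shrink the weights of correctly classified tuples, then renormalize,
    # so the misclassified tuples gain relative weight.
    scaled = [w if wrong else w * error / (1 - error)
              for w, wrong in zip(weights, misclassified)]
    total = sum(scaled)
    new_weights = [w / total for w in scaled]
    vote_weight = math.log((1 - error) / error)
    return new_weights, vote_weight


weights = [0.25, 0.25, 0.25, 0.25]
new_weights, vote_weight = adaboost_round(weights, [False, False, False, True])
print(new_weights, vote_weight)
```

After the round, the single misclassified tuple carries half of the total weight, so the next classifier is pushed to get it right.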