Data Mining

Comp. Sc. and Inf. Mgmt.

Asian Institute of Technology

Instructor: Dr. Sumanta Guha

Slide Sources: Han & Kamber, "Data Mining: Concepts and Techniques" book; slides by Han and by Han & Kamber; adapted and supplemented by Guha

Chapter 6: Classification and Prediction

Classification vs. Prediction vs. Estimation

Classification is the grouping of existing data.
  E.g., grouping patients based on their medical history.
  E.g., grouping students based on their test scores.

Prediction is the use of existing data values to guess a future value.
  E.g., using a patient's medical history to guess the effect of a treatment.
  E.g., using a student's past test scores to guess her chance of success in a future exam.

Estimation is the prediction of a continuous value.
  E.g., predicting a patient's cholesterol level given certain drugs.
  E.g., predicting a student's score on the GRE knowing her scores on school tests.

Prediction of discrete values is most often based on classification, e.g., a middle-aged, overweight, male smoker has a high chance of heart attack; a young, normal-weight, female non-smoker has a low chance of heart attack.

Question: Why is classification not useful in estimation?

Classification Procedure

Split the known data into two sets: a training set and a testing set.

Next, a two-step classification procedure:

1. Learning step: Use the training data to develop a classification model.

2. Testing step: Use the testing data to test the accuracy of the model. If the accuracy is acceptable, the model is put to use on real data.

AusDM 2009 Competition

Australasian Data Mining Conference 2009 Competition

Sample data type:

#      Rating  Predictor1  Predictor2  ...  Predictor1000
1      4       3           4           ...  4
2      3       4           2           ...  2
3      1       2           1           ...  2
4      4       3           4           ...  5
5      5       5           5           ...  4
6      2       1           2           ...  4
...
10000  1       2           4           ...  1

Challenge: Combine the given predictors into an even better one!

Process (1): Model Construction

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

The classification algorithm produces the Classifier (Model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'

Process (2): Using the Model in Prediction

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

The classifier is then applied to unseen data, e.g., (Jeff, Professor, 4): Tenured?
Supervised vs. Unsupervised Learning

Supervised learning (classification)
  Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
  New data is classified based on the training set.

Unsupervised learning (clustering)
  The class labels of the training data are unknown.
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Decision Tree Induction: Training Dataset

age     income  student  credit_rating  buys_computer
young   high    no       fair           no
young   high    no       excellent      no
middle  high    no       fair           yes
senior  medium  no       fair           yes
senior  low     yes      fair           yes
senior  low     yes      excellent      no
middle  low     yes      excellent      yes
young   medium  no       fair           no
young   low     yes      fair           yes
senior  medium  yes      fair           yes
young   medium  yes      excellent      yes
middle  medium  no       excellent      yes
middle  high    yes      fair           yes
senior  medium  no       excellent      no
Output: A Decision Tree for "buys_computer"

age?
  young  -> student?
              no  -> no
              yes -> yes
  middle -> yes
  senior -> credit rating?
              fair      -> yes
              excellent -> no

Algorithm for Building a Decision Tree from Training Tuples

Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.

Input:
  Data partition D, which is a set of training tuples and their associated class labels;
  attribute_list, the set of candidate attributes;
  Attribute_selection_method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split point or a splitting subset.

Output: A decision tree.

Method:
(1)  create a node N;
(2)  if tuples in D are all of the same class C then
(3)      return N as a leaf node labeled with the class C;
(4)  if attribute_list is empty then
(5)      return N as a leaf node labeled with the majority class in D;  // majority voting
(6)  apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
(7)  label node N with splitting_criterion;
(8)  if splitting_attribute is discrete-valued and multiway splits allowed then  // not restricted to binary trees
(9)      attribute_list <- attribute_list - splitting_attribute;  // remove splitting_attribute
(10) for each outcome j of splitting_criterion  // partition the tuples and grow subtrees for each partition
(11)     let Dj be the set of data tuples in D satisfying outcome j;  // a partition
(12)     if Dj is empty then
(13)         attach a leaf labeled with the majority class in D to node N;
(14)     else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
     endfor
(15) return N;
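To make the pseudocode above concrete, here is a minimal Python sketch of the same recursion, covering only the discrete multiway-split case (steps 8-9). The data layout (each training tuple as a dict with a 'class' key) and the pluggable attribute_selection_method callable are assumptions of this sketch, not something fixed by the slides; an information-gain-based selector is sketched after the entropy slides below.

```python
# A minimal sketch of Generate_decision_tree; comments refer to the numbered steps above.
from collections import Counter

def majority_class(D):
    """Return the most common class label in partition D (majority voting)."""
    return Counter(t['class'] for t in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    classes = {t['class'] for t in D}
    if len(classes) == 1:                        # (2)-(3) all tuples in one class C
        return {'leaf': classes.pop()}
    if not attribute_list:                       # (4)-(5) no candidate attributes left
        return {'leaf': majority_class(D)}
    split_attr = attribute_selection_method(D, attribute_list)    # (6)
    node = {'attribute': split_attr, 'branches': {}}              # (7)
    remaining = [a for a in attribute_list if a != split_attr]    # (9) remove splitting attribute
    for value in {t[split_attr] for t in D}:     # (10) one branch per outcome
        Dj = [t for t in D if t[split_attr] == value]             # (11)
        if not Dj:                               # (12)-(13) empty partition
            node['branches'][value] = {'leaf': majority_class(D)}
        else:                                    # (14) recurse on the partition
            node['branches'][value] = generate_decision_tree(
                Dj, remaining, attribute_selection_method)
    return node                                  # (15)
```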

How to choose the splitting attribute?

Coming up: entropy!

Entropy

Entropy measures the uncertainty associated with a random variable.

Intuitively, if we toss a coin, then the entropy is maximum when the coin is fair, i.e., when prob(H) = prob(T) = 0.5.

If the coin is not fair, e.g., if prob(H) = 0.75 and prob(T) = 0.25, then the entropy is less (there is less uncertainty, i.e., less information gain, associated with tossing the coin).

Calculating entropy: If a random variable X can take values x1, x2, ..., xn with probabilities p1, p2, ..., pn, then the entropy of X is given by the formula:

  H(X) = - Σ_{i=1..n} p_i log(p_i)

(If p_i = 0 for some i, then 0·log(0) is taken to be 0.)

Exercise: Calculate the entropy of a fair coin and of one with prob(H) = 0.75, prob(T) = 0.25.

Exercise: Calculate the entropy of a fair die.
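A small helper for experimenting with the formula; using base-2 logarithms (entropy in bits) is an assumption here, since the slide leaves the base unspecified. It can be used to check the exercises.

```python
import math

def entropy(probs):
    """H(X) = -sum over i of p_i * log2(p_i), with 0*log(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # fair coin: 1.0 bit
print(entropy([0.75, 0.25]))   # biased coin: about 0.811 bits
print(entropy([1/6] * 6))      # fair die: about 2.585 bits
```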

Entropy of a Binary Variable (Coin)

[Figure: entropy of a binary variable plotted against the probability of one outcome; it peaks at 1 bit when the probability is 0.5.]

Attribute Selection Measure: Information Gain (ID3)

Select the attribute with the highest information gain.

Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|.

Expected information (entropy) needed to classify a tuple in D:

  Info(D) = - Σ_{i=1..m} p_i log2(p_i)

Information needed (after using attribute A to split D into v partitions) to classify D:

  Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)

Information gained by branching on attribute A:

  Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain

Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples).

  Info(D) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

age     p_i  n_i  I(p_i, n_i)
young   2    3    0.971
middle  4    0    0
senior  3    2    0.971

The term (5/14) I(2, 3) means "young" has 5 out of the 14 samples, with 2 yes's and 3 no's. Hence

  Info_age(D) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

  Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,

  Gain(income) = 0.029,  Gain(student) = 0.151,  Gain(credit_rating) = 0.048

(The calculation uses the buys_computer training dataset shown earlier.)
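The hand calculation above can be reproduced with a short script. The list-of-tuples encoding of the training data and the helper names are my own conventions, not prescribed by the slides.

```python
import math
from collections import Counter

ATTRS = ['age', 'income', 'student', 'credit_rating']
data = [  # (age, income, student, credit_rating, buys_computer)
    ('young', 'high', 'no', 'fair', 'no'),        ('young', 'high', 'no', 'excellent', 'no'),
    ('middle', 'high', 'no', 'fair', 'yes'),      ('senior', 'medium', 'no', 'fair', 'yes'),
    ('senior', 'low', 'yes', 'fair', 'yes'),      ('senior', 'low', 'yes', 'excellent', 'no'),
    ('middle', 'low', 'yes', 'excellent', 'yes'), ('young', 'medium', 'no', 'fair', 'no'),
    ('young', 'low', 'yes', 'fair', 'yes'),       ('senior', 'medium', 'yes', 'fair', 'yes'),
    ('young', 'medium', 'yes', 'excellent', 'yes'),('middle', 'medium', 'no', 'excellent', 'yes'),
    ('middle', 'high', 'yes', 'fair', 'yes'),     ('senior', 'medium', 'no', 'excellent', 'no'),
]

def info(rows):
    """Info(D) = -sum p_i log2 p_i over the class (last column) distribution."""
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in Counter(r[-1] for r in rows).values())

def gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D) for a discrete attribute A."""
    col = ATTRS.index(attr)
    n = len(rows)
    info_a = 0.0
    for value in {r[col] for r in rows}:
        part = [r for r in rows if r[col] == value]
        info_a += len(part) / n * info(part)
    return info(rows) - info_a

print(round(info(data), 3))            # Info(D) is about 0.940
for a in ATTRS:
    print(a, round(gain(data, a), 3))  # compare with the Gain values above (0.246, 0.029, 0.151, 0.048)
```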
Terminating a Decision Tree

The decision tree built from the buys_computer training set shown earlier terminates "perfectly" in that every leaf is "pure": e.g., in the training set, all young non-students don't buy computers, all young students buy computers, etc. This may not always be the case.

Exercise: Determine a decision tree for buys_computer based only on the first two attributes, age and income, of that training set (i.e., imagine the student and credit_rating data are not available).

Computing Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute.

Must determine the best split point for A:
  Sort the values of A in increasing order.
  Typically, the midpoint between each pair of adjacent values is considered as a possible split point:
    (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}.

Split:
  D1 is the set of tuples in D satisfying A ≤ split_point, and D2 is the set of tuples in D satisfying A > split_point.
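A hedged sketch of the split-point search just described: try the midpoint between each pair of adjacent sorted values and keep the one with the highest information gain. The helper names and the tiny age example at the end are made up purely for illustration.

```python
import math
from collections import Counter

def info_of_labels(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """values: a numeric attribute A; labels: class labels. Returns (split_point, gain)."""
    pairs = sorted(zip(values, labels))
    base = info_of_labels(labels)
    n = len(pairs)
    best = (None, -1.0)
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                   # no midpoint between equal values
        split = (pairs[i][0] + pairs[i + 1][0]) / 2    # midpoint (a_i + a_{i+1}) / 2
        left  = [lab for v, lab in pairs if v <= split]
        right = [lab for v, lab in pairs if v > split]
        gain = base - (len(left) / n * info_of_labels(left)
                       + len(right) / n * info_of_labels(right))
        if gain > best[1]:
            best = (split, gain)
    return best

# e.g. ages with a yes/no label (made-up numbers, for illustration only):
split, g = best_split_point([25, 32, 40, 47, 51], ['no', 'yes', 'yes', 'yes', 'no'])
print(split, round(g, 3))   # 28.5 0.322 on this toy data
```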

Attribute Selection Measure: Gain Ratio (C4.5)

The information gain measure is biased towards attributes with a large number of values.
  E.g., suppose the customer table in the previous slide had a first column of customer IDs 1-14, and we used it as the splitting attribute: every partition would be pure, so the information gain would be maximal. But, obviously, this split is useless.

C4.5 (a successor of ID3) uses SplitInfo to overcome the problem:

  SplitInfo_A(D) = - Σ_{j=1..v} (|D_j| / |D|) × log2(|D_j| / |D|)

C4.5 normalizes the info gain by SplitInfo:

  GainRatio(A) = Gain(A) / SplitInfo_A(D)

Ex. Splitting on income partitions D into 4 (low), 6 (medium), and 4 (high) tuples:

  SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) ≈ 1.557

  gain_ratio(income) = 0.029 / 1.557 ≈ 0.019

The attribute with the maximum gain ratio is selected as the splitting attribute.
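To check the SplitInfo arithmetic above, here is a tiny self-contained sketch; the split_info helper name and the partition-size encoding are my own.

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum (|Dj|/|D|) log2(|Dj|/|D|) over the partitions induced by A."""
    n = sum(partition_sizes)
    return -sum(s / n * math.log2(s / n) for s in partition_sizes)

# income splits the 14 tuples into partitions of sizes 4 (low), 6 (medium), 4 (high):
si = split_info([4, 6, 4])
print(round(si, 3))            # about 1.557
print(round(0.029 / si, 3))    # GainRatio(income) = Gain(income) / SplitInfo_income(D), about 0.019
```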
Gini Index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, the gini index gini(D) is defined as

  gini(D) = 1 - Σ_{j=1..n} p_j²

where p_j is the relative frequency of class j in D.

Gini measures the relative mean difference of values in the population, i.e., the average of the differences divided by the average population value. It is very important in economics for measuring income distribution.

To motivate the above definition, consider n = 2 and a population whose values are only 0 and 1; in particular, a fraction p of the population is 0 and (1 - p) is 1. Then the possible differences between population values are 0 and 1, with probabilities:

  Diff.  Prob.
  0      p² + (1 - p)²
  1      2p(1 - p)

Therefore, gini = 2p(1 - p) = 1 - p² - (1 - p)² = 1 - Σ_{j=1,2} p_j²
Gini Index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, the gini index gini(D) is defined as

  gini(D) = 1 - Σ_{j=1..n} p_j²

where p_j is the relative frequency of class j in D.

Always consider a binary split of each attribute A. If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

Reduction in impurity:

  Δgini(A) = gini(D) - gini_A(D)

The attribute giving the largest reduction in impurity is chosen to split the node (need to enumerate all the possible binary splitting points for each attribute).
Gini Index (CART, IBM IntelligentMiner)

Ex. D has 9 tuples with buys_computer = "yes" and 5 with "no":

  gini(D) = 1 - (9/14)² - (5/14)² = 0.459

Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:

  gini_income∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)

but gini_income∈{medium,high} is 0.30 and thus the best since it is the lowest.

All attributes are assumed continuous-valued.
May need other tools, e.g., clustering, to get the possible split values.
Can be modified for categorical attributes.
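A sketch of the gini calculations, with the income and buys_computer columns hard-coded in table order (my own encoding); the gini_split helper scores any binary grouping of an attribute's values, as the slide describes.

```python
from collections import Counter

income = ['high', 'high', 'high', 'medium', 'low', 'low', 'low', 'medium', 'low',
          'medium', 'medium', 'medium', 'high', 'medium']
buys   = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes',
          'yes', 'yes', 'yes', 'yes', 'no']

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2 over the class distribution."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(values, labels, subset):
    """Weighted gini of the binary split 'value in subset' vs. the rest."""
    n = len(labels)
    d1 = [lab for v, lab in zip(values, labels) if v in subset]
    d2 = [lab for v, lab in zip(values, labels) if v not in subset]
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

print(round(gini(buys), 3))                                   # 0.459
print(round(gini_split(income, buys, {'low', 'medium'}), 3))  # income in {low, medium} vs {high}: about 0.443 on this data
```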















Comparing Attribute Selection Measures

The three measures, in general, return good results, but:

Information gain:
  biased towards multivalued attributes

Gain ratio:
  tends to prefer unbalanced splits in which one partition is much smaller than the others

Gini index:
  biased towards multivalued attributes
  has difficulty when the number of classes is large
  tends to favor tests that result in equal-sized partitions and purity in both partitions

Bayesian Classification: Why?

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.

Foundation: Based on Bayes' theorem.

Performance: A simple Bayesian classifier, the naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers.

Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.

Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.

Bayesian Theorem: Basics

Let X be a data sample ("evidence"): class label is unknown.

Let H be a hypothesis that X belongs to class C.

Classification is to determine P(H|X) (posteriori probability of H conditioned on X), the probability that the hypothesis holds given the observed data sample X.

P(H) (prior probability): the initial probability.
  E.g., H is "will buy computer", regardless of age, income, ...

P(X): prior probability that an evidence is observed.
  E.g., X is "age in 31..40, medium income".

P(X|H) (posteriori probability of X conditioned on H): the probability of observing the evidence X, given that the hypothesis holds.
  E.g., given H that a customer will buy a computer, P(X|H) is the prob. of the evidence X that the customer's age is 31..40 with medium income.
Bayesian Theorem

Given training evidence X, the posteriori probability of a hypothesis H, P(H|X), follows the Bayes theorem:

  P(H|X) = P(X|H) P(H) / P(X)

Informally, this can be written as

  posteriori = likelihood x prior / evidence

Predicts that X belongs to C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for all the k classes.

Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost.
Towards Naïve Bayesian Classifier

Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dim. attribute vector X = (x1, x2, ..., xn).

Suppose there are m classes C1, C2, ..., Cm.

Classification is to derive the maximum posteriori, i.e., the maximal P(C_i|X).

This can be derived from Bayes' theorem:

  P(C_i|X) = P(X|C_i) P(C_i) / P(X)

Since the prior prob. P(X) is constant for all classes, maximizing P(C_i|X) is equivalent to maximizing P(X|C_i) P(C_i).
Derivation of Naïve Bayes Classifier

A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes), in particular:

  P(X|C_i) = Π_{k=1..n} P(x_k|C_i) = P(x1|C_i) × P(x2|C_i) × ... × P(xn|C_i)

This greatly reduces the computation cost: only counts the class distribution.

If A_k is categorical, P(x_k|C_i) is the number of tuples in C_i having value x_k for A_k, divided by |C_i,D| (the number of tuples of C_i in D).
Naïve Bayesian Classifier: Training Dataset

Classes:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'

Data sample to classify:
  X = (age = youth, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
youth   high    no       fair           no
youth   high    no       excellent      no
middle  high    no       fair           yes
senior  medium  no       fair           yes
senior  low     yes      fair           yes
senior  low     yes      excellent      no
middle  low     yes      excellent      yes
youth   medium  no       fair           no
youth   low     yes      fair           yes
senior  medium  yes      fair           yes
youth   medium  yes      excellent      yes
middle  medium  no       excellent      yes
middle  high    yes      fair           yes
senior  medium  no       excellent      no
Naïve Bayesian Classifier: An Example

P(C_i):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no") = 5/14 = 0.357

Compute P(X|C_i) for each class:
  P(age = "youth" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "youth" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

X = (age = youth, income = medium, student = yes, credit_rating = fair)

P(X|C_i):
  P(X|buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X|buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|C_i) * P(C_i):
  P(X|buys_computer = "yes") * P(buys_computer = "yes") = 0.028
  P(X|buys_computer = "no") * P(buys_computer = "no") = 0.007

Therefore, X belongs to class "buys_computer = yes".
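The calculation above can be reproduced with a few lines of Python. The hard-coded table and the scoring function are a sketch of the method, assuming no smoothing (the Laplacian correction is on the next slide).

```python
# Naive Bayesian scoring for the buys_computer example; class label is the last field.
rows = [
    ('youth', 'high', 'no', 'fair', 'no'),        ('youth', 'high', 'no', 'excellent', 'no'),
    ('middle', 'high', 'no', 'fair', 'yes'),      ('senior', 'medium', 'no', 'fair', 'yes'),
    ('senior', 'low', 'yes', 'fair', 'yes'),      ('senior', 'low', 'yes', 'excellent', 'no'),
    ('middle', 'low', 'yes', 'excellent', 'yes'), ('youth', 'medium', 'no', 'fair', 'no'),
    ('youth', 'low', 'yes', 'fair', 'yes'),       ('senior', 'medium', 'yes', 'fair', 'yes'),
    ('youth', 'medium', 'yes', 'excellent', 'yes'),('middle', 'medium', 'no', 'excellent', 'yes'),
    ('middle', 'high', 'yes', 'fair', 'yes'),     ('senior', 'medium', 'no', 'excellent', 'no'),
]

def naive_bayes_score(x, cls):
    """P(X|C) * P(C), with each P(x_k|C) estimated by simple counting."""
    in_class = [r for r in rows if r[-1] == cls]
    score = len(in_class) / len(rows)               # prior P(C)
    for k, value in enumerate(x):                   # conditional independence assumption
        score *= sum(1 for r in in_class if r[k] == value) / len(in_class)
    return score

x = ('youth', 'medium', 'yes', 'fair')
for cls in ('yes', 'no'):
    print(cls, round(naive_bayes_score(x, cls), 3))   # yes: 0.028, no: 0.007
```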



Avoiding the 0-Probability Problem

Naïve Bayesian prediction requires each conditional prob. to be non-zero; otherwise, the predicted prob. will be zero:

  P(X|C_i) = Π_{k=1..n} P(x_k|C_i)

Ex. Suppose a dataset with 1000 tuples, with income = low (0), income = medium (990), and income = high (10).

Use the Laplacian correction (or Laplacian estimator): add 1 to each case:
  Prob(income = low) = 1/1003
  Prob(income = medium) = 991/1003
  Prob(income = high) = 11/1003

The "corrected" prob. estimates are close to their "uncorrected" counterparts.
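A tiny sketch of the Laplacian correction above, with the counts hard-coded from the example.

```python
counts = {'low': 0, 'medium': 990, 'high': 10}
total = sum(counts.values()) + len(counts)            # 1000 + 3 = 1003
smoothed = {k: (v + 1) / total for k, v in counts.items()}
print(smoothed)   # approximately 1/1003, 991/1003, 11/1003
```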
Naïve Bayesian Classifier: Comments

Advantages:
  Easy to implement
  Good results obtained in most of the cases

Disadvantages:
  Assumption of class conditional independence, therefore loss of accuracy
  Practically, dependencies exist among variables
    E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    Dependencies among these cannot be modeled by the naïve Bayesian classifier

How to deal with these dependencies? Bayesian belief networks.
Bayesian Belief Networks

A Bayesian belief network allows a subset of the variables to be conditionally independent.

A graphical model of causal relationships:
  Represents dependency among the variables
  Gives a specification of the joint probability distribution

[Example graph on four variables X, Y, Z, P:]
  Nodes: random variables
  Links: dependency
  X and Y are the parents of Z, and Y is the parent of P
  No dependency between Z and P
  Has no loops or cycles

Bayesian Belief Network: An Example

[Network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer.]

The conditional probability table (CPT) for the variable LungCancer:

        (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC      0.8      0.5       0.7       0.1
~LC     0.2      0.5       0.3       0.9

The CPT shows the conditional probability for each possible combination of its parents.

Derivation of the probability of a particular combination of values of X from the CPTs:

  P(x1, ..., xn) = Π_{i=1..n} P(x_i | Parents(Y_i))
Training Bayesian Networks

Several scenarios:

  Given both the network structure and all variables observable: learn only the CPTs.

  Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning.

  Network structure unknown, all variables observable: search through the model space to reconstruct the network topology.

  Unknown structure, all hidden variables: no good algorithms known for this purpose.

Ref.: D. Heckerman, Bayesian networks for data mining.

The k-Nearest Neighbor (k-NN) Algorithm

All instances correspond to points in the n-D space.

The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2).

The target function could be discrete- or real-valued.

For discrete-valued targets, k-NN returns the most common value (= majority vote) among the k training examples nearest to the query point x_q.

Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

[Figure: a query point x_q among "+" and "-" training examples, with the Voronoi cells induced by 1-NN.]
Discussion on the k-NN Algorithm

k-NN for real-valued prediction for a given unknown tuple:
  Returns the mean of the values of the k nearest neighbors.

Distance-weighted nearest neighbor algorithm:
  Weight the contribution of each of the k neighbors according to its distance to the query x_q: give greater weight to closer neighbors, e.g.,

    w = 1 / d(x_q, x_i)²

Robust to noisy data by averaging over the k nearest neighbors.

Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes.
  To overcome it, stretch the axes or eliminate the least relevant attributes.

Categorical (non-numerical) attributes, e.g., color, are difficult to handle.
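A compact sketch of k-NN as discussed: Euclidean distance, majority vote for discrete targets, and 1/d² distance weighting for real-valued prediction. The data layout (lists of (point, target) pairs) and the toy points are my own.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, xq, k=3):
    """Majority vote among the k nearest neighbors of xq."""
    nearest = sorted(train, key=lambda pl: euclidean(pl[0], xq))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def knn_regress_weighted(train, xq, k=3, eps=1e-12):
    """Distance-weighted average of the k nearest targets, with w = 1/d^2."""
    nearest = sorted(train, key=lambda pl: euclidean(pl[0], xq))[:k]
    weights = [1.0 / (euclidean(p, xq) ** 2 + eps) for p, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

points = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'), ((4.0, 4.2), '-'), ((4.1, 3.9), '-')]
print(knn_classify(points, (1.1, 1.0), k=3))                    # '+'

values = [((1.0,), 2.0), ((2.0,), 4.0), ((3.0,), 6.0)]
print(round(knn_regress_weighted(values, (1.1,), k=2), 2))      # about 2.02, pulled toward the closest target
```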

Ensemble Methods: Increasing the Accuracy

Ensemble methods:
  Use a combination of models to increase accuracy
  Combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M*

Popular ensemble methods:
  Bagging: averaging the prediction over a collection of classifiers
  Boosting: weighted vote with a collection of classifiers

Bagging: Bootstrap Aggregation

Analogy: diagnosis based on multiple doctors' majority vote.

Training:
  Given a set D of d tuples, at each iteration i, a training set D_i of d tuples is sampled with replacement from D (i.e., bootstrap).
  A classifier model M_i is learned for each training set D_i.

Classification: classify an unknown sample X:
  Each classifier M_i returns its class prediction.
  The bagged classifier M* counts the votes and assigns the class with the most votes to X.

Prediction: can be applied to the prediction of continuous values by taking the average of each classifier's prediction for a given test tuple.

Accuracy:
  Often significantly better than a single classifier derived from D.
  For noisy data: not considerably worse, more robust.
  Proven improved accuracy in prediction.
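A hedged sketch of bagging: draw bootstrap samples of D, learn one model M_i per sample, and let M* take a majority vote. The 1-nearest-neighbor base learner and the toy data are my own choices for illustration; any classifier could be plugged in.

```python
import math
import random
from collections import Counter

random.seed(0)   # for reproducibility of the bootstrap samples

def nn1_learner(sample):
    """Return a 1-NN classifier trained on one bootstrap sample of (point, label) pairs."""
    def classify(x):
        point, label = min(sample, key=lambda pl: math.dist(pl[0], x))
        return label
    return classify

def bagging(D, learner, k=11):
    models = []
    for _ in range(k):
        bootstrap = [random.choice(D) for _ in range(len(D))]   # sample with replacement
        models.append(learner(bootstrap))
    def bagged_classify(x):                                     # majority vote of the k models
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return bagged_classify

D = [((1.0, 1.0), '+'), ((1.3, 0.7), '+'), ((4.0, 4.1), '-'), ((3.8, 4.3), '-')]
M_star = bagging(D, nn1_learner, k=11)
print(M_star((1.2, 0.9)))   # '+' expected: the query sits among the '+' training points
```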

Boosting

Analogy: consult several doctors, based on a combination of weighted diagnoses, with each weight assigned based on the previous diagnosis accuracy.

How does boosting work?
  Weights are assigned to each training tuple.
  A series of k classifiers is iteratively learned.
  After a classifier M_i is learned, the weights are updated to allow the subsequent classifier, M_{i+1}, to pay more attention to the training tuples that were misclassified by M_i.
  The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy.

The boosting algorithm can be extended for the prediction of continuous values.

Comparing with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data.

AdaBoost (Freund and Schapire, 1997)

AdaBoost lecture slides by Šochman and Matas