UNIT 4 Classification & prediction

brewerobstructionAI and Robotics

Nov 7, 2013 (3 years and 11 months ago)




Classification vs. Prediction



Classification:

o

predicts categorical class labels (discrete or nominal)

o

classifies data (constructs a model) based on the training set and the values
(class labels) in a classifying
attribute and uses it in classifying new data



Prediction:

o

models continuous-valued functions, i.e., predicts unknown or missing values



Typical Applications

o

credit approval

o

target marketing

o

medical diagnosis

o

treatment effectiveness analysis




Classification: A Two-Step Process



Model construction: describing a set of predetermined classes

o

Each tuple/sample is assumed to belong to a predefined class, as determined by
the class label attribute

o

The set of tuples used for model construction is the training set

o

The model is represented as classification rules, decision trees, or mathematical
formulae



Model usage: for classifying future or unknown objects

o

Estimate accuracy of the model



The known label of each test sample is compared with the classified result
from the model



Accuracy rate is the percentage of test set samples that are correctly
classified by the model



The test set must be independent of the training set; otherwise the accuracy
estimate is overly optimistic, because overfitting goes undetected

o

If the accuracy is acceptable, use the model to classify data tuples whose class
labels are not known
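The accuracy estimate in the model-usage step can be sketched as follows. This is a minimal illustration, assuming the model is any callable that maps a sample to a class label; the names `accuracy`, `model`, and `test_set` are hypothetical and do not come from the notes.

```python
def accuracy(model, test_set):
    """Fraction of test samples whose known label matches the model's output.

    `model` is assumed to be a callable: sample -> predicted label.
    `test_set` is a list of (sample, known_label) pairs that were NOT used
    to build the model.
    """
    correct = sum(1 for sample, label in test_set if model(sample) == label)
    return correct / len(test_set)


# Hypothetical usage: a trivial threshold "model" on a single number.
toy_model = lambda x: "yes" if x > 0 else "no"
toy_test_set = [(1, "yes"), (-1, "no"), (2, "no"), (3, "yes")]
toy_accuracy = accuracy(toy_model, toy_test_set)
```

Multiplying the returned fraction by 100 gives the accuracy rate as a percentage.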



Supervised vs. Unsupervised Learning



Supervised learning (classification)

o

Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations

o

New data is classified based on the training set



Unsupervised learning (clustering)

o

The class labels of the training data are unknown

o

Given a set of measurements, observations, etc., the aim is to establish the
existence of classes or clusters in the data



Issues Regarding Classification and Prediction (1): Data Preparation



Data cleaning

o

Preprocess data in order to reduce noise and handle missing values



Relevance analysis (feature selection)

o

Remove the irrelevant or redundant attributes



Data transformation

o

Generalize and/or normalize data



Issues Regarding Classification and Prediction (2): Evaluating Classification Methods



Predictive accuracy



Speed and scalability

o

time to construct the model

o

time to use the model



Robustness

o

handling noise and missing values



Scalability

o

efficiency in disk-resident databases



Interpretability:

o

understanding and insight provided by the model



Goodness of rules

o

decision tree size

o

compactness of classification rules



Training Dataset













age    | income | student | credit_rating | buys_computer
<=30   | high   | no      | fair          | no
<=30   | high   | no      | excellent     | no
31…40  | high   | no      | fair          | yes
>40    | medium | no      | fair          | yes
>40    | low    | yes     | fair          | yes
>40    | low    | yes     | excellent     | no
31…40  | low    | yes     | excellent     | yes
<=30   | medium | no      | fair          | no
<=30   | low    | yes     | fair          | yes
>40    | medium | yes     | fair          | yes
<=30   | medium | yes     | excellent     | yes
31…40  | medium | no      | excellent     | yes
31…40  | high   | yes     | fair          | yes
>40    | medium | no      | excellent     | no





Algorithm for Decision Tree Induction



Basic algorithm (a greedy algorithm)

o

Tree is constructed in a top-down recursive divide-and-conquer manner

o

At start, all the training examples are at the root

o

Attributes are categorical (if continuous-valued, they are discretized in advance)

o

Examples are partitioned recursively based on selected attributes

o

Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)



Conditions for stopping partitioning

o

All samples for a given node belong to the same class

o

There are no remaining attributes for further partitioning; majority voting is
employed for classifying the leaf

o

There are no samples left



Attribute Selection Measure: Information Gain (ID3/C4.5)



Select the attribute with the highest information gain



S contains s_i tuples of class C_i for i = 1, …, m, with s = s_1 + … + s_m



The expected information required to classify an arbitrary tuple:

$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$$



The entropy of attribute A with values {a_1, a_2, …, a_v}, where s_ij is the number of tuples of class C_i with A = a_j:

$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})$$



The information gained by branching on attribute A:

$$\mathrm{Gain}(A) = I(s_1, s_2, \ldots, s_m) - E(A)$$



Attribute Selection by Information Gain Computation



Class P: buys_computer = “yes”



Class N: buys_computer = “no”



I(p, n) = I(9, 5) = 0.940



Compute the entropy for age:

age    | p_i | n_i | I(p_i, n_i)
<=30   | 2   | 3   | 0.971
31…40  | 4   | 0   | 0
>40    | 3   | 2   | 0.971

$$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

o

Here the term (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence Gain(age) = I(p, n) - E(age) = 0.940 - 0.694 = 0.246
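The computation above can be checked with a short script. This is a sketch that assumes the 14-row buys_computer training dataset from these notes. The function names `info` and `gain` are illustrative choices, not part of ID3/C4.5 itself.

```python
from collections import Counter
from math import log2

COLUMNS = ("age", "income", "student", "credit_rating")
# Each row: (age, income, student, credit_rating, buys_computer)
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def info(labels):
    """I(s_1, ..., s_m): expected information (entropy) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gain(rows, attr_index):
    """Gain(A) = I(s_1, ..., s_m) - E(A) for the attribute in column attr_index."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[-1])
    expected = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info([row[-1] for row in rows]) - expected


gains = {name: gain(ROWS, i) for i, name in enumerate(COLUMNS)}
best_attribute = max(gains, key=gains.get)
```

With full floating-point precision the gains come out close to the hand-computed values, and `best_attribute` is the attribute a greedy tree builder would choose to split on first.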









Extracting Classification Rules from Trees



Represent the knowledge in the form of IF-THEN rules



One rule is created for each path from the root to a leaf



Each attribute-value pair along a path forms a conjunction



The leaf node holds the class prediction



Rules are easier for humans to understand



Example



IF age = “<=30” AND student = “no” THEN buys_computer = “no”

IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”

IF age = “31…40” THEN buys_computer = “yes”

IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”

IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
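One way to see that each root-to-leaf path becomes one rule is to write the rules as a function. The sketch below uses the rules that the training dataset in these notes supports: for age “>40”, a fair credit rating leads to “yes” and an excellent one to “no”. The function name `classify` is a hypothetical choice.

```python
def classify(age, student, credit_rating):
    """Apply the IF-THEN rules extracted from the buys_computer decision tree."""
    if age == "<=30":
        # Path: age -> student
        return "yes" if student == "yes" else "no"
    if age == "31…40":
        # This branch is a leaf: every training tuple here buys a computer.
        return "yes"
    if age == ">40":
        # Path: age -> credit_rating
        return "yes" if credit_rating == "fair" else "no"
    raise ValueError(f"unknown age group: {age!r}")
```

Each `if` branch corresponds to one conjunction of attribute-value pairs, and the returned value is the class held by the leaf.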




Avoid Overfitting in Classification



Overfitting: An induced tree may overfit the training data

o

Too many branches, some may reflect anomalies due to noise or outliers

o

Poor accuracy for unseen samples



Two approaches to avoid overfitting

o

Prepruning: Halt tree construction early; do not split a node if this would result
in the goodness measure falling below a threshold



Difficult to choose an appropriate threshold

o

Postpruning: Remove branches from a “fully grown” tree to get a sequence of
progressively pruned trees



Use a set of data different from the training data to decide which is the
“best pruned tree”



Enhancements to basic decision tree induction



Allow for continuous-valued attributes

o

Dynamically define new discrete-valued attributes that partition the continuous
attribute value into a discrete set of intervals



Handle missing attribute values

o

Assign the most common value of the attribute

o

Assign probability to each of the possible values



Attribute construction

o

Create new attributes based on existing ones that are sparsely represented

o

This reduces fragmentation, repetition, and replication



Information gain of each attribute on the buys_computer training dataset:

$$\mathrm{Gain}(age) = I(p, n) - E(age) = 0.246$$

$$\mathrm{Gain}(income) = 0.029, \quad \mathrm{Gain}(student) = 0.151, \quad \mathrm{Gain}(credit\_rating) = 0.048$$



Classification in Large Databases



Classification: a classical problem extensively studied by statisticians and machine
learning researchers



Scalability: Classifying data sets with millions of examples and
hundreds of attributes
with reasonable speed



Why decision tree induction in data mining?

o

relatively faster learning speed (than other classification methods)

o

convertible to simple and easy to understand classification rules

o

can use SQL queries for accessing databases

o

comparable classification accuracy with other methods



Scalable Decision Tree Induction Methods in Data Mining Studies



SLIQ (EDBT’96, Mehta et al.)

o

builds an index for each attribute and only class list and the current attribute list
reside in memory



SPRINT (VLDB’96, J. Shafer et al.)

o

constructs an attribute list data structure



PUBLIC (VLDB’98, Rastogi & Shim)

o

integrates tree splitting and tree pruning: stop growing the tree earlier



RainForest (VLDB’98, Gehrke, Ramakrishnan & Ganti)

o

separates the scalability aspects from the criteria that determine the quality of
the tree

o

builds an AVC-list (attribute, value, class label)



Data Cube-Based Decision-Tree Induction



Integration of generalization with decision-tree induction (Kamber et al’97).



Classification at primitive concept levels

o

E.g., precise temperature, humidity, outlook, etc.

o

Low-level concepts, scattered classes, bushy classification trees

o

Semantic interpretation problems.



Cube-based multi-level classification

o

Relevance analysis at multiple levels.

o

Information-gain analysis with dimension + level.



Bayesian Classification: Why?



Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most
practical approaches to certain types of learning problems



Incremental: Each training example can incrementally increase/decrease the probability
that a hypothesis is correct. Prior knowledge can be combined with observed data.



Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities



Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured



Bayesian Theorem: Basics



Let X be a data sample whose class label is unknown



Let H be a hypothesis that X belongs to class C



For classification problems, determine P(H|X): the probability that the hypothesis holds
given the observed data sample X



P(H): prior probability of hypothesis H (i.e. the initial probability before we observe any
data, reflects the background knowledge)



P(X): probability that sample data is observed



P(X|H) : probability of observing the sample X, given that the hypothesis holds



Bayesian Theorem



Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes'
theorem:

$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$



Informally, this can be written as

posterior = likelihood x prior / evidence



MAP (maximum a posteriori) hypothesis:

$$h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h)\,P(h)$$



Practical difficulty: requires initial knowledge of many probabilities, significant
computational cost



Naïve Bayes Classifier



A simplifying assumption: attributes are conditionally independent given the class:

$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$$



The probability of observing, say, two elements x_1 and x_2 together, given the current class C, is the
product of the probabilities of each element taken separately, given the same class:
P([x_1, x_2] | C) = P(x_1 | C) * P(x_2 | C)



No dependence relation between attributes



Greatly reduces the computation cost, only count the class distribution.



Once the probability P(X|C_i) is known, assign X to the class with maximum P(X|C_i) * P(C_i)



Naïve Bayesian Classifier: Example



Compute P(X|Ci) for each class



P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222



P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6



P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444



P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4



P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667



P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2



P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667



P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4



X = (age<=30, income=medium, student=yes, credit_rating=fair)



P(X|Ci):

o

P(X|buys_computer=“yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044

o

P(X|buys_computer=“no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019



P(X|Ci)*P(Ci):

o

P(X|buys_computer=“yes”) * P(buys_computer=“yes”) = 0.044 x 9/14 = 0.028

o

P(X|buys_computer=“no”) * P(buys_computer=“no”) = 0.019 x 5/14 = 0.007



X belongs to class “buys_computer=yes”
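The worked example can be reproduced by counting directly on the training dataset. The following is a minimal sketch, assuming the 14-row buys_computer table from these notes and plain relative-frequency estimates with no smoothing. The names `train` and `score` are illustrative.

```python
from collections import Counter, defaultdict

# Each row: (age, income, student, credit_rating, buys_computer)
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def train(rows):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row in rows:
        for i, value in enumerate(row[:-1]):
            value_counts[(i, row[-1])][value] += 1
    return class_counts, value_counts


def score(x, class_counts, value_counts):
    """Return P(X|Ci) * P(Ci) for every class Ci, assuming attribute independence."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        p = n_label / total  # prior P(Ci)
        for i, value in enumerate(x):
            p *= value_counts[(i, label)][value] / n_label  # P(x_k | Ci)
        scores[label] = p
    return scores


class_counts, value_counts = train(ROWS)
scores = score(("<=30", "medium", "yes", "fair"), class_counts, value_counts)
predicted = max(scores, key=scores.get)
```

An attribute value never seen with a class gets probability zero here, which zeroes the whole product; practical implementations usually add a smoothing correction, which this sketch leaves out.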






Naïve Bayesian Classifier: Comments



Advantages :

o

Easy to implement

o

Good results obtained in most of the cases



Disadvantages

o

Assumption: class conditional independence, therefore loss of accuracy

o

Practically, dependencies exist among variables

o

E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.),
disease (lung cancer, diabetes, etc.)

o

Dependencies among these cannot be modeled by the Naïve Bayesian Classifier



How to deal with these dependencies?

o

Bayesian Belief Networks



Classification



Classification:

o

predicts categorical class labels



Typical Applications

o

{credit history, salary} -> credit approval (Yes/No)

o

{Temp, Humidity} -> Rain (Yes/No)






Binary Classification problem



The data above the red line belongs to class ‘x’



The data below red line belongs to class ‘o’



Examples: SVM, Perceptron, Probabilistic Classifiers



Discriminative Classifiers

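The classifier is a function h that maps an input x to a class label y:

$$h: X \to Y, \quad x \in X = \{0,1\}^n, \quad y \in Y = \{0,1\}, \quad y = h(x)$$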








Advantages

o

prediction accuracy is generally high (as compared to Bayesian methods in general)

o

robust, works when training examples contain errors

o

fast evaluation of the learned target function (Bayesian networks are normally slow)



Criticism

o

long training time

o

difficult to understand the learned function (weights) (Bayesian networks can be used
easily for pattern discovery)

o

not easy to incorporate domain knowledge (easy in Bayesian methods, in the form of
priors on the data or distributions)



Neural Networks



Analogy to Biological
Systems (Indeed a great example of a good learning system)



Massive Parallelism allowing for computational efficiency



The first learning algorithm came in 1959 (Rosenblatt), who suggested that if a target
output value is provided for a single neuron with fixed inputs, one can incrementally
change weights to learn to produce these outputs using the perceptron learning rule






The n-dimensional input vector x is mapped into variable y by means of the scalar
product and a nonlinear function mapping. For example:

$$y = \operatorname{sign}\left(\sum_{i=0}^{n} w_i x_i + \mu_k\right)$$

where w_i are the weights and μ_k is the bias of unit k



Network Training



The ultimate objective of training

o

obtain a set of weights that makes almost all the tuples in the training data
classified correctly



Steps

o

Initialize weights with random values

o

Feed the input tuples into the network one by one

o

For each unit



Compute the net input to the unit as a linear combination of all the
inputs to the unit



Compute the output value using the activation function



Compute the error



Update the weights and the bias
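The training steps above can be sketched for the simplest case, a single unit trained with the perceptron learning rule. This is an illustration under stated assumptions, not a multi-layer network trainer: the sign activation, the learning rate, the epoch count, and the names `train_perceptron` and `predict` are all choices made for this example.

```python
import random


def predict(weights, bias, x):
    """Net input is a linear combination of the inputs; the activation is sign."""
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if net >= 0 else -1


def train_perceptron(samples, epochs=100, rate=0.1, seed=0):
    """samples: list of (input tuple, target) pairs with targets +1 or -1."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    # Step 1: initialize weights and bias with small random values.
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        # Step 2: feed the input tuples one by one.
        for x, target in samples:
            output = predict(weights, bias, x)  # net input, then activation
            error = target - output             # zero when classified correctly
            # Update the weights and the bias in proportion to the error.
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error
    return weights, bias
```

For linearly separable data such as logical AND with targets +1 and -1, this rule reaches a set of weights that classifies every training tuple correctly, given enough epochs.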



Network Pruning and Rule Extraction



Network pruning

o

Fully connected network will be hard to articulate

o

N input nodes, h hidden nodes and m output nodes lead to h(m+N) weights

o

Pruning: Remove some of the links without affecting classification accuracy of
the network



Extracting rules from a trained network

o

Discretize activation values; replace individual activation value by the cluster
average maintaining the network accuracy

o

Enumerate the output from the discretized activation values to find rules
between activation value and output

o

Find the relationship between the input and activation value

o

Combine the above two to have rules relating the output to input



Association-Based Classification



Several methods for association-based classification

o

ARCS: Quantitative association mining and clustering of association rules (Lent et
al’97)



It beats C4.5 in (mainly) scalability and also accuracy

o

Associative classification: (Liu et al’98)



It mines high support and high confidence rules in the form of “cond_set
=> y”, where y is a class label

o

CAEP (Classification by aggregating emerging patterns) (Dong et al’99)



Emerging patterns (EPs): the itemsets whose support increases
significantly from one class to another



Mine EPs based on minimum support and growth rate



Other Classification Methods



k-nearest neighbor classifier



case-based reasoning



Genetic algorithm



Rough set approach



Fuzzy set approaches



Instance-Based Methods



Instance-based learning:

o

Store training examples and delay the processing (“lazy evaluation”) until a new
instance must be classified



Typical approaches

o

k-nearest neighbor approach



Instances represented as points in a Euclidean space.

o

Locally weighted regression



Constructs local approximation

o

Case-based reasoning



Uses symbolic representations and knowledge-based inference



The k-Nearest Neighbor Algorithm



All instances correspond to points in the n-D space.



The nearest neighbors are defined in terms of Euclidean distance.



The target function could be discrete- or real-valued.



For discrete-valued, the k-NN returns the most common value among the k training
examples nearest to xq.



Voronoi diagram: the decision surface induced by 1-NN for a typical set of training
examples.
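For the discrete-valued case, the k-NN decision can be sketched in a few lines. This is a minimal illustration that assumes training examples are (point, label) pairs with numeric coordinates; the name `knn_classify` is hypothetical, and ties are broken by whichever label `Counter` encounters first.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_classify(training, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    nearest = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training set with two classes.
training = [
    ((1.0, 1.0), "x"),
    ((1.2, 0.8), "x"),
    ((0.9, 1.1), "x"),
    ((5.0, 5.0), "o"),
    ((5.5, 4.5), "o"),
]
```

All the work happens at query time: nothing is built in advance, which is what makes the method "lazy".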



Case-Based Reasoning



Also uses: lazy evaluation + analyze similar instances



Difference: Instances are not “points in a Euclidean space”



Example: Water faucet problem in CADET (Sycara et al’92)



Methodology

o

Instances represented by rich symbolic descriptions (e.g., function graphs)

o

Multiple retrieved cases may be combined

o

Tight coupling between case retrieval, knowledge-based reasoning, and problem
solving



Research issues

o

Indexing based on a syntactic similarity measure and, on failure, backtracking
and adapting to additional cases



Remarks on Lazy vs. Eager Learning



Instance-based learning: lazy evaluation



Decision-tree and Bayesian classification: eager evaluation



Key differences

o

Lazy method may consider query instance xq when deciding how to generalize
beyond the training data D

o

Eager method cannot, since it has already chosen a global approximation by the time
it sees the query



Efficiency: Lazy methods spend less time training but more time predicting



Accuracy

o

Lazy method effectively uses a richer hypothesis space since it uses many local
linear functions to form its implicit global approximation to the target function

o

Eager: must commit to a single hypothesis that covers the entire instance space



Genetic Algorithms



GA: based on an analogy to biological evolution



Each rule is represented by a string of bits



An initial population is created consisting of randomly generated rules

o

e.g., IF A1 and Not A2 then C2 can be encoded as 100



Based on the notion of survival of the fittest, a new population is formed to consist of
the fittest rules and their offspring



The fitness of a rule is represented by its classification accuracy on a set of training
examples



Offspring are generated by crossover and mutation



Rough Set Approach



Rough sets are used to approximately or “roughly” define equivalence classes



A rough set for a given class C is approximated by two sets: a lower approximation
(certain to be in C) and an upper approximation (cannot be described as not belonging
to C)



Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but
a discernibility matrix is used to reduce the computation intensity




Fuzzy Set Approaches



Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of
membership (such as using fuzzy membership graph)



Attribute values are
converted to fuzzy values

o

e.g., income is mapped into the discrete categories {low, medium, high} with
fuzzy values calculated



For a given new sample, more than one fuzzy value may apply



Each applicable rule contributes a vote for membership in the categories



Typically, the truth values for each predicted category are summed



What Is Prediction?



Prediction is similar to classification

o

First, construct a model

o

Second, use model to predict unknown value



Major method for prediction is regression

o

Linear and multiple regression

o

Non-linear regression



Prediction is different from classification

o

Classification refers to predicting categorical class labels

o

Prediction models continuous-valued functions



Regression Analysis and Log-Linear Models in Prediction



Linear regression: Y = α + β X

o

Two parameters, α and β, specify the line and are to be estimated by using the
data at hand.

o

using the least squares criterion to the known values of Y1, Y2, …, X1, X2, ….
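For a single predictor, the least squares estimates have a closed form. The sketch below assumes the two line parameters are the intercept and the slope, called `alpha` and `beta` here; the function name `fit_line` is an illustrative choice.

```python
def fit_line(xs, ys):
    """Least squares estimates (alpha, beta) for the line Y = alpha + beta * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    beta = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    # Intercept: the fitted line passes through the point of means.
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Hypothetical data lying exactly on Y = 1 + 2X.
alpha, beta = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

Predicting an unknown value is then a matter of evaluating `alpha + beta * x` for a new `x`.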



Multiple regression: Y = b0 + b1 X1 + b2 X2.

o

Many nonlinear functions can be transformed into the above.



Log-linear models:

o

The multi-way table of joint probabilities is approximated by a product of lower-order
tables.

o

Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd



Locally Weighted Regression



Construct an explicit approximation to f over a local region surrounding query instance xq.



Locally weighted linear regression:

o

The target function f is approximated near xq using a linear function of the
attribute values a_1(x), …, a_n(x) with weights w_0, …, w_n

o

minimize the squared error over the k nearest neighbors of xq, weighted by a
distance-decreasing kernel K

o

the gradient descent training rule updates each weight w_j in proportion to the
kernel-weighted error



In most cases, the target function is approximated by a constant, linear, or quadratic
function.



Classification Accuracy: Estimating Error Rates



Partition: Training-and-testing

o

use two independent data sets, e.g., training set (2/3), test set (1/3)

o

used for data set with large number of samples



Cross-validation

o

divide the data set into k subsamples

o

use k-1 subsamples as training data and one subsample as test data (k-fold
cross-validation)

o

for data set with moderate size
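The k-fold procedure can be sketched as an index generator. This is a minimal illustration, assuming samples are addressed by position and assigned to folds round-robin (real data is usually shuffled first); the name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    # Fold i holds the indices i, i + k, i + 2k, ...
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        # Train on the other k-1 folds, test on the held-out fold.
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


splits = list(k_fold_splits(10, 5))
```

Each sample is used for testing exactly once, and the k accuracy estimates are typically averaged.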



Bootstrapping or leave-one-out

o

for small size data




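Locally weighted linear regression formulas (f is the target function, xq the query instance):

Linear approximation:

$$\hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)$$

Squared error over the k nearest neighbors of xq, weighted by the kernel K:

$$E(x_q) = \frac{1}{2} \sum_{x \in k \text{ nearest neighbors of } x_q} \left(f(x) - \hat{f}(x)\right)^2 K(d(x_q, x))$$

Gradient descent training rule (η is the learning rate):

$$\Delta w_j = \eta \sum_{x \in k \text{ nearest neighbors of } x_q} K(d(x_q, x)) \left(f(x) - \hat{f}(x)\right) a_j(x)$$



Bagging and Boosting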






Bagging



Given a set S of s samples



Generate a bootstrap sample T from S. Cases in S may not appear in T or may appear
more than once.



Repeat this sampling procedure, getting a sequence of k independent training sets



A corresponding sequence of classifiers C1,C2,…,Ck is constructed for each of these
training sets, by using the same classification algorithm



To classify an unknown sample X, let each classifier predict or vote



The Bagged Classifier C* counts the votes and assigns X to the class with the “most”
votes
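The bagging procedure above can be sketched as three small functions. This is an illustration under stated assumptions: `learn` stands for any classification algorithm that takes a training set and returns a callable classifier, and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(samples, rng):
    """Draw len(samples) cases with replacement; some cases repeat, some are left out."""
    return [rng.choice(samples) for _ in samples]


def bagging_fit(samples, learn, k, seed=0):
    """Build k classifiers C1..Ck, each from its own bootstrap sample of S."""
    rng = random.Random(seed)
    return [learn(bootstrap_sample(samples, rng)) for _ in range(k)]


def bagging_predict(classifiers, x):
    """The bagged classifier C*: every classifier votes, the most frequent class wins."""
    votes = Counter(classifier(x) for classifier in classifiers)
    return votes.most_common(1)[0][0]
```

Because the same `learn` function is applied to every bootstrap sample, any differences between the k classifiers come only from the resampling.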



Boosting Technique


Algorithm



Assign every example an equal weight 1/N



For t = 1, 2, …, T Do

o

Obtain a hypothesis (classifier) h(t) under w(t)

o

Calculate the error of h(t) and re-weight the examples based on the error. Each
classifier is dependent on the previous ones. Samples that are incorrectly
predicted are weighted more heavily

o

Normalize w(t+1) to sum to 1 (weights assigned to different classifiers sum to 1)



Output a weighted sum of all the hypotheses, with each hypothesis weighted according
to its accuracy on the training set
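One concrete instance of this scheme is AdaBoost. The sketch below uses the AdaBoost weight update with one-dimensional threshold classifiers ("stumps") as the base hypotheses. The stump learner, the labels +1/-1, the error clipping constant, and all function names are assumptions made for this example rather than details given in the notes.

```python
from math import exp, log


def stump_predict(threshold, sign, x):
    """Threshold classifier: predicts `sign` when x >= threshold, else -sign."""
    return sign if x >= threshold else -sign


def best_stump(xs, ys, weights):
    """Return (weighted error, threshold, sign) of the lowest-error stump."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(threshold, sign, x) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1 / n] * n  # every example starts with equal weight 1/N
    ensemble = []
    for _ in range(rounds):
        err, threshold, sign = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * log((1 - err) / err)     # more accurate stump, larger weight
        ensemble.append((alpha, threshold, sign))
        # Misclassified examples (y * prediction = -1) get heavier weights.
        weights = [w * exp(-alpha * y * stump_predict(threshold, sign, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]  # normalize to sum to 1
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted sum of all stump predictions."""
    total = sum(alpha * stump_predict(t, s, x) for alpha, t, s in ensemble)
    return 1 if total >= 0 else -1
```

Each round depends on the weights left by the previous round, which is the sense in which every classifier depends on the ones before it.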



Bagging and Boosting



Experiments with a new boosting algorithm, Freund et al. (AdaBoost)



Bagging Predictors, Breiman



Boosting Naïve Bayesian Learning on large subset of MEDLINE, W. Wilbur