UNIT 4
Classification & prediction
Classification vs. Prediction
Classification:
o
predicts categorical class labels (discrete or nominal)
o
constructs a model from the training set and the values (class labels) of the classifying attribute, then uses that model to classify new data
Prediction:
o
models continuous-valued functions, i.e., predicts unknown or missing values
Typical Applications
o
credit approval
o
target marketing
o
medical diagnosis
o
treatment effectiveness analysis
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
o
Each tuple/sample is assumed to belong to a predefined class, as determined by
the class label attribute
o
The set of tuples used for model construction is the training set
o
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
o
Estimate accuracy of the model
The known label of each test sample is compared with the class predicted by the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic, because the model may have overfit the training data
o
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
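The accuracy rate used in the model-usage step can be sketched in a few lines. This is a minimal illustration, not a full evaluation harness: the function name `accuracy_rate` and the two label lists are hypothetical and stand in for the known test labels and the model's outputs.

```python
def accuracy_rate(true_labels, predicted_labels):
    """Fraction of test samples whose predicted label matches the known label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("test set is empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical test-set labels and model outputs, for illustration only.
known = ["yes", "no", "yes", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "no"]
print(accuracy_rate(known, predicted))  # 4 of the 5 test samples match
```

The samples passed in should come from a test set that was held out from model construction.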
Supervised vs. Unsupervised Learning
Supervised learning (classification)
o
Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
o
New data is classified based on the training set
Unsupervised learning (clustering)
o
The class labels of the training data are unknown
o
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues Regarding Classification and Prediction (1): Data Preparation
Data cleaning
o
Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
o
Remove the irrelevant or redundant attributes
Data transformation
o
Generalize and/or normalize data
Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
Predictive accuracy
Speed and scalability
o
time to construct the model
o
time to use the model
Robustness
o
handling noise and missing values
Scalability
o
efficiency in disk-resident databases
Interpretability:
o
understanding and insight provided by the model
Goodness of rules
o
decision tree size
o
compactness of classification rules
Training Dataset
age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
o
Tree is constructed in a top-down recursive divide-and-conquer manner
o
At start, all the training examples are at the root
o
Attributes are categorical (if continuous-valued, they are discretized in advance)
o
Examples are partitioned recursively based on selected attributes
o
Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)
Conditions for stopping partitioning
o
All samples for a given node belong to the same class
o
There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
o
There are no samples left
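The greedy top-down procedure and its stopping conditions can be sketched as follows. This is an illustrative ID3-style sketch under stated assumptions, not the exact algorithm of any particular system: the names `build_tree`, `gain`, and `entropy` are hypothetical, the tree is represented as nested `(attribute, {value: subtree})` tuples, and the age band 31 to 40 is written `31..40`. Branches are created only for values that occur in the data, so the "no samples left" case does not arise here.

```python
from collections import Counter
from math import log2

ATTRS = ["age", "income", "student", "credit_rating"]
# The buys_computer training dataset; "31..40" stands for the 31 to 40 age band.
DATA = """\
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no"""

FIELDS = [line.split() for line in DATA.splitlines()]
ROWS = [dict(zip(ATTRS, f[:4])) for f in FIELDS]
LABELS = [f[4] for f in FIELDS]


def entropy(labels):
    """Expected information (in bits) needed to classify one tuple."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning the samples on attr."""
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr], []).append(label)
    expected = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    return entropy(labels) - expected


def build_tree(rows, labels, attrs):
    """Return a class label (leaf) or a pair (attribute, {value: subtree})."""
    if len(set(labels)) == 1:        # stop: all samples belong to one class
        return labels[0]
    if not attrs:                    # stop: no attributes left, majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    rest = [a for a in attrs if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        branches[value] = build_tree([r for r, _ in pairs], [l for _, l in pairs], rest)
    return (best, branches)


TREE = build_tree(ROWS, LABELS, ATTRS)
print(TREE)
```

On this dataset the sketch splits on age at the root, then on student for the youngest group and on credit_rating for the oldest group.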
Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
S contains s_i tuples of class C_i for i = {1, …, m}; s is the total number of tuples in S
information measures info required to classify any arbitrary tuple
$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$$
entropy of attribute A with values {a_1, a_2, …, a_v}
$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})$$
information gained by branching on attribute A
$$Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)$$
Attribute Selection by Information Gain Computation
Class P: buys_computer = “yes”
Class N: buys_computer = “no”
I(p, n) = I(9, 5) = 0.940
Compute the entropy for age:
age      p_i   n_i   I(p_i, n_i)
<=30     2     3     0.971
31…40    4     0     0
>40      3     2     0.971
$$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$
o
The term (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence
$$Gain(age) = I(p, n) - E(age) = 0.940 - 0.694 = 0.246$$
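The information gain computation can be checked numerically on the training dataset. The sketch below is only an illustration: the function names `info`, `expected_info`, and `gain` are hypothetical, and the age band 31 to 40 is written `31..40`.

```python
from collections import Counter
from math import log2

ATTRS = ["age", "income", "student", "credit_rating"]
# The buys_computer training dataset; the last column is the class label.
DATA = """\
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no"""

ROWS = [line.split() for line in DATA.splitlines()]


def info(counts):
    """I(s_1, ..., s_m): expected information for the given class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def expected_info(rows, column):
    """E(A): weighted information of the partitions induced by one column."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[column], Counter())[row[-1]] += 1
    return sum(sum(c.values()) / len(rows) * info(c.values())
               for c in partitions.values())


def gain(rows, column):
    """Gain(A) = I(s_1, ..., s_m) - E(A)."""
    class_counts = Counter(row[-1] for row in rows)
    return info(class_counts.values()) - expected_info(rows, column)


GAINS = {name: gain(ROWS, i) for i, name in enumerate(ATTRS)}
for name, value in GAINS.items():
    print(f"Gain({name}) = {value:.3f}")
```

The printed values should agree with the hand computation up to rounding in the last digit. Age has the largest gain, so it would be chosen as the first test attribute.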
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
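Turning root-to-leaf paths into rules can be sketched with a short recursive walk. The tree literal below is an assumption written to agree with the training dataset table (for example, every “>40” sample with an excellent credit rating has class “no”). The `(attribute, {value: subtree})` representation and the name `extract_rules` are hypothetical.

```python
def extract_rules(tree, target, conditions=()):
    """Yield one IF-THEN rule string per root-to-leaf path."""
    if not isinstance(tree, tuple):          # leaf: holds the class prediction
        antecedent = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        yield f'IF {antecedent} THEN {target} = "{tree}"'
        return
    attribute, branches = tree
    for value, subtree in branches.items():
        # Each attribute-value pair on the path adds one conjunct.
        yield from extract_rules(subtree, target, conditions + ((attribute, value),))


# Hypothetical tree literal consistent with the training dataset.
TREE = ("age", {
    "<=30": ("student", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40": ("credit_rating", {"excellent": "no", "fair": "yes"}),
})

RULES = list(extract_rules(TREE, "buys_computer"))
for rule in RULES:
    print(rule)
```

The number of rules equals the number of leaves, which is five for this tree.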
Avoid Overfitting in Classification
Overfitting: An induced tree may overfit the training data
o
Too many branches, some may reflect anomalies due to noise or outliers
o
Poor accuracy for unseen samples
Two approaches to avoid overfitting
o
Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
o
Postpruning: Remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
Use a set of data different from the training data to decide which is the “best pruned tree”
Enhancements to basic decision tree induction
Allow for continuous-valued attributes
o
Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
Handle missing attribute values
o
Assign the most common value of the attribute
o
Assign probability to each of the possible values
Attribute construction
o
Create new attributes based on existing ones that are sparsely represented
o
This reduces fragmentation, repetition, and replication
Information gain of each attribute on the buys_computer training dataset
$$Gain(age) = I(p, n) - E(age) = 0.246$$
$$Gain(income) = 0.029$$
$$Gain(student) = 0.151$$
$$Gain(credit\_rating) = 0.048$$
Classification in Large Databases
Classification: a classical problem extensively studied by statisticians and machine learning researchers
Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed
Why decision tree induction in data mining?
o
relatively faster learning speed (than other classification methods)
o
convertible to simple and easy to understand classification rules
o
can use SQL queries for accessing databases
o
comparable classification accuracy with other methods
Scalable Decision Tree Induction Methods in Data Mining Studies
SLIQ (EDBT’96, Mehta et al.)
o
builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (VLDB’96, J. Shafer et al.)
o
constructs an attribute list data structure
PUBLIC (VLDB’98, Rastogi & Shim)
o
integrates tree splitting and tree pruning: stop growing the tree earlier
RainForest (VLDB’98, Gehrke, Ramakrishnan & Ganti)
o
separates the scalability aspects from the criteria that determine the quality of the tree
o
builds an AVC-list (attribute, value, class label)
Data Cube-Based Decision-Tree Induction
Integration of generalization with decision-tree induction (Kamber et al’97).
Classification at primitive concept levels
o
E.g., precise temperature, humidity, outlook, etc.
o
Low-level concepts, scattered classes, bushy classification-trees
o
Semantic interpretation problems.
Cube-based multi-level classification
o
Relevance analysis at multi-levels.
o
Information-gain analysis with dimension + level.
Bayesian Classification: Why?
Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem: Basics
Let X be a data sample whose class label is unknown
Let H be a hypothesis that X belongs to class C
For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
P(H): prior probability of hypothesis H (i.e., the initial probability before we observe any data; reflects the background knowledge)
P(X): probability that sample data is observed
P(X|H): probability of observing the sample X, given that the hypothesis holds
Bayesian Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows the Bayes theorem
Informally, this can be written as
posterior = likelihood x prior / evidence
MAP (maximum a posteriori) hypothesis: the hypothesis with the highest posterior probability
Practical difficulty: requires initial knowledge of many probabilities, significant computational cost
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally independent:
The probability of jointly observing, say, 2 elements y_1 and y_2, given the current class is C, is the product of the probabilities of each element taken separately, given the same class
P([y_1, y_2] | C) = P(y_1 | C) * P(y_2 | C)
No dependence relation between attributes
Greatly reduces the computation cost: only count the class distribution.
Once the probability P(X|C_i) is known, assign X to the class with maximum P(X|C_i) * P(C_i)
Naïve Bayesian Classifier: Example
Formulas used: Bayes theorem, the MAP hypothesis, and the naïve independence assumption
$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$
$$h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h)\,P(h)$$
$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$$
Compute P(X|Ci) for each class
P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222
P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6
P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667
P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2
P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667
P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4
X = (age<=30, income=medium, student=yes, credit_rating=fair)
P(X|Ci):
o
P(X | buys_computer=“yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
o
P(X | buys_computer=“no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) * P(Ci):
o
P(X | buys_computer=“yes”) * P(buys_computer=“yes”) = 0.044 x 9/14 = 0.028
o
P(X | buys_computer=“no”) * P(buys_computer=“no”) = 0.019 x 5/14 = 0.007
X belongs to class “buys_computer=yes”
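The worked example can be reproduced with a few lines of counting. This is a minimal sketch under stated assumptions: probabilities are plain relative frequencies with no smoothing, the function name `naive_bayes_scores` is hypothetical, and the age band 31 to 40 is written `31..40`.

```python
from collections import Counter

# The buys_computer training dataset: age, income, student, credit_rating, class.
DATA = """\
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no"""

ROWS = [line.split() for line in DATA.splitlines()]


def naive_bayes_scores(rows, query):
    """Return {class: P(X|C) * P(C)} using relative-frequency estimates."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, count in class_counts.items():
        score = count / len(rows)                      # prior P(C)
        for column, value in enumerate(query):
            matches = sum(1 for row in rows
                          if row[-1] == label and row[column] == value)
            score *= matches / count                   # P(x_k | C)
        scores[label] = score
    return scores


X = ["<=30", "medium", "yes", "fair"]
SCORES = naive_bayes_scores(ROWS, X)
PREDICTED = max(SCORES, key=SCORES.get)
print(SCORES, PREDICTED)
```

Because P(X) is the same for every class, comparing P(X|Ci) * P(Ci) is enough to pick the most probable class. A practical implementation would usually add smoothing, so that a zero count for one attribute value does not wipe out the whole product.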
Naïve Bayesian Classifier: Comments
Advantages :
o
Easy to implement
o
Good results obtained in most of the cases
Disadvantages
o
Assumption: class conditional independence, therefore loss of accuracy
o
Practically, dependencies exist among variables
o
E.g., hospital patients: Profile (age, family history, etc.), Symptoms (fever, cough, etc.), Disease (lung cancer, diabetes, etc.)
o
Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
How to deal with these dependencies?
o
Bayesian Belief Networks
Classification
Classification:
o
predicts categorical class labels
Typical Applications
o
{credit history, salary} -> credit approval (Yes/No)
o
{Temp, Humidity} -> Rain (Yes/No)
Binary Classification problem
The data above the red line belongs to class ‘x’
The data below the red line belongs to class ‘o’
Examples: SVM, Perceptron, Probabilistic Classifiers
Discriminative Classifiers
$$x \in X = \{0,1\}^n, \quad y \in Y = \{0,1\}, \quad h: X \to Y, \quad y = h(x)$$
Advantages
o
prediction accuracy is generally high
(as compared to Bayesian methods, in general)
o
robust, works when training examples contain errors
o
fast evaluation of the learned target function
(Bayesian networks are normally slow)
Criticism
o
long training time
o
difficult to understand the learned function (weights)
(Bayesian networks can be used easily for pattern discovery)
o
not easy to incorporate domain knowledge
(easy in the form of priors on the data or distributions)
Neural Networks
Analogy to Biological Systems (indeed a great example of a good learning system)
Massive Parallelism allowing for computational efficiency
The first learning algorithm came in 1959 (Rosenblatt), who suggested that if a target output value is provided for a single neuron with fixed inputs, one can incrementally change weights to learn to produce these outputs using the perceptron learning rule
The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping
For example, with weights w_i and bias term \mu_k of unit k:
$$y = \mathrm{sign}\left(\sum_{i=0}^{n} w_i x_i + \mu_k\right)$$
Network Training
The ultimate objective of training
o
obtain a set of weights that makes almost all the tuples in the training data classified correctly
Steps
o
Initialize weights with random values
o
Feed the input tuples into the network one by one
o
For each unit
Compute the net input to the unit as a linear combination of all the inputs to the unit
Compute the output value using the activation function
Compute the error
Update the weights and the bias
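The training steps can be illustrated on the simplest case, a single unit with a sign activation trained by the perceptron learning rule. This is a sketch for one neuron only, not multi-layer backpropagation. The names `train_perceptron` and `predict`, the learning rate, the seed, and the OR-function samples are assumptions made for the example.

```python
import random


def sign(value):
    """Sign activation; zero is treated as the positive class."""
    return 1 if value >= 0 else -1


def train_perceptron(samples, learning_rate=0.1, max_epochs=200, seed=0):
    """Train one unit; samples are (inputs, target) pairs with targets in {-1, 1}."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    # Step 1: initialize weights and bias with small random values.
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(max_epochs):
        mistakes = 0
        # Step 2: feed the input tuples into the unit one by one.
        for inputs, target in samples:
            net = sum(w * x for w, x in zip(weights, inputs)) + bias  # net input
            error = target - sign(net)                                # error
            if error != 0:
                mistakes += 1
                # Update the weights and the bias in the direction of the error.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:            # every training tuple is classified correctly
            break
    return weights, bias


def predict(weights, bias, inputs):
    return sign(sum(w * x for w, x in zip(weights, inputs)) + bias)


# Hypothetical, linearly separable training data: the logical OR function.
OR_SAMPLES = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
WEIGHTS, BIAS = train_perceptron(OR_SAMPLES)
print([predict(WEIGHTS, BIAS, x) for x, _ in OR_SAMPLES])
```

A single unit can only separate classes with a hyperplane, so the loop is guaranteed to stop early only when the data are linearly separable, as they are here.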
Network Pruning and Rule Extraction
Network pruning
o
Fully connected network will be hard to articulate
o
N input nodes, h hidden nodes and m output nodes lead to h(m+N) weights
o
Pruning: Remove some of the links without affecting classification accuracy of the network
Extracting rules from a trained network
o
Discretize activation values; replace individual activation value by the cluster average maintaining the network accuracy
o
Enumerate the output from the discretized activation values to find rules between activation value and output
o
Find the relationship between the input and activation value
o
Combine the above two to have rules relating the output to input
Association-Based Classification
Several methods for association-based classification
o
ARCS: Quantitative association mining and clustering of association rules (Lent et al’97)
It beats C4.5 in (mainly) scalability and also accuracy
o
Associative classification: (Liu et al’98)
It mines high support and high confidence rules in the form of “cond_set => y”, where y is a class label
o
CAEP (Classification by aggregating emerging patterns) (Dong et al’99)
Emerging patterns (EPs): the itemsets whose support increases significantly from one class to another
Mine EPs based on minimum support and growth rate
Other Classification Methods
k-nearest neighbor classifier
case-based reasoning
Genetic algorithm
Rough set approach
Fuzzy set approaches
Instance-Based Methods
Instance-based learning:
o
Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
Typical approaches
o
k-nearest neighbor approach
Instances represented as points in a Euclidean space.
o
Locally weighted regression
Constructs local approximation
o
Case-based reasoning
Uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space.
The nearest neighbors are defined in terms of Euclidean distance.
The target function could be discrete- or real-valued.
For discrete-valued, the k-NN returns the most common value among the k training examples nearest to xq.
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
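For a discrete-valued target, the k-NN vote can be sketched as below. The function name `knn_classify` and the two-cluster training points are hypothetical. Ties between classes are broken by whichever class was counted first, which is an arbitrary choice made for this example.

```python
from collections import Counter
from math import dist


def knn_classify(training, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training examples forming two well-separated groups.
TRAINING = [
    ((1, 1), "x"), ((1, 2), "x"), ((2, 1), "x"),
    ((6, 6), "o"), ((7, 6), "o"), ((6, 7), "o"),
]

print(knn_classify(TRAINING, (2, 2)))
print(knn_classify(TRAINING, (6, 5)))
```

All the work happens at query time: nothing is learned in advance, and every call sorts the stored examples by Euclidean distance to the query point.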
Case-Based Reasoning
Also uses: lazy evaluation + analyze similar instances
Difference: Instances are not “points in a Euclidean space”
Example: Water faucet problem in CADET (Sycara et al’92)
Methodology
o
Instances represented by rich symbolic descriptions (e.g., function graphs)
o
Multiple retrieved cases may be combined
o
Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
Research issues
o
Indexing based on syntactic similarity measure and, on failure, backtracking and adapting to additional cases
Remarks on Lazy vs. Eager Learning
Instance-based learning: lazy evaluation
Decision-tree and Bayesian classification: eager evaluation
Key differences
o
Lazy method may consider query instance xq when deciding how to generalize beyond the training data D
o
Eager method cannot, since it has already chosen its global approximation before seeing the query
Efficiency: Lazy methods spend less time training but more time predicting
Accuracy
o
Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form its implicit global approximation to the target function
o
Eager: must commit to a single hypothesis that covers the entire instance space
Genetic Algorithms
GA: based on an analogy to biological evolution
Each rule is represented by a string of bits
An initial population is created consisting of randomly generated rules
o
e.g., IF A_1 and Not A_2 then C_2 can be encoded as 100
Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
The fitness of a rule is represented by its classification accuracy on a set of training examples
Offspring are generated by crossover and mutation
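The two genetic operators can be sketched on bit-string rules. Single-point crossover and independent bit-flip mutation are common choices but are assumptions here, since the notes do not say which variants are meant. The function names are hypothetical.

```python
import random


def crossover(parent_a, parent_b, point):
    """Single-point crossover: swap the tails of two equal-length bit strings."""
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])


def mutate(rule, rate, rng=random):
    """Flip each bit independently with probability rate."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in rule
    )


# "100" is the encoding of the example rule; "011" is a hypothetical second parent.
child_a, child_b = crossover("100", "011", 1)
print(child_a, child_b)
print(mutate(child_a, rate=0.1, rng=random.Random(0)))
```

A full genetic algorithm would wrap these operators in a loop that scores each rule by its classification accuracy and keeps the fittest rules for the next population.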
Rough Set Approach
Rough sets are used to approximately or “roughly” define equivalent classes
A rough set for a given class C is approximated by two sets: a lower approximation
(certain to be in C) and an upper approximation (cannot be described as not belonging
to C)
Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix is used to reduce the computation intensity
Fuzzy Set Approaches
Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of
membership (such as using fuzzy membership graph)
Attribute values are converted to fuzzy values
o
e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
For a given new sample, more than one fuzzy value may apply
Each applicable rule contributes a vote for membership in the categories
Typically, the truth values for each predicted category are summed
What Is Prediction?
Prediction is similar to classification
o
First, construct a model
o
Second, use model to predict unknown value
Major method for prediction is regression
o
Linear and multiple regression
o
Non-linear regression
Prediction is different from classification
o
Classification refers to predicting categorical class labels
o
Prediction models continuous-valued functions
Regression Analysis and Log-Linear Models in Prediction
Linear regression: Y = α + β X
o
Two parameters, α and β, specify the line and are to be estimated by using the data at hand.
o
using the least squares criterion on the known values of Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2.
o
Many nonlinear functions can be transformed into the above.
Log-linear models:
o
The multi-way table of joint probabilities is approximated by a product of lower-order tables.
o
Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
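For simple linear regression, the least squares criterion has a closed-form solution for the intercept and slope, called alpha and beta in this sketch. The function name `fit_line` and the sample points are hypothetical. The points were chosen to lie exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (alpha, beta) minimizing the squared error of Y = alpha + beta * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all X values are identical; the slope is undefined")
    beta = sxy / sxx                  # slope
    alpha = mean_y - beta * mean_x    # intercept
    return alpha, beta


# Hypothetical observations lying on the line Y = 1 + 2 X.
XS = [1, 2, 3, 4, 5]
YS = [3, 5, 7, 9, 11]
ALPHA, BETA = fit_line(XS, YS)
print(ALPHA, BETA)
```

The slope is the covariance of X and Y divided by the variance of X, and the intercept makes the line pass through the point of means.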
Locally Weighted Regression
Construct an explicit approximation to f over a local region surrounding query instance xq.
Locally weighted linear regression:
o
The target function f is approximated near xq using a linear function of the attribute values
o
minimize the squared error, weighted by a distance-decreasing weight K
o
the weights are fitted with a gradient descent training rule
In most cases, the target function is approximated by a constant, linear, or quadratic function.
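A one-dimensional version of locally weighted linear regression can be sketched as follows. Two simplifications are assumptions of this sketch and not part of the notes: every training point is weighted by a Gaussian kernel instead of restricting to the k nearest neighbors, and the weighted least squares problem is solved in closed form instead of by gradient descent. The names `gaussian_kernel`, `lwr_predict`, and `bandwidth` are hypothetical.

```python
from math import exp


def gaussian_kernel(distance, bandwidth):
    """Distance-decreasing weight K: close points count more than distant ones."""
    return exp(-(distance ** 2) / (2 * bandwidth ** 2))


def lwr_predict(xs, ys, xq, bandwidth=1.0):
    """Fit a weighted line around the query point xq and evaluate it at xq."""
    weights = [gaussian_kernel(abs(x - xq), bandwidth) for x in xs]
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * xq


# Hypothetical training data on the line f(x) = 2x + 1.
XS = [0, 1, 2, 3, 4, 5]
YS = [2 * x + 1 for x in XS]
print(lwr_predict(XS, YS, 2.5))
```

A new local model is fitted for every query, which is what makes the method lazy. A smaller bandwidth makes the fit more local, at the cost of relying on fewer effective neighbors.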
Classification Accuracy: Estimating Error Rates
Partition: Training-and-testing
o
use two independent data sets, e.g., training set (2/3), test set (1/3)
o
used for data sets with a large number of samples
Cross-validation
o
divide the data set into k subsamples
o
use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation)
o
for data sets of moderate size
Bootstrapping (leave-one-out)
o
for small size data
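The k-fold partitioning can be sketched as an index generator. The function name `k_fold_splits` is hypothetical. The sketch splits the indices in their given order, so in practice the data would usually be shuffled first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_splits(10, 3):
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold. The accuracy estimate is then the average of the k per-fold accuracy rates.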
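Formulas for locally weighted linear regression near query instance xq
o
linear approximation:
$$\hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)$$
o
squared error over the k nearest neighbors, with distance-decreasing weight K:
$$E(x_q) = \frac{1}{2} \sum_{x \in k \text{ nearest neighbors of } x_q} \left(f(x) - \hat{f}(x)\right)^2 K(d(x_q, x))$$
o
gradient descent training rule, where \eta is the learning rate:
$$\Delta w_j = \eta \sum_{x \in k \text{ nearest neighbors of } x_q} K(d(x_q, x)) \left(f(x) - \hat{f}(x)\right) a_j(x)$$
Bagging and Boosting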
Bagging
Given a set S of s samples
Generate a bootstrap sample T from S. Cases in S may not appear in T or may appear
more than once.
Repeat this sampling procedure, getting a sequence of k independent training sets
A corresponding sequence of classifiers C1,C2,…,Ck is constructed for each of these
training sets, by using the same classification algorithm
To classify an unknown sample X, let each classifier predict or vote
The Bagged Classifier C* counts the votes and assigns X to the class with the “most” votes
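The bagging procedure can be sketched with any base learner. The example below uses a 1-nearest-neighbor classifier on one-dimensional points purely for brevity. The names `bootstrap_sample`, `train_1nn`, and `bagged_classifier`, the seed, and the sample data are all assumptions made for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(samples, rng):
    """Draw len(samples) cases with replacement; some repeat, some are left out."""
    return [rng.choice(samples) for _ in samples]


def train_1nn(training):
    """Base learner: a 1-nearest-neighbor classifier over (value, label) pairs."""
    def classify(x):
        return min(training, key=lambda item: abs(item[0] - x))[1]
    return classify


def bagged_classifier(samples, k, train, seed=0):
    """Build k classifiers on bootstrap samples and combine them by voting."""
    rng = random.Random(seed)
    classifiers = [train(bootstrap_sample(samples, rng)) for _ in range(k)]

    def classify(x):
        votes = Counter(c(x) for c in classifiers)
        return votes.most_common(1)[0][0]
    return classify


# Hypothetical 1-D samples forming two well-separated classes.
S = [(1, "x"), (2, "x"), (3, "x"), (4, "x"),
     (10, "o"), (11, "o"), (12, "o"), (13, "o")]
C_STAR = bagged_classifier(S, k=25, train=train_1nn)
print(C_STAR(2.5), C_STAR(11.5))
```

The same classification algorithm is applied to every bootstrap sample, and only the training sets differ. An odd k avoids tied votes in two-class problems.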
Boosting Technique: Algorithm
Assign every example an equal weight 1/N
For t = 1, 2, …, T Do
o
Obtain a hypothesis (classifier) h(t) under w(t)
o
Calculate the error of h(t) and re-weight the examples based on the error. Each classifier is dependent on the previous ones. Samples that are incorrectly predicted are weighted more heavily
o
Normalize w(t+1) to sum to 1 (weights assigned to different classifiers sum to 1)
Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
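One concrete instance of this loop is AdaBoost with decision stumps, sketched below for one-dimensional inputs and labels in {-1, 1}. The stump learner, the function names, and the ten training points are assumptions made for the example. The weighting formulas are the standard AdaBoost ones, which the notes do not spell out.

```python
from math import exp, log


def stump_predict(threshold, polarity, x):
    """Decision stump: predict polarity when x <= threshold, otherwise -polarity."""
    return polarity if x <= threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(threshold, polarity, x) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n                      # every example starts at 1/N
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * log((1 - err) / err)       # more accurate stumps weigh more
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples are weighted more heavily for the next round.
        weights = [w * exp(-alpha * y * stump_predict(threshold, polarity, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]   # normalize to sum to 1
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted sum of all stump predictions."""
    score = sum(a * stump_predict(t, p, x) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical data that no single stump can classify perfectly.
XS = list(range(10))
YS = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ENSEMBLE = adaboost(XS, YS, rounds=3)
print([predict(ENSEMBLE, x) for x in XS])
```

Each individual stump misclassifies some of the ten points, but the weighted vote of three stumps classifies all of them correctly.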
Bagging and Boosting
Experiments with a New Boosting Algorithm, Freund et al. (AdaBoost)
Bagging Predictors, Breiman
Boosting Naïve Bayesian Learning on large subset of MEDLINE, W. Wilbur