4. Classification and Prediction (6 hrs)




4.1 What is classification? What is prediction?

4.2 Issues regarding classification and prediction

4.3 Classification by decision tree induction

4.4 Bayesian classification

4.5 Classification by back propagation

4.6 Support Vector Machines (SVM)

4.7 Prediction

4.8 Accuracy and error measures

4.9 Model selection

4.10 Summary

Key Points

Definitions; Bayesian Classification; Decision Trees



Notes: More details on related algorithms.

Q&A:

1.

What is the difference between classification and prediction?

Answer:

Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. In prediction, rather than predicting class labels, the main interest (usually) is missing or unavailable data values. (Han & Kamber)

So, although classification is actually the step of finding the models, the goal of both methods is to predict something about unknown data objects. The difference is that in classification that something is the class of objects, whereas in prediction it is the missing data values.

2.

Briefly outline the major steps of decision tree classification.


Answer: The major steps are as follows:


• The tree starts as a single root node containing all of the training tuples.

• If the tuples are all from the same class, then the node becomes a leaf, labeled with that class.

• Else, an attribute selection method is called to determine the splitting criterion. Such a method may use a heuristic or statistical measure (e.g., information gain or the Gini index) to select the “best” way to separate the tuples into individual classes. The splitting criterion consists of a splitting attribute and may also indicate either a split-point or a splitting subset, as described below.

• Next, the node is labeled with the splitting criterion, which serves as a test at the node. A branch is grown from the node to each of the outcomes of the splitting criterion, and the tuples are partitioned accordingly.

• The algorithm recurses to create a decision tree for the tuples at each partition (a minimal sketch of this recursion follows the list).
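
To make the recursion concrete, here is a minimal sketch, assuming small in-memory data with categorical attributes and information gain as the attribute selection measure; the toy data and helper names are invented and this is not the exact algorithm from the text:

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(tuples, labels, attr):
    """Expected reduction in entropy when splitting on attribute attr."""
    base, remainder = entropy(labels), 0.0
    for value in set(t[attr] for t in tuples):
        subset = [lab for t, lab in zip(tuples, labels) if t[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

def grow_tree(tuples, labels, attrs):
    """Recursively grow a tree; returns a class label (leaf) or a dict
    {attribute: {value: subtree}} (internal node)."""
    # Leaf: all tuples belong to one class, or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Attribute selection: pick the attribute with the highest information gain.
    best = max(attrs, key=lambda a: info_gain(tuples, labels, a))
    node = {best: {}}
    # Grow one branch per outcome and partition the tuples accordingly.
    for value in set(t[best] for t in tuples):
        part = [(t, lab) for t, lab in zip(tuples, labels) if t[best] == value]
        sub_tuples, sub_labels = zip(*part)
        node[best][value] = grow_tree(list(sub_tuples), list(sub_labels),
                                      [a for a in attrs if a != best])
    return node

# Hypothetical toy data: each tuple is a dict of categorical attribute values.
data = [{"outlook": "sunny", "windy": "no"}, {"outlook": "rain", "windy": "yes"},
        {"outlook": "sunny", "windy": "yes"}, {"outlook": "rain", "windy": "no"}]
classes = ["play", "stay", "stay", "play"]
print(grow_tree(data, classes, ["outlook", "windy"]))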

3.

Why is tree pruning useful in decision tree induction? What is
a drawback of using a separate set of tuples to evaluate pruning?

Answer:


The decision tree built may overfit the training data. There could be too many branches, some of which may reflect anomalies in the training data due to noise or outliers. Tree pruning addresses this issue of overfitting the data by removing the least reliable branches (using statistical measures). This generally results in a more compact and reliable decision tree that is faster and more accurate in its classification of data.

The drawback of using a separate set of tuples to evaluate pruning is that it may not be representative of the training tuples used to create the original decision tree. If the separate set of tuples is skewed, then using it to evaluate the pruned tree would not be a good indicator of the pruned tree’s classification accuracy. Furthermore, using a separate set of tuples to evaluate pruning means there are fewer tuples to use for creation and testing of the tree. While this is considered a drawback in machine learning, it may not be so in data mining due to the availability of larger data sets.
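
One hedged way to see this in practice: the sketch below uses scikit-learn's cost-complexity pruning (a different pruning mechanism than the statistical measures mentioned above) and evaluates each candidate pruned tree on a separate validation set; the data set and split sizes are arbitrary.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# Hold out a separate set of tuples purely for evaluating pruning.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths from cost-complexity pruning on the training data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)  # accuracy on the separate validation tuples
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"chosen ccp_alpha={best_alpha:.5f}, validation accuracy={best_score:.3f}")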

4.

Given a decision tree, you have the option of (a) converting the decision tree to rules and then pruning the resulting rules, or (b) pruning the decision tree and then converting the pruned tree to rules. What advantage does (a) have over (b)?

Answer:

If pruning a subtree, we would remove the subtree completely with method
(b). However, with method (a), if pruning a rule, we may remove any
precondition of it. The latter is less restrictive.

5.

It is important to calculate the worst-case computational complexity of the decision tree algorithm. Given data set D, the number of attributes n, and the number of training tuples |D|, show that the computational cost of growing a tree is at most n × |D| × log(|D|).


Answer:


The worst-case scenario occurs when we have to use as many attributes as possible before being able to classify each group of tuples. The maximum depth of the tree is log(|D|). At each level we will have to compute the attribute selection measure O(n) times (one per attribute). The total number of tuples on each level is |D| (adding over all the partitions). Thus, the computation per level of the tree is O(n × |D|). Summing over all of the levels, we obtain O(n × |D| × log(|D|)).
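
The same argument in one line (a sketch, assuming tree depth at most log(|D|) and O(n) attribute-selection work over the |D| tuples on each level), written in LaTeX:

\[
\text{cost} \;\le\; \underbrace{O(n \times |D|)}_{\text{work per level}} \times \underbrace{\log(|D|)}_{\text{number of levels}} \;=\; O\!\big(n \times |D| \times \log(|D|)\big).
\]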

6.

Given a 5 GB data set with 50 attributes (each containing 100 distinct values) and 512 MB of main memory in your laptop, outline an efficient method that constructs decision trees in such large data sets. Justify your answer by a rough calculation of your main memory usage.

Answer:

We will use the RainForest algorithm for this problem. Assume there are C class labels. The most memory required will be for the AVC-set for the root of the tree. To compute the AVC-set for the root node, we scan the database once and construct the AVC-list for each of the 50 attributes. The size of each AVC-list is 100 × C. The total size of the AVC-set is then 100 × C × 50, which will easily fit into 512 MB of memory for a reasonable C. The computation of the other AVC-sets is done in a similar way, but they will be smaller because there will be fewer attributes available. To reduce the number of scans, we can compute the AVC-sets for nodes at the same level of the tree in parallel. With such small AVC-sets per node, we can probably fit the level in memory.
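
To make the rough calculation explicit, here is a quick back-of-the-envelope check, assuming an illustrative C = 10 class labels and 8-byte counters (neither figure is given in the question):

# Rough size of the root AVC-set for the RainForest argument above.
attributes = 50          # attributes in the data set
distinct_values = 100    # distinct values per attribute
C = 10                   # assumed number of class labels (illustrative)
bytes_per_counter = 8    # assumed 64-bit count per (value, class) cell

avc_list_bytes = distinct_values * C * bytes_per_counter   # one attribute's AVC-list
avc_set_bytes = avc_list_bytes * attributes                # whole root AVC-set
print(f"root AVC-set ~ {avc_set_bytes / 2**20:.2f} MB")     # ~0.38 MB, far below 512 MB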

7.

Compare the advantages and disadvantages of eager classification (e.g., decision tree, Bayesian, neural network) versus lazy classification (e.g., k-nearest neighbor, case-based reasoning).

Answer:

Eager classification is faster at classification than lazy classification because it constructs a generalization model before receiving any new tuples to classify. Weights can be assigned to attributes, which can improve classification accuracy. Disadvantages of eager classification are that it must commit to a single hypothesis that covers the entire instance space, which can decrease classification accuracy, and that more time is needed for training.

Lazy classification uses a richer hypothesis space, which can improve classification accuracy. It requires less time for training than eager classification. A disadvantage of lazy classification is that all training tuples need to be stored, which leads to expensive storage costs and requires efficient indexing techniques. Another disadvantage is that it is slower at classification because classifiers are not built until new tuples need to be classified. Furthermore, attributes are all equally weighted, which can decrease classification accuracy. (Problems may arise due to irrelevant attributes in the data.)
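
As a small illustration of the eager/lazy trade-off, the sketch below times a scikit-learn decision tree (eager) against k-nearest neighbors (lazy) on synthetic data; the data set size is arbitrary and the exact timings will vary by machine.

import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_new = X[:500]  # tuples to classify later

for name, clf in [("eager (decision tree)", DecisionTreeClassifier(random_state=0)),
                  ("lazy (k-NN)", KNeighborsClassifier(n_neighbors=5))]:
    t0 = time.perf_counter(); clf.fit(X, y); t_fit = time.perf_counter() - t0
    t0 = time.perf_counter(); clf.predict(X_new); t_pred = time.perf_counter() - t0
    # The eager learner pays at fit time; the lazy learner defers work to prediction.
    print(f"{name}: fit {t_fit:.3f}s, predict {t_pred:.3f}s")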

8.

What is association-based classification? Why is association-based classification able to achieve higher classification accuracy than a classical decision-tree method? Explain how association-based classification can be used for text document classification.


Answer:


Association-based classification is a method where association rules are generated and analyzed for use in classification. We first search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels. Using such strong associations, we classify new examples. Association-based classification can achieve higher accuracy than a classical decision tree because it overcomes the constraint of decision trees, which consider only one attribute at a time, and uses very high confidence rules that combine multiple attributes.

For text document classification, we can model each document as a transaction containing items that correspond to terms. (We can preprocess the data to do stemming and remove stop words.) We also add the document class to the transaction. We then find frequent patterns and output rules of the form term1, term2, ..., termk → classi [sup = 0.1, conf = 0.9]. When a new document arrives for classification, we can apply the rule with the highest support and confidence that matches the document, or apply a combination of rules as in CMAR.
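
As a toy sketch of the text-classification idea (not CMAR itself), the code below treats each document as a transaction of terms plus its class, mines single- and two-term rules with illustrative support and confidence thresholds, and classifies a new document with the best matching rule; all data and thresholds are invented.

from collections import Counter
from itertools import combinations

# Toy "transactions": each document is a set of terms plus a class label.
docs = [({"goal", "match", "team"}, "sports"), ({"team", "score", "win"}, "sports"),
        ({"stock", "market", "price"}, "finance"), ({"market", "bank", "price"}, "finance"),
        ({"goal", "win", "team"}, "sports"), ({"bank", "stock", "rate"}, "finance")]
MIN_SUP, MIN_CONF = 2, 0.8  # illustrative thresholds (absolute support count)

# Count patterns of up to two terms and their co-occurrence with each class.
pattern_count, rule_count = Counter(), Counter()
for terms, label in docs:
    for k in (1, 2):
        for pat in combinations(sorted(terms), k):
            pattern_count[pat] += 1
            rule_count[(pat, label)] += 1

# Keep rules {term set} -> class that meet minimum support and confidence.
rules = []
for (pat, label), cnt in rule_count.items():
    conf = cnt / pattern_count[pat]
    if pattern_count[pat] >= MIN_SUP and conf >= MIN_CONF:
        rules.append((pat, label, cnt, conf))

# Classify a new document with the matching rule of highest (confidence, support).
new_doc = {"team", "goal", "season"}
matching = [r for r in rules if set(r[0]) <= new_doc]
best = max(matching, key=lambda r: (r[3], r[2]))
print(f"rule {best[0]} -> {best[1]} (sup={best[2]}, conf={best[3]:.2f})")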


9.

The support vector machine (SVM) is a highly accurate classification method. However, SVM classifiers suffer from slow processing when training with a large set of data tuples. Discuss how to overcome this difficulty and develop a scalable SVM algorithm for efficient SVM classification in large datasets.


Answer:


We can use the micro-clustering technique in “Classifying large data sets using SVM with hierarchical clusters” by Yu, Yang, and Han, in Proc. 2003 ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD’03), pages 306-315, Aug. 2003 [YYH03].

A Cluster-Based SVM (CB-SVM) method is described as follows:

1. Construct the microclusters using a CF-Tree (Chapter 7).

2. Train an SVM on the centroids of the microclusters.

3. Decluster entries near the boundary.

4. Repeat the SVM training with the additional entries.

5. Repeat the above until convergence. (A rough sketch of this loop is given after the list.)
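
A rough, simplified sketch of that loop is shown below. It is not the authors' CB-SVM implementation: scikit-learn's Birch stands in for the CF-Tree, only one round of declustering and retraining is shown instead of iterating to convergence, and the data, thresholds, and margin cut-off are illustrative.

import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Step 1: build microclusters per class with Birch (a CF-tree-based clusterer).
centroids, centroid_labels, members = [], [], []
for cls in np.unique(y):
    Xc = X[y == cls]
    centers = Birch(threshold=1.5, n_clusters=None).fit(Xc).subcluster_centers_
    nearest = np.argmin(((Xc[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    for i, center in enumerate(centers):
        centroids.append(center)
        centroid_labels.append(cls)
        members.append(Xc[nearest == i])          # points summarized by this microcluster
centroids, centroid_labels = np.array(centroids), np.array(centroid_labels)

# Step 2: train an SVM on the microcluster centroids only.
svm = SVC(kernel="linear").fit(centroids, centroid_labels)

# Step 3: decluster microclusters whose centroids lie near the decision boundary.
near = np.abs(svm.decision_function(centroids)) < 2.0    # illustrative margin threshold
boundary = [(m, lab) for m, lab, n in zip(members, centroid_labels, near) if n and len(m)]

# Step 4: retrain with the centroids plus the declustered boundary points
# (a full CB-SVM would repeat steps 3-4 until convergence).
if boundary:
    X_extra = np.vstack([m for m, _ in boundary])
    y_extra = np.concatenate([np.full(len(m), lab) for m, lab in boundary])
    svm = SVC(kernel="linear").fit(np.vstack([centroids, X_extra]),
                                   np.concatenate([centroid_labels, y_extra]))
print("SVM trained on", len(centroids) + (len(X_extra) if boundary else 0), "of", len(X), "tuples")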

10.

What is boosting? State why it may improve the accuracy of decision tree induction.


Answer:


Boosting is a technique used to help improve classifier accuracy. We are given a set S of s tuples. For iteration t, where t = 1, 2, ..., T, a training set St is sampled with replacement from S. Assign weights to the tuples within that training set. Create a classifier, Ct, from St. After Ct is created, update the weights of the tuples so that the tuples causing classification error will have a greater probability of being selected for the next classifier constructed. This will help improve the accuracy of the next classifier, Ct+1. Using this technique, each classifier should have greater accuracy than its predecessor. The final boosting classifier combines the votes of each individual classifier, where the weight of each classifier’s vote is a function of its accuracy.
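
The sketch below implements that procedure in a simplified, AdaBoost-style form with decision stumps as the base classifiers: tuples are resampled with replacement according to their weights, the weights of misclassified tuples are increased, and the final prediction is an accuracy-weighted vote. It is illustrative rather than the exact scheme in the text, and the data and number of rounds are arbitrary.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
n, T = len(X), 10
weights = np.full(n, 1.0 / n)          # start with equal tuple weights
classifiers, alphas = [], []
rng = np.random.default_rng(0)

for t in range(T):
    # Sample the training set St with replacement, according to the tuple weights.
    idx = rng.choice(n, size=n, replace=True, p=weights)
    stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X[idx], y[idx])
    pred = stump.predict(X)
    err = np.sum(weights[pred != y])   # weighted error of classifier Ct
    if err == 0 or err >= 0.5:         # skip degenerate rounds in this toy sketch
        continue
    alpha = 0.5 * np.log((1 - err) / err)   # vote weight: a function of accuracy
    # Increase the weights of misclassified tuples so they are more likely
    # to be selected when building the next classifier C(t+1).
    weights *= np.exp(alpha * (pred != y) - alpha * (pred == y))
    weights /= weights.sum()
    classifiers.append(stump)
    alphas.append(alpha)

# Final classifier: accuracy-weighted vote of the individual classifiers.
votes = sum(a * np.where(c.predict(X) == 1, 1, -1) for a, c in zip(alphas, classifiers))
ensemble_pred = (votes > 0).astype(int)
print("ensemble training accuracy:", np.mean(ensemble_pred == y))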