Lab 1: Decision Trees and Decision Rules


Evgueni N. Smirnov

smirnov@cs.unimaas.nl


August 21, 2013




1. Introduction

Given a data-mining problem, you need data that represent the problem, models that are suitable for the data, and, of course, a data-mining environment that contains algorithms capable of learning these models. In this lab you will study two well-known classification problems. You will try to find classification models for these problems using decision trees and decision rules. The algorithms for learning these models are provided in Weka, a data-mining environment that accompanies our course. You will study the Explorer part of Weka to learn how to call decision-tree and decision-rule algorithms, how to evaluate the accuracy of the learned models, and how to use reduced error pruning.


2. Concept-Learning Problems

In this lab you are expected to build classification models for two classification problems:

- the labor-negotiation problem;
- the soybean classification problem.

The data files for both problems are provided in the directory:

https://project.dke.maastrichtuniversity.nl/datamining/UCI/datasets-UCI.zip


3. Environment

As stated above, to build the desired classification models you will use Weka. Weka is a data-mining environment that contains a collection of machine-learning algorithms for solving real-world data-mining problems. The algorithms can either be applied directly or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine-learning schemes. Weka is open-source software issued under the GNU General Public License.
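As a small illustration of the second route, the sketch below, assuming a standard Weka 3 installation on the classpath, loads a data file and builds a single classifier from Java. The file name "labor.arff" is only a placeholder for one of the data files from section 2.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaFromJava {
    public static void main(String[] args) throws Exception {
        // Load an ARFF file and declare the last attribute to be the class.
        Instances data = DataSource.read("labor.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // Build a decision tree with default settings and print it.
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}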


4. Algorithms

To build the classifiers you will use four learning algorithms provided in Weka:

1. ZeroR is a majority/average predictor. It assigns to each instance the classification of the majority class found in the dataset.

2. OneR is a one-level decision-tree learner. The only option of the algorithm is the bucket size, which specifies the minimum number of instances covered by each leaf of the resulting decision tree.

3. J4.8 (C4.5) is a decision-tree learner (discussed in the last lecture).

4. JRip is a decision-rule learner (discussed in the last lecture).
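For orientation, these four learners map onto the Weka classes shown in the sketch below. The sketch merely instantiates them; the value passed to setMinBucketSize is Weka's default of 6 and only illustrates where the bucket-size option of OneR lives.

import weka.classifiers.rules.JRip;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;

public class LabLearners {
    public static void main(String[] args) {
        ZeroR zeroR = new ZeroR();   // majority/average predictor, no options
        OneR oneR = new OneR();      // one-level decision tree
        oneR.setMinBucketSize(6);    // the "bucket size" option (6 is Weka's default)
        J48 j48 = new J48();         // C4.5 decision-tree learner
        JRip jrip = new JRip();      // RIPPER decision-rule learner

        // Print the fully qualified class name of each learner.
        for (Object c : new Object[] {zeroR, oneR, j48, jrip}) {
            System.out.println(c.getClass().getName());
        }
    }
}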







5. Lab Tasks

A. Study the classification problems (given in section 2) and their data files (especially how they are formatted in Weka).

B. Study the Explorer part of Weka.

C. Apply all four learning algorithms to the two data files using the default settings. Determine the accuracy of the resulting classifiers using the training set and 10-fold cross-validation, as illustrated in the Java sketch after task E. Is there a difference? Can you explain why? Analyze the resulting classifiers from a comprehensibility point of view.

D. Experiment with J4.8 and pre-pruning. To build a decision tree using pre-pruning, first set the option unpruned to True. Then experiment with pre-pruning by changing the option minNumObj from 1 to 15. (The option minNumObj determines the minimum number of training instances in the leaf nodes of the decision tree.) Determine the accuracy of the resulting decision trees using the training set and 10-fold cross-validation. Compare the accuracy obtained with that of J4.8 from the previous task.

E. Experiment with J4.8 and reduced error pruning¹. To build a decision tree using reduced error pruning you have to use the option reducedErrorPruning of J4.8. Experiment with reduced error pruning by changing the option numFolds from 2 to the size of the datasets. (The option numFolds determines the amount of data used for reduced-error pruning: one fold is used for pruning, the rest for growing the tree.) Determine the accuracy of the resulting decision trees using the training set and 10-fold cross-validation. Compare the accuracy obtained with that of J4.8 from the previous tasks.
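For those who prefer to script the experiments rather than click through the Explorer, the sketch below mirrors tasks C-E from Java, assuming a standard Weka 3 installation. It reports training-set accuracy next to 10-fold cross-validation accuracy and shows where the pruning options of J4.8 are set; the file name "data.arff" and the particular option values are placeholders, not prescribed settings.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LabTasks {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");    // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // Task D: pre-pruning only -- post-pruning off, vary minNumObj from 1 to 15.
        J48 prePruned = new J48();
        prePruned.setUnpruned(true);
        prePruned.setMinNumObj(5);            // placeholder value
        report("pre-pruned J4.8", prePruned, data);

        // Task E: reduced error pruning -- vary numFolds from 2 upwards.
        J48 repTree = new J48();
        repTree.setUnpruned(false);           // see footnote 1
        repTree.setReducedErrorPruning(true);
        repTree.setNumFolds(3);               // one fold prunes, the rest grow the tree
        report("REP J4.8", repTree, data);
    }

    // Task C: accuracy on the training set versus 10-fold cross-validation.
    static void report(String name, J48 tree, Instances data) throws Exception {
        tree.buildClassifier(data);
        Evaluation trainEval = new Evaluation(data);
        trainEval.evaluateModel(tree, data);

        // crossValidateModel works on internal copies, so the configured
        // options of "tree" are preserved across folds.
        Evaluation cvEval = new Evaluation(data);
        cvEval.crossValidateModel(tree, data, 10, new Random(1));

        System.out.printf("%s: training %.2f%%, 10-fold CV %.2f%%%n",
                name, trainEval.pctCorrect(), cvEval.pctCorrect());
    }
}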






¹ Please do not forget to turn the option unpruned back to False.