De La Salle University
College of Computer Studies
INTROAI / Introduction to Artificial Intelligence
AY 2010-11 Term 1

Assignment 3
Empirical Analysis of Machine Learning Algorithms




Instructions:


Your overall task is to compare the learning curves [1] of a decision tree learner and a multilayer neural network learner on a published data set.


There is no need to implement the learning algorithms; you can use Weka, a suite of machine learning algorithms implemented in Java, available at http://www.cs.waikato.ac.nz/ml/weka/. Extensive documentation is also available from the said site. For this assignment, use Weka's J48 (which implements C4.5, a more robust version of ID3) and MultilayerPerceptron (which implements backpropagation for multilayer neural networks).


Get the data set from the UC Irvine Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets.html. Many datasets in the UCI repository are in C4.5 data format, a brief description of which is available at http://www.cs.washington.edu/dm/vfml/appendixes/c45.htm. Weka accepts C4.5 format as well as ARFF, Weka's own format, explained in detail at http://www.cs.waikato.ac.nz/~ml/weka/arff.html. [2]
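For reference, ARFF files follow a simple line-oriented layout. The fragment below is a minimal, hypothetical example in the style of Weka's well-known "weather" sample data; the relation, attribute names, and values are illustrative only, not taken from any assigned dataset:

```
% Minimal ARFF sketch (illustrative data only)
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,83,yes
rainy,70,yes
```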


No two groups may work on the same dataset, so obtain dataset approval first as soon as possible through raymund.sison@delasalle.ph. When you propose a dataset, specify the following: [3]


- Classification? (Y/N)
- Attribute characteristics (C/I/R, i.e., Categorical/Integer/Real):
- Data set characteristics (M/U/S/T/R, i.e., Multivariate/Univariate/Sequential/Time-series/Relational):
- # of instances:
- # of attributes:
- Missing values? (Y/N)


Report contents and grading:

- Description of the experiments (basically how the learning curves were produced) (1 point)
- Decision tree and neural network models (nodes and weights) (2 points)
- Learning curves for the decision tree and neural network learners (with the source data in Excel tables, and .csv files in the appendix) (2 points)
- Analysis of the learning curves (5 points)






[1] Assuming you are using cross-validation with N=10 folds, for each of the 10 folds you normally use 9/10 of the data for training (TRAINi, i=1..10) and 1/10 for testing (TESTi, i=1..10). To plot a learning curve, you also consider subsets of the training data. That is, for each fold you repeat the experiment by using 10%, 20%, ..., 100% of TRAINi for training, while still using the entire test set (TESTi) for testing. The x-axis of the learning curve will be the number of training instances, while the y-axis will be the percentage of test items that are correct. Sample learning curves can be found in (Russell & Norvig, 2003, p. 747).
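The procedure above can be sketched in code. The following is a minimal stand-alone illustration only (in the actual assignment, Weka performs the training and testing); the synthetic dataset and the majority-class "learner" are toy stand-ins:

```python
import random

def majority_class(train):
    """Toy learner: predict the most frequent label in the training subset."""
    labels = [label for _, label in train]
    return max(set(labels), key=labels.count)

def accuracy(prediction, test):
    """Fraction of test items whose label matches the (constant) prediction."""
    return sum(1 for _, label in test if label == prediction) / len(test)

def learning_curve(data, n_folds=10):
    """For each fold i, train on 10%, 20%, ..., 100% of TRAINi and
    evaluate each model on the full TESTi; average accuracy over folds."""
    data = list(data)
    random.shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = {tenths: [] for tenths in range(1, 11)}
    for i in range(n_folds):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        for tenths in range(1, 11):
            subset = train[: max(1, len(train) * tenths // 10)]
            model = majority_class(subset)
            scores[tenths].append(accuracy(model, test))
    # x-axis: number of training instances; y-axis: mean test accuracy
    full = len(data) - len(data) // n_folds
    return [(full * tenths // 10, sum(a) / len(a))
            for tenths, a in sorted(scores.items())]

# Synthetic two-label dataset just to exercise the loop.
random.seed(0)
data = [((random.random(),), "pos" if random.random() < 0.7 else "neg")
        for _ in range(100)]
curve = learning_curve(data)
```

Plotting the resulting (instances, accuracy) pairs, one per training fraction, gives a learning curve of the kind described above.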



[2] Therefore, you might have to convert a dataset from C4.5 format to ARFF. These formats are described in the links in the excerpt above. If you still have problems after your conversion, let me know; I might have the ARFF file for the dataset you have chosen.


Here are some common problems when converting a dataset to ARFF format:

- There's a missing attribute (this is a major error). The number of columns in the data should match the number of declared attributes in the ARFF file.
- There are unusual symbols in, or following, the names of attributes, e.g., a colon (Var1:Sub1) or descriptors (Var1 (Hz)). Stick to alphanumeric characters and the dash (e.g., Var1-Sub1).
- Attributes declared as string. Weka or, more specifically, the classifiers of Weka, don't handle string attributes. Replace these with integers or enumerations.
- Weka takes the last attribute as the class by default. Override this by specifying which attribute Weka should treat as the class.
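The first pitfall above (column count vs. declared attributes) is easy to check mechanically before loading a file into Weka. The following is a simplified sketch; it ignores quoting, sparse ARFF, and other details of the full format:

```python
def check_arff_columns(text):
    """Return line numbers of @data rows whose column count differs from
    the number of declared @attribute lines. Simplified: assumes plain
    comma-separated rows with no quoting and no sparse ARFF."""
    n_attrs = 0
    in_data = False
    bad_lines = []
    for lineno, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if line.lower().startswith("@attribute"):
            n_attrs += 1
        elif line.lower() == "@data":
            in_data = True
        elif in_data and len(line.split(",")) != n_attrs:
            bad_lines.append(lineno)
    return bad_lines

sample = """@relation demo
@attribute a numeric
@attribute b numeric
@data
1,2
3
"""
bad = check_arff_columns(sample)  # the row "3" has one column, not two
```

Running this on your converted file before opening it in Weka catches the most common conversion error early.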


[3] Here are some guidelines when choosing a dataset:

- Only choose datasets whose associated task is classification, because decision trees are for classification. Do not choose sequential or time-series datasets, because decision trees can't handle them.
- Do not choose relational datasets, because these require predicate logic learners; decision trees and backprop networks only handle propositional logic.
- Do not choose datasets with too few instances, because your learning curves will not be meaningful. It is best to choose a dataset with >100 instances. However, do not choose datasets with too many instances (e.g., thousands of instances), because then the multilayer network learner will take a long time to build a model.
- It is best not to choose a dataset with missing values. If you insist on choosing one with missing values, you must be careful how you treat these. You must study the literature about this dataset, see how missing values were handled, and include those papers and their results in your report. The papers that cited this dataset are listed at the bottom of the dataset page.