Homework 1

CS 9633 Machine Learning

k-nearest neighbor and the leukemia dataset



Date Assigned: January 11, 2003

Date Due: Beginning of class period, February 4, 2003


To be Submitted:

Both electronic and hard-copy versions of the code

Report (2-4 pages) that describes your implementation of knn, describes results with test data and training data for each of the following settings, compares your results to those reported in the original paper, and presents an overall discussion and summary:

  o  All features, unnormalized data, varying values of k

  o  All features, normalized data, best value of k from the previous setting

  o  The 50 genes used by Golub et al., normalized data, best value of k
When varying the values of k, you should use k = 1, 3, 5, and if the accuracy from 3 to 5 has increased, continue to increment k by 2 until the accuracy decreases. Present results in terms of both % correct and a confusion matrix. The report must properly cite any sources used for implementation of knn (including our text), the source of the data, and the paper in which the original results are described.
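
The stopping rule for k and the two reporting formats can be sketched in Python as follows. This is only a sketch under stated assumptions, not a reference solution: `accuracy_for_k` is a hypothetical callable that returns the accuracy of your own knn implementation for a given k, `max_k` is an assumed safety bound (for example, the number of training instances) so the loop ends even if accuracy never drops, and the class labels in the example are placeholders.

```python
from collections import Counter


def sweep_k(accuracy_for_k, max_k):
    """Evaluate k = 1, 3, 5; if accuracy rose from k=3 to k=5, keep adding 2
    until accuracy decreases (or max_k is reached). Returns {k: accuracy}."""
    results = {k: accuracy_for_k(k) for k in (1, 3, 5)}
    k = 5
    if results[5] > results[3]:
        while k + 2 <= max_k:  # max_k is an assumed bound, not part of the handout
            k += 2
            results[k] = accuracy_for_k(k)
            if results[k] < results[k - 2]:
                break
    return results


def percent_correct(actual, predicted):
    """Percentage of instances whose predicted label equals the actual label."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes, in `labels` order."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Placeholder labels and predictions, for illustration only.
actual = ["ALL", "ALL", "AML", "AML"]
predicted = ["ALL", "AML", "AML", "AML"]
matrix = confusion_matrix(actual, predicted, ["ALL", "AML"])  # [[1, 1], [0, 2]]
```

Keeping the sweep separate from the classifier means the same loop can be reused for the unnormalized and normalized experiments by passing in a different `accuracy_for_k`.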


Normalization process (Han and Kamber, 2001, pages 114-116, 339-341)

Golub et al. appear to have used z-score normalization. Standard z-score normalization uses the mean and standard deviation of the values for each attribute to standardize the values for that attribute. The resulting attribute values will have a mean of 0 and a standard deviation of 1. An alternative is to use the mean absolute deviation rather than the standard deviation. If $m_f$ is the mean of the $n$ values $x_{1f}, \ldots, x_{nf}$ of attribute $f$, then the mean absolute deviation $s_f$ is computed as follows:

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$

The use of mean absolute deviation reduces the effect of outliers.


To convert attribute values to z-scores:

1. Compute the mean $m_f$ and either the standard deviation or the mean absolute deviation of the attribute.

2. Compute the z-score for each attribute value $v$ as

$$v' = \frac{v - m_f}{s_f}$$

where $s_f$ is either the standard deviation or the mean absolute deviation.
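
The two steps above can be sketched for a single attribute as follows. This is a minimal illustration under assumptions, not the normalization Golub et al. actually ran: the function name and the `use_mean_abs_dev` flag are hypothetical, the standard deviation is the population form (dividing by n), and a constant attribute is mapped to zeros to avoid dividing by zero.

```python
import math


def zscore_attribute(values, use_mean_abs_dev=False):
    """Return the z-scores v' = (v - m_f) / s_f for one attribute's values."""
    n = len(values)
    m_f = sum(values) / n  # mean of the attribute
    if use_mean_abs_dev:
        # Mean absolute deviation: less sensitive to outliers.
        s_f = sum(abs(v - m_f) for v in values) / n
    else:
        # Population standard deviation (an assumption; n - 1 is also common).
        s_f = math.sqrt(sum((v - m_f) ** 2 for v in values) / n)
    if s_f == 0:
        # Constant attribute: every value equals the mean.
        return [0.0] * n
    return [(v - m_f) / s_f for v in values]


z_std = zscore_attribute([1.0, 2.0, 3.0, 4.0, 5.0])
z_mad = zscore_attribute([1.0, 2.0, 3.0, 4.0, 5.0], use_mean_abs_dev=True)
```

With the standard deviation as $s_f$, the normalized values have mean 0 and standard deviation 1; with the mean absolute deviation they still have mean 0, but a different spread.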


Assignment:


The leukemia dataset of Golub et al., 1999 has been extensively used to test machine learning algorithms. Information about the dataset, along with a copy of the original paper published in Science that describes the dataset, can be found at:

http://www-genome.wi.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43


Your assignment is to implement the k-nearest neighbor learning algorithm and to use this algorithm with the leukemia data set. You should use the test and training datasets as specified at the web site. The attributes of the data set are numeric measurements from gene expression experiments. Each attribute value indicates the level of expression of a gene in tissue taken from individuals with different types of leukemia. The task of the learning algorithm is to learn to predict the type of leukemia from the gene expression profile. There are relatively few data instances, but each instance has a large number of features. Several authors have reported good results using Support Vector Machines with this data set (we will study SVMs later in the semester). A recent abstract indicates that its authors were also able to achieve good results with knn (Hoffmann et al. 2002).

<http://data.mpi-sb.mpg.de/internet/eccb2002.nsf/4e38523048ac3b74c1256c240049e193/b24561212112517bc1256c1500371011?OpenDocument>
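
For orientation, the core of a k-nearest neighbor classifier can be sketched in a few lines. This is a rough sketch under assumptions rather than the required design: Euclidean distance and unweighted majority voting are choices the assignment does not prescribe, the function names are hypothetical, and the tiny two-feature data set below is made up purely to show the call (the real instances have many more features).

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest
    training instances (distance metric and voting rule are assumptions)."""
    # Sort training instances by distance to the query and keep the k closest.
    neighbors = sorted(
        zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Made-up toy data with placeholder class labels.
train_X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = ["ALL", "ALL", "AML", "AML"]
label = knn_predict(train_X, train_y, [0.2, 0.1], k=3)
```

With two classes and odd k, the vote cannot tie, which is one reason the assignment steps k through odd values only.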