CS 9633
Machine Learning
Homework 1
k-nearest neighbor and the leukemia dataset
Date Assigned: January 11, 2003
Date Due: Beginning of class period, February 4, 2003
To be Submitted:
Both electronic and hard-copy versions of the code
Report (2-4 pages) that describes your implementation of k-nn and describes results with test data and training data for:
o All features, unnormalized data, varying values of k
o All features, normalized data, best value of k from the previous experiment
o The 50 genes used by Golub et al., normalized data, best value of k
You should compare your results to those reported in the original paper, and the report should present an overall discussion and summary.
When varying the values of k, you should use k = 1, 3, 5, and if the accuracy has increased from k = 3 to k = 5, continue to increment k by 2 until the accuracy decreases. Present results in terms of both % correct and a confusion matrix.
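The search over k and the two reporting formats can be sketched as follows. This is only an illustration, not a required design: accuracy_for_k is a hypothetical callable that runs your own k-nn implementation and returns its accuracy, max_k is an assumed safety cap (the handout does not say what to do if accuracy never drops), and the class labels in the usage note are example values.

```python
from collections import Counter


def search_k(accuracy_for_k, max_k=25):
    """Evaluate k = 1, 3, 5, then keep stepping by 2 while accuracy rises.

    accuracy_for_k: hypothetical callable mapping k to an accuracy value.
    max_k: assumed cap so the loop ends even if accuracy never decreases.
    """
    results = {k: accuracy_for_k(k) for k in (1, 3, 5)}
    k = 5
    # Continue only if accuracy increased from k = 3 to k = 5.
    if results[5] > results[3]:
        while k + 2 <= max_k:
            k += 2
            results[k] = accuracy_for_k(k)
            if results[k] < results[k - 2]:
                break  # accuracy decreased, so stop here
    return results


def percent_correct(actual, predicted):
    """Percentage of instances whose predicted label equals the actual label."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]
```

For example, confusion_matrix(["ALL", "ALL", "AML"], ["ALL", "AML", "AML"], ["ALL", "AML"]) returns [[1, 1], [0, 1]]: one of the two actual ALL instances was misclassified as AML.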
The report must properly cite any sources used for implementation of knn (including our text), source of the data, and the paper in which the original results are described.
Normalization process (Han and Kamber, 2001, pages 114-116, 339-341)
Golub et al. appear to have used z-score normalization. Standard z-score normalization uses the mean and standard deviation of the values for each attribute to standardize the values for that attribute. The resulting attribute values will have a mean of 0 and a standard deviation of 1. An alternative is to use the mean absolute deviation rather than the standard deviation. If m_f is the mean of the n values v_1, ..., v_n of an attribute, then the mean absolute deviation s_f is computed as follows:
s_f = (1/n) (|v_1 - m_f| + |v_2 - m_f| + ... + |v_n - m_f|)
The use of mean absolute deviation reduces the effect of outliers.
To convert attribute values to z-scores:
1. Compute the mean m_f and either the standard deviation or the mean absolute deviation of the attribute.
2. Compute the z-score for each attribute value v as
v' = (v - m_f) / s_f
where s_f is either the standard deviation or the mean absolute deviation.
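The conversion steps above can be sketched for a single attribute as follows. This is a minimal sketch under stated assumptions: the function name zscore_column and the use_mad flag are hypothetical, the standard deviation is taken in its population form (dividing by n), and an attribute with zero spread is mapped to all zeros. None of these choices is specified in the handout.

```python
import math


def zscore_column(values, use_mad=False):
    """Return z-scores for one attribute's values.

    use_mad=False: s_f is the standard deviation (population form, assumed).
    use_mad=True:  s_f is the mean absolute deviation.
    """
    n = len(values)
    m_f = sum(values) / n  # mean of the attribute
    if use_mad:
        s_f = sum(abs(v - m_f) for v in values) / n
    else:
        s_f = math.sqrt(sum((v - m_f) ** 2 for v in values) / n)
    if s_f == 0:
        # Assumption: a constant attribute carries no information, so use 0.
        return [0.0] * n
    return [(v - m_f) / s_f for v in values]
```

Calling zscore_column once per gene (column) normalizes the whole dataset; passing use_mad=True switches to the outlier-resistant variant.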
Assignment:
The leukemia dataset of Golub et al., 1999 has been extensively used to test machine learning algorithms. Information about the dataset, along with a copy of the original paper published in Science that describes the dataset, can be found at:
http://www-genome.wi.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43
Your assignment is to implement the k-nearest neighbor learning algorithm and to use this algorithm with the leukemia data set. You should use the test and training datasets as specified at the web site. The attributes of the data set are numeric measurements from gene expression experiments. Each attribute value indicates the level of expression of a gene in tissue taken from individuals with different types of leukemia. The task of the learning algorithm is to learn to predict the type of leukemia from the gene expression profile. There are relatively few data instances, but each instance has a large number of features. Several authors have reported good results using Support Vector Machines with this data set (we will study SVMs later in the semester). A recent abstract indicates that the authors were also able to achieve good results with knn (Hoffmann et al. 2002).
<http://data.mpi-sb.mpg.de/internet/eccb2002.nsf/4e38523048ac3b74c1256c240049e193/b24561212112517bc1256c1500371011?OpenDocument>