IMPUTATION OF MISSIN
G DATA FOR INPUT TO
BOB L. WALL
Montana State University
Department of Computer Science
JEFF K. ELSER
Montana State University
Department of Computer Science
The standard formulation of support vector machines (SVMs) does not allow
for missing values for any of the attributes in an example being learned or
classified. We examine preprocessing methods to prepare data sets containing
ing values for use by an SVM. These methods can be separated into two
categories: ignoring the missing data (discarding training examples with a
missing attribute or discarding attributes that have missing values), or using a
process known as
to generate missing values. A simple imputation
method is just to use the average value for the attribute, but there are more
robust techniques, including a k
nearest neighbors (KNN) approach and the use
of a separate SVM to determine the likely value.
We compare the performance
of an SVM on test datasets containing missing values using each of the above
techniques, and finally we propose a topic for further research
method of handling the missing values directly within the SVM learning
The standard formulation of support vector machines (SVMs) does not
allow for missing values for any of the attributes in an example being learned or
classified. We examine methods by which data sets containing missing values
can be p
rocessed using an SVM. This is typically accomplished by one of two
means: ignoring the missing data (either by discarding examples with a missing
attribute value or discarding an attribute that has missing values), or using a
process generally referred
, by which a value is generated for the
attribute. These techniques are typically carried out on the data prior to its being
supplied to the learning algorithm. A simple imputation method is just to use
the average value for the attribute
, but there are more robust techniques that are
likely to supply a value closer to the one that is missing, including a k
neighbors (KNN) approach and the use of a separate SVM to determine the
We compare the performance of an SVM on
test datasets containing
missing values using the following techniques: ignoring examples with missing
values; ignoring attributes with missing values; imputing the missing values
using the attribute’s mean value; imputing the missing values using KNN; an
imputing the missing values using a separate SVM.
In order to have control over the variables in the experiments, we wrote a
data set generator. The generator calculated the sum of the squares of all the
values in a row of data. If the sum wa
s less than or equal to one, the example
belonged to the class. If the sum was greater than one, it did not belong to the
class. This creates a hyper
sphere shape which is not linearly separable. This is
necessary because the linearly classifiable data
we started with was not causing
any error. The error caused by the harder data sets allows us to see differences
in performance better. The number of positive and negative training examples
and test cases were controlled explicitly. The number of attrib
utes was also
explicitly stated for each set.
Two data sets had 20 examples, four had 50 examples, two had 100
examples, one had 1000, and one had 10,000. Each had an equal number of
positive and negative examples. The number of attributes varied between
MEASURING SVM PERFOR
When we generated the training data sets, we also generated test data sets
with the same distribution for each attribute. After training an SVM using the
training data, we tested it against the test data set an
d measured the accuracy.
In the generated data, a random attribute was picked and then rows of that
attribute were randomly picked to be missing. The number of columns and rows
used were explicitly controlled by us, but the act
ual position of the columns and
rows was random.
HANDLING MISSING VAL
Eliminating Missing Values
The first two methods by which we deal with missing values in the data set
are very simple: just ignore training examples that contain missing values, or
gnore attributes that have missing values in any of the training examples.
Note that the first method causes problems if any examples to be classified
after learning is complete contain missing values; they must either be ignored, or
the missing values mus
t be generated using one of the following methods.
Replacing with Mean / Mode Value
This technique is not significantly more complicated than the first two. If a
training example has a missing value, that value is replaced by the “most
common” value for t
hat attribute. For continuous attributes, we use the mean or
average of all the specified values for that attribute in the training data set. (This
does impose the requirement that two passes be made over the data, one to
determine the mean and another to
update any missing values.) For discrete
attributes, we use the mode, or most commonly occuring, value for that attribute.
Nearest Neighbors (KNN) Imputation
For this implementation, if a training example contains one or more missing
values, we measure
between the example and all other examples
that contain no missing values. Our distance metric is a modified version of the
the distance between two examples is the sum of the
distances between the corresponding attribut
e values in each example. For
discrete attributes, this distance is 0 if the values are the same, and 1 otherwise.
In order to combine distances for discrete and continuous attributes, we perform
a similar distance measurement for continuous attributes
if the absolute
difference between the two values is less than half a standard deviation, the
distance is 0; otherwise, it is 1.
The K complete examples closest to the example with missing values are
used to choose a value. For a discrete attribute, the
most frequently occurring
value is used. For a continuous attribute, the average of the values from the K
neighbors is used. For our experiments, we used K = 5.
An SVM can be used to impute the missing values as well. For each
n the training set that has missing values, we train an SVM using all of
the training examples that have no missing values; we ignore the original
classification value from the data set and use the value of the attribute being
imputed is the target value.
We also ignore any other attributes that have any
missing values when generating this new training data.
If the attribute is a continuous value, we use SVM
to learn the data. If the attribute is discrete with only two values, a s
mode can be used. For a discrete attribute with more than
two values, special handling is necessary
standard SVMs, including SVM
Light, do not do multi
category classification. We use the standard technique of
to handle multi
valued discrete attributes; if the attribute
values, we generate
separate training data sets; in data set
classification value for each example indicates whether or not the example had
for the attribute.
fter an SVM is trained on each data set, we then use that SVM to classify
or perform regression on the examples with missing values for the attribute.
Again, we ignore any other attributes that have missing values. If the attribute
being imputed is conti
nuous, the SVM will use regression to generate the value.
If the attribute is continuous, we need to classify it with each of the
models and select the value corresponding to the SVM that classifies the
example as positive. If more than one SVM gen
erates a positive classification,
we randomly select one value.
The results of processing each of our data sets containing missing data
using each of the above methods is summarized below.
For the generated data, we averaged the accuracies for 10
varied data sets,
for each imputation technique. As would be expected, the control where the
actual data was left in had the best accuracy. The next best technique was using
information from the model to pick values. This involved replacing missing
es with the expected center of the model. The use of an SVM to impute the
values was next best. Effectivly tieing for fourth were three more realistic
techniques: discard attributes with missing values, k
nearest neighbor, and
replacing values with the
mean. Since it is would be uncommon to have
information about the model to simply pick good values, these last three are
probably the dominant choices. Replacing with a 1 or
1 were inferior as
expected. The worst technique was discarding rows. This co
uld be partly
because we chose our missing values by focusing on single columns. That
would cause the missing values to be distributed more evenly between the rows,
and thus cause a greater loss of data upon removing those rows.
Replace with 0
Replace with 1
TOPICS FOR FURTHER R
It is obviously of significant value in certain
data sets to avoid throwing
away examples because one or more of their attribute values are missing. Other
attributes that do have values might be significant to learning the classification
criteria. However, if possible, care should be taken to avoid in
troducing error to
the classification by assigning an incorrect value for the missing attribute.
It may be possible to take advantage of the way in which the SVM
computes the hyperplane that separates categories to better handle missing
Consider the theoretical basis for SVM learning
negative examples are mapped to a higher
dimensional Euclidean space (i.e. a
space in which the
is defined) in which it is possible to separate
the positive and negative examples
with a single hyperplane. The two groups of
examples are each enclosed inside a convex hull, and the hyperplane is built by
drawing a line between the closest points on the two convex hulls, and drawing
a plane perpendicular to this line halfway between t
he convex hulls.
Consider adding a new data point which has a missing value for an
attribute. When this data point is mapped to the higher
dimensional space, if it
lies within the boundaries of its classification’s convex hull, regardless of the
ch could be assigned to the missing attribute, or if it extends the
convex hull in a direction away from the convex hull enclosing the other
classification, the point can be ignored; it cannot be a support vector. If the data
point is mapped in such a way
that it extends the convex hull in a direction
toward the other convex hull, it may be significant. If varying the value of the
missing attribute within some range of values moves the point along a line that
is parallel to the closest face of the other c
onvex hull, then it is safe to assign any
of those values to the missing attribute.
The difficulty with this analysis is that the SVM does not actually map the
points to the higher
dimensional space; it evaluates a
chosen such that
its result is proportional to the inner product in the higher
dimensional space. The values obtained from the kernel function are supplied to
quadratic programming optimization
algorithm to select the support vectors.
In order to handle examples with
missing values within the SVM as it is
learning, it might be possible to somehow tag the values and generate a range of
results from the kernel evaluation, then modify the quadratic optimization
algorithm to determine whether the point should be a candidat
e for a support
For a detailed introduction to the workings of SVMs, see (Chen, 2003). For
more detailed analysis of SVMs and their applications, see (Cristianini, 2000)
and (Schölkopf, 2002).
Surprisingly, the K
Nearest Neighbors and
SVM imputation techniques did
not perform significantly better than the technique of replacing the missing
values with the attribute mean, at least for these data sets. For other more
realistic data sets, we would expect to possibly see increased accurac
y using the
more complicated imputation techniques; however, experiments on a database of
automobile data demonstrated similar results (the mean value and KNN
techniques were slightly better than ignoring attributes with missing values).
We encountered so
me other problems with using the SVM imputation technique
that prevented a meaningful comparison of results on that data set. One problem
convergence in the SVM optimization algorithm; normalizing the
values of each continuous attribute to the ran
ge [0, 1] helped somewhat, but
normalizing to [
1,1] might provide better results.
All in all, more experimentation is probably required to better gauge
whether the additional effort required to use an SVM to impute missing values is
worthwhile. It would
also be useful to attempt to find heuristics to characterize
the data that would act as a guide for choosing the most appropriate imputation
It would definitely be worthwhile to pursue research into incorporating the
handling of missing values dire
ctly into the SVM learning algorithm. However,
there is no guarantee that this will lead to a computationally feasible algorithm.
The experiments conducted for this research used the SVM
written by Thorsten Joachims (Joachi
ms, 1999). For information about the
package, and to download, visit http://svmlight.joachims.org/.
Chen, P.H., C.
J. Lin, and B.
Schölkopf, 2003, “A Tutorial on Nu
Support Vector Machines,”
Cristianini, N. and J. Shawe
An Introduction to Support Vector Machines
University Press, 2000.
Joachims, T., 1998, “Making Large
Scale SVM Learn
ing Practical,” in B. Schölkopf, C. Burges,
and A. Smola (eds.),
Advances in Kernel Methods
Support Vector Learning
Cambridge, MA, 1999.
Schölkopf, B. and A. J. Smola, 2002,
Learning with Kernels,
MIT Press, Cambridge, MA, 2002.