Applying Set Covering Problem in Instance Set Reduction for Machine
Learning Algorithms
PRASANNA K, KHEMANI D.
Computer Science and Engineering Department
Indian Institute of Technology Madras
Chennai 600036, Tamil Nadu
INDIA
Abstract: Many machine learning algorithms suffer from the drawback of storing all training instances. Various algorithms have been proposed to retain instances selectively, but almost all of them have been for the classification task. We deviate from these approaches and present a strategy based only upon the distance between the instances. We show that the problem of instance selection is analogous to the well-known set covering problem. We apply the greedy and genetic algorithm based solutions proposed for set covering to instance selection, and discuss our preliminary results regarding the performance of these approaches.
Key Words: training set reduction, instance selection, data preprocessing
1 Introduction
Recently there has been a lot of focus on the maintenance aspect of various machine learning tasks. Algorithms for learning from data can be broadly classified as model based and instance based. In model based approaches the given past data is used to learn a model, such as a decision tree, which is consistent with the training set. In instance based approaches the available training data are stored as such, without any processing, and the instances relevant to a query are retrieved based on a notion of distance.
There are various disadvantages of using a large training set in learning algorithms. In the context of model based approaches, they are the computational cost of the learning algorithm and the labeling cost of the instances, since class labeling requires valuable expert assistance and time [6]. Instance based approaches commonly use the k-nearest neighbor (k-NN) algorithm for retrieval. Although the k-NN algorithm is simple to implement and allows for incremental learning, its time and space requirements are O(n) in the number of stored instances, which is clearly undesirable for a large training set.
Various methods have been proposed to overcome these disadvantages of a large training set. We shall see how the issue of instance set reduction is related to the well-known set covering problem (SCP) and show how to adapt a solution of SCP to it. We discuss recent work on reducing the size of the instance set in the next section and, in the following sections, see how the problem can be approached using solutions prescribed for SCP. Finally, we discuss the results of experiments done on a synthetic data set.
2 Related Work
Almost all the approaches for training set reduction build the reduced target set incrementally, i.e. they consider the instances of the original unedited set one after another and decide whether to retain each instance based on certain criteria. They differ mainly in the order in which the instances are considered at each iteration and in the factor that is used to decide the usefulness of a particular instance. Approaches based on editing rules [1], such as Condensed Nearest Neighbor (CNN), Reduced Nearest Neighbor (RNN) and Wilson editing, concentrate on keeping the edited set consistent with the original set while noisy and redundant instances are removed. The main drawback of these approaches is that the quality of the reduced set is dependent on the order in which the instances are considered. The authors of [1] have proposed three storage reduction schemes, RT1, RT2 and RT3, which are order independent. RT1 is based on a simple rule: an instance is removed if its removal has no effect on the generalization accuracy of the classifier. RT2 and RT3 improve on the basic RT1 by adding instance-ordering and noise removal steps.
In the context of case-based reasoning (CBR) [7], the issue of instance set reduction is addressed under the issue of case-base maintenance. Recently, the authors of [5] have proposed a reduction scheme for CBR called iterative case filtering (ICF), which retains the cases on class boundaries and deletes the interior cases. In [8, 9] the authors have proposed a model for measuring the competence of a case-base and used it to identify critical instances that need to be stored and redundant instances that can be discarded.
Almost all these approaches assume that class information is available so that it can be used in validating the reduced set. But in real world problems, such as monitoring of manufacturing processes and monitoring sensor data from patients, satellites or robots, the task is prediction of a value rather than classification. In those situations there are no classes. We address this problem of obtaining a representative set of instances from a large set of instances and assume that the only information available is the distance between the instances. We further assume that a distance threshold is given, so that two instances are considered similar and close enough if and only if the distance between them does not exceed this threshold.
3 Instance Set Reduction as SCP
3.1 Set Covering Problem
The set covering problem (SCP) is the problem of covering the rows of an m-row, n-column, zero-one matrix by a subset of the columns at minimal cost. Formally, let A be the zero-one matrix <a_ij>, where a_ij = 1 indicates that the j-th column covers the i-th row and a_ij = 0 otherwise. Let x_j = 1 indicate that the j-th column is in the solution and x_j = 0 otherwise, and let c_j be the cost of selecting the j-th column. Then the SCP is

Minimize $\sum_{j=1}^{n} c_j x_j$ (1)

Subject to $\sum_{j=1}^{n} a_{ij} x_j \ge 1, \quad i = 1, \dots, m, \qquad x_j \in \{0, 1\}, \quad j = 1, \dots, n$ (2)

Equation 1 is the actual goal, i.e. to minimize the total cost of the columns chosen in the solution, and equation 2 ensures that each row is covered by at least one column in the solution. If all columns have the same cost, the objective reduces to choosing the minimum number of columns and the problem is called unicost SCP. The set covering problem is known to be NP-complete [4].
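As a small illustration of equations 1 and 2, the following sketch checks whether a 0/1 selection vector x is a cover and computes its cost. The function names and the 3-by-3 instance are hypothetical and chosen only for illustration.

```python
def is_cover(a, x):
    """Constraint (2): every row i has at least one selected column j with a[i][j] == 1."""
    return all(sum(a_ij * x_j for a_ij, x_j in zip(row, x)) >= 1 for row in a)


def total_cost(c, x):
    """Objective (1): summed cost of the selected columns."""
    return sum(c_j * x_j for c_j, x_j in zip(c, x))


# Hypothetical 3-row, 3-column unicost instance.
a = [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 0]]
c = [1, 1, 1]

print(is_cover(a, [1, 0, 1]), total_cost(c, [1, 0, 1]))  # True 2
print(is_cover(a, [0, 0, 1]))                            # False: the third row is uncovered
```

In the unicost case the cost is simply the number of selected columns, which is the quantity minimized in the rest of the paper.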
3.2 Mapping Instance Set Reduction to SCP
Given a set of N instances C = {c_1, c_2, …, c_N}, let dist() be the distance function such that dist(c_i, c_j) is the distance of c_j from c_i. Let τ be the distance threshold and let COVERS(c_i, c_j) be a predicate which indicates that instance c_i covers c_j; it is true if the distance of c_j from c_i is at most τ, i.e.

COVERS(c_i, c_j) = true iff dist(c_i, c_j) ≤ τ (3)
Now our problem is to choose a minimum number of instances from the actual set which cover the entire set. Let R be the reduced set {r_1, r_2, …, r_k}; then

Minimize $k = |R|$ such that $R \subseteq C$ and $\forall c_j \in C \;\; \exists r_i \in R : \mathrm{COVERS}(r_i, c_j)$
Given the distance threshold τ and the distance matrix <d_ij>, where d_ij is the distance between c_i and c_j, we can create a binary matrix B such that b_ij = 1 iff d_ij ≤ τ. B is the N×N coverage matrix in which b_ij = 1 indicates that the i-th instance covers the j-th instance in the actual set. Our goal is to select a minimal set of instances such that all instances are covered. Thus the actual problem is simply a special case of unicost SCP where m = n, i.e. the number of columns is the same as the number of rows.
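The construction of the coverage matrix can be sketched as follows. This is a minimal illustration assuming 2-dimensional points and Euclidean distance; the function name coverage_matrix and the sample points are ours and not part of the original implementation.

```python
import math


def coverage_matrix(points, tau):
    """Return B with b[i][j] = 1 iff the Euclidean distance d_ij is at most tau."""
    return [[1 if math.dist(p, q) <= tau else 0 for q in points]
            for p in points]


# Hypothetical instances: two close points and one distant point.
points = [(0.0, 0.0), (0.3, 0.0), (1.0, 1.0)]
b = coverage_matrix(points, tau=0.5)
for row in b:
    print(row)
```

Since every instance is at distance 0 from itself, the diagonal of B is all ones, so the full set always covers itself and a feasible solution always exists.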
3.3 Greedy Algorithm
The greedy algorithm is a simple yet powerful heuristic for SCP which produces near-optimal solutions. It is outlined in Fig 1.
Input: the set of instances I and the coverage matrix B
Output: the set of representative instances R (initially empty)
1. Select the instance i which covers the maximum number of instances uncovered so far.
2. Add the instance to R: R = R ∪ {i}.
3. If some instance is still uncovered, go to step 1.
4. Return R.
Fig 1. Greedy algorithm
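The steps of Fig 1 can be sketched in a few lines. This is only an illustrative version, assuming the coverage matrix is given as a list of 0/1 rows; ties are broken in favor of the lowest index, which is our choice and not something the paper specifies.

```python
def greedy_cover(b):
    """Greedy unicost set cover on an N x N coverage matrix b (list of 0/1 rows)."""
    n = len(b)
    uncovered = set(range(n))
    selected = []
    while uncovered:
        # Step 1: instance covering the most still-uncovered instances.
        best = max(range(n), key=lambda i: sum(1 for j in uncovered if b[i][j]))
        gained = {j for j in uncovered if b[best][j]}
        if not gained:
            raise ValueError("some instance cannot be covered by any instance")
        # Step 2: add it to R and mark what it covers.
        selected.append(best)
        uncovered -= gained
    return selected


# Hypothetical coverage matrix for four instances.
b = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 1]]
print(greedy_cover(b))  # [1, 3]
```

Instance 1 covers instances 0, 1 and 2, and the isolated instance 3 can only be covered by itself, so two representatives suffice here.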
The greedy algorithm proceeds by selecting the instance that seems to be the best choice at each iteration and hence approaches the problem of global optimization by making choices locally. The downside of using the greedy algorithm for our actual problem of instance set reduction is that the distance threshold specified is only approximate, while the resultant set from the greedy approach is heavily dependent on this threshold. If the thresholds used for reduction and validation are different, then the accuracy is considerably different. This point is elaborated while discussing the results.
The authors of [4] proposed a genetic algorithm based approach for solving SCP and showed that its performance is better for large problems. This encouraged us to investigate the prospect of using genetic algorithms for instance set reduction.
4 Genetic Algorithm Approach
Genetic algorithms are randomized search and optimization techniques which are guided by the theory of evolution of species and the concept of survival of the fittest. Genetic algorithms have been shown to perform well in optimization problems involving a number of parameters. The theory of evolution states that a population of a species evolves over time according to the principle of natural selection and survival of the fittest. Individuals that are more fit and successful in adapting to the environment have a better chance of survival, whereas the least fit individuals are less likely to survive into the next generation. Individuals combine with others and produce offspring. Thus, with the passage of time, the individuals become fitter and the population as a whole evolves to become better.
For any genetic algorithm, a suitable coding/representation scheme for the problem needs to be devised. A fitness function, which assigns a figure of merit to each coded solution, is required. Mechanisms for selecting parents and recombining them to produce children also need to be devised. The basic steps of a simple GA are shown in Fig 3. A comprehensive overview of genetic algorithms can be found in [2, 3].
Generate an initial population;
Evaluate fitness of individuals in the population;
Repeat
    Select parents from the population;
    Recombine parents and produce children;
    Replace population with newly formed children;
Until a satisfactory solution is found or a user-defined condition is met;
Fig 3. Basic steps in a genetic algorithm
4.1 Representation
The solution is represented in the form of a string, called a chromosome in genetic terminology. Although various other types of encoding can be used, binary encoding is used here, as GAs have been shown to perform well under binary schemes.
The encoding scheme is as follows: given N cases, the solution string B is an N-bit binary string

Solution B = $b_1 b_2 b_3 \dots b_{N-1} b_N$

such that $b_i = 1$ indicates $c_i \in R$, i.e. the i-th case is selected in the reduced set; it is 0 otherwise. The number of selected instances is $k = \sum_{i=1}^{N} b_i$.
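The binary encoding can be illustrated with a small sketch that converts between a chromosome and the reduced set it represents. The helper names encode, decode and count_selected are hypothetical, and instances are identified by 0-based index here.

```python
def encode(selected_indices, n):
    """Build the N-bit chromosome with bit i set iff instance i is in the reduced set."""
    chosen = set(selected_indices)
    return [1 if i in chosen else 0 for i in range(n)]


def decode(bits):
    """Return the indices of the instances selected by the chromosome."""
    return [i for i, bit in enumerate(bits) if bit == 1]


def count_selected(bits):
    """k, the number of selected instances, is the sum of the bits."""
    return sum(bits)


chromosome = encode([1, 3], n=5)
print(chromosome, decode(chromosome), count_selected(chromosome))
```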
4.2 Population Initialization
The population of a GA is usually initialized randomly, such that each bit in the string takes the value 0 or 1 with equal probability.
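A minimal sketch of this initialization is given below. The function name and the use of a seeded random.Random generator are our assumptions, made only so that the example is reproducible.

```python
import random


def init_population(pop_size, n_bits, rng):
    """Create pop_size chromosomes whose bits are 0 or 1 with equal probability."""
    return [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]


rng = random.Random(0)  # fixed seed, for reproducibility of the illustration only
population = init_population(pop_size=50, n_bits=8, rng=rng)
print(len(population), len(population[0]))
```

Randomly initialized strings need not cover the whole instance set, so in practice they would have to pass through the feasibility check described in Section 4.6.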
4.3 Fitness Evaluation
The computation of the fitness function is the most important factor governing the performance of a genetic algorithm. For a particular chromosome the fitness function returns a single numerical value, which is supposed to be proportional to the 'ability' or 'utility' of the solution represented by the chromosome.
In our case the fitness function depends simply on the number of instances selected as representatives in the final reduced set. Thus the fitness value of a chromosome i is
fitness(i) = [expression missing in the source]
where k, as indicated above, is the number of selected representatives, and the objective of the GA is to find a solution with maximum fitness. Observe that the fitness of a chromosome is lower if k is greater and vice versa.
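The exact fitness expression is not legible in this copy of the paper; the only stated property is that fitness decreases as k increases. The sketch below therefore uses N - k purely as an assumed stand-in with that property. It should not be read as the authors' formula.

```python
def fitness(bits):
    """ASSUMED form N - k: larger when fewer instances are selected.

    The paper only states that fitness decreases as k grows; N - k is one
    simple function with that property, chosen here for illustration.
    """
    n = len(bits)
    k = sum(bits)
    return n - k


print(fitness([1, 0, 0, 1]))  # 2
print(fitness([1, 1, 1, 1]))  # 0
```

Any other decreasing function of k, such as 1/k for k > 0, would preserve the ranking of chromosomes but would change the selection probabilities under proportional selection.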
4.4 Parent Selection
The selection process selects individuals from the population based on the survival-of-the-fittest concept of natural genetic systems. We adopted the proportional selection scheme, whereby each individual is assigned a selection probability based on its fitness value, such that highly fit individuals are more likely to be selected for mating.
Roulette wheel selection is the common technique used to implement proportionate selection. The probability of an individual i being selected is

$p_i = \mathrm{fitness}(i) \, / \, \sum_{j} \mathrm{fitness}(j)$

where the sum runs over all individuals j in the population. The greater the fitness of the individual, the higher the probability of it being selected as a parent.
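Roulette wheel selection can be sketched as follows. This is an illustrative version that assumes non-negative fitness values; the fallback to uniform selection when all fitness values are zero is our assumption and is not discussed in the paper.

```python
import random


def roulette_select(fitnesses, rng):
    """Return index i with probability fitnesses[i] / sum(fitnesses)."""
    total = sum(fitnesses)
    if total <= 0:
        # Degenerate wheel (assumed handling): pick uniformly at random.
        return rng.randrange(len(fitnesses))
    spin = rng.random() * total  # position on the wheel, in [0, total)
    running = 0.0
    for i, f in enumerate(fitnesses):
        running += f
        if spin < running:
            return i
    return len(fitnesses) - 1  # guard against floating point round-off


rng = random.Random(1)
picks = [roulette_select([0, 1, 3], rng) for _ in range(2000)]
print(picks.count(0), picks.count(1), picks.count(2))
```

With fitness values 0, 1 and 3, the first individual is never chosen and the third is chosen about three times as often as the second.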
4.5 Reproduction
In the reproduction phase two individuals are selected as parents and are recombined to create children. The recombination of the parents' chromosomes is aimed at the exchange of gene information between them and is typically done through the mechanisms of crossover and mutation.
Crossover takes two individuals and exchanges genes between them to produce children. Various schemes, such as single point, double point and uniform crossover, have been proposed for recombining the parents.
Fig. 4. Illustration of single-point crossover (Parent1 and Parent2 are cut at the crossover point to produce Child1 and Child2)
We adopted the single point crossover scheme in our implementation. It is illustrated in Fig. 4. In single point crossover, the two chromosome strings representing the individuals are cut at a randomly chosen position to create four parts, two 'heads' and two 'tails'. The heads and tails are then interchanged to produce two full-length chromosomes, each of which has inherited genes from both parents. Usually a probability μ_c, called the crossover probability, is associated with this scheme. It is the likelihood of crossover being applied. If crossover is not applied, the parents are simply duplicated in the next generation.
Mutation is applied to each child after crossover. It randomly alters a bit in the child string with a very small probability, called the mutation probability μ_m. Mutation is used to provide a small amount of random search, such that all points in the solution space have a non-zero probability of being explored. In our case, mutation randomly adds a new case to the solution or removes a selected case from it.
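The two operators can be sketched as below. The function names are hypothetical, and applying the mutation probability independently to every bit is an assumption on our part, since the text does not say whether μ_m is applied per bit or per chromosome.

```python
import random


def single_point_crossover(parent1, parent2, mu_c, rng):
    """With probability mu_c swap the tails of the parents at a random cut point;
    otherwise return copies of the parents."""
    if len(parent1) < 2 or rng.random() >= mu_c:
        return parent1[:], parent2[:]
    cut = rng.randint(1, len(parent1) - 1)  # both head and tail are non-empty
    child1 = parent1[:cut] + parent2[cut:]
    child2 = parent2[:cut] + parent1[cut:]
    return child1, child2


def mutate(bits, mu_m, rng):
    """Flip each bit independently with probability mu_m (assumed per-bit scheme)."""
    return [1 - bit if rng.random() < mu_m else bit for bit in bits]


rng = random.Random(0)
c1, c2 = single_point_crossover([0] * 6, [1] * 6, mu_c=0.6, rng=rng)
print(c1, c2)
print(mutate(c1, mu_m=0.1, rng=rng))
```

Flipping a 0 bit to 1 adds a case to the candidate solution, and flipping a 1 bit to 0 removes a selected case, which matches the description of mutation above.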
4.6 Feasibility Check
Applying the genetic operators to the parents may result in children which are not solutions. For example, in our case the child string might select a set of representatives which does not cover the entire instance set. So, after the creation of each child, we have to check whether it represents a valid solution; if it does not, we need to make it valid by adding as few instances as possible, i.e. by changing as few 0 bits to 1 as possible. The heuristic used to make it feasible is the greedy approach of first selecting the instance which covers the maximum number of uncovered instances, and so on. It is the same as outlined in Fig 1, but here the chance of a bad selection is lower, as both the number of uncovered instances and the number of choices are smaller.
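The repair step can be sketched as follows, assuming the coverage matrix is a list of 0/1 rows as defined in Section 3.2. The function name repair and the tie-breaking by lowest index are our choices, made for illustration.

```python
def repair(bits, b):
    """Make a chromosome feasible by greedily setting 0 bits to 1 until
    every instance is covered by some selected instance."""
    n = len(bits)
    child = bits[:]
    uncovered = {j for j in range(n)
                 if not any(child[i] and b[i][j] for i in range(n))}
    while uncovered:
        candidates = [i for i in range(n) if not child[i]]
        if not candidates:
            raise ValueError("no unselected instance left to add")
        # Unselected instance covering the most still-uncovered instances.
        best = max(candidates, key=lambda i: sum(1 for j in uncovered if b[i][j]))
        gained = {j for j in uncovered if b[best][j]}
        if not gained:
            raise ValueError("some instance cannot be covered")
        child[best] = 1
        uncovered -= gained
    return child


# Hypothetical coverage matrix for four instances.
b = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 1]]
print(repair([1, 0, 0, 0], b))  # [1, 1, 0, 1]
print(repair([0, 1, 0, 1], b))  # already feasible, returned unchanged
```

The repaired child is feasible but may contain redundant selections, such as instance 0 in the first example; removing them would need an additional pruning pass that the paper does not describe.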
4.7 Population Replacement
Once the new strings are created, they need to be added to the population at the expense of pre-existing individuals. We adopted the steady state replacement strategy, in which the two least fit individuals in the population are replaced by the two newly created children.
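A minimal sketch of steady state replacement is shown below. The fitness function passed in is a hypothetical stand-in (N - k), and returning the population sorted by fitness is a side effect of this particular implementation rather than part of the strategy.

```python
def steady_state_replace(population, children, fitness_fn):
    """Replace the len(children) least fit individuals with the children,
    keeping the population size unchanged."""
    ranked = sorted(population, key=fitness_fn, reverse=True)  # fittest first
    keep = ranked[:max(0, len(population) - len(children))]
    return keep + [child[:] for child in children]


def assumed_fitness(bits):
    # ASSUMPTION: N - k stands in for the paper's fitness function.
    return len(bits) - sum(bits)


population = [[1, 1, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]]
children = [[0, 1, 0], [1, 0, 0]]
print(steady_state_replace(population, children, assumed_fitness))
# [[0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```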
4.8 Termination
In order to keep our implementation simple, we ran the GA for a fixed number of generations in all our experiments. We observed that a good solution is usually obtained within a few iterations.
5 Experiments
5.1 Experimental Setup
The performance and accuracy of the reduced instance set are evaluated based on the closeness of the retrieved result to the query. Since the reduced set is built using distance threshold information, the accuracy is also determined based on this notion. That is, we assume that a retrieval is a success if the retrieved result is within a threshold distance of the query. In order to keep our implementation of the GA simple, we fixed the various GA parameters throughout our experiments as shown below.
Population size = 50
Number of generations = 30
Crossover type = Single point crossover
Crossover probability = 0.6
Mutation probability = 0.1
5.1.1 Synthetic data generation
Before testing on real datasets we wanted to test on a synthetic dataset. For preliminary testing purposes we assumed that the attributes are real-valued, and for easy visualization of results we fixed the number of attributes at 2. So, in effect, we generated a 2-dimensional dataset and used Euclidean distance as the distance function. In real world domains, cases or instances are distributed in clusters, so that there are regions of the space with a high density of cases and regions with very few cases. We simulated this by choosing k (x, y) pairs as cluster centers and distributing points randomly around the centers. We experimented with values of k from 3 to 5 and obtained similar results. Using this method, we generated 80% of the points for the training dataset and 20% of the points for the test set. We used both the greedy approach and the GA approach to reduce this training set and evaluated the performance of both the reduced set and the full training set using the test set. We considered a retrieval a success if the distance between the instance retrieved from the training set and the query instance is less than a value called the validation threshold. In our results, we shall see the effect of using different thresholds for compaction and validation.
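The data generation and the success criterion can be sketched as follows. All names, the cluster centers, the cluster half-width and the seed are hypothetical; points are drawn uniformly from a square around each center, which is one way to obtain clusters with fixed boundaries.

```python
import math
import random


def make_clusters(centers, points_per_cluster, half_width, rng):
    """Scatter points uniformly in a square of side 2 * half_width around each center."""
    return [(cx + rng.uniform(-half_width, half_width),
             cy + rng.uniform(-half_width, half_width))
            for cx, cy in centers
            for _ in range(points_per_cluster)]


def split_train_test(points, train_fraction, rng):
    """Shuffle a copy of the points and split it into training and test parts."""
    shuffled = points[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def retrieval_accuracy(stored, queries, validation_threshold):
    """Fraction of queries whose nearest stored instance is closer than the threshold."""
    hits = sum(1 for q in queries
               if min(math.dist(q, s) for s in stored) < validation_threshold)
    return hits / len(queries)


rng = random.Random(0)
centers = [(0, 0), (5, 5), (0, 5), (5, 0), (2.5, 2.5)]  # assumed cluster centers
points = make_clusters(centers, points_per_cluster=20, half_width=0.5, rng=rng)
train, test = split_train_test(points, train_fraction=0.8, rng=rng)
print(len(train), len(test), retrieval_accuracy(train, test, validation_threshold=0.5))
```

Passing a reduced set in place of train gives the accuracy of that reduced set under the same validation threshold, so the full and reduced sets can be compared on the same test queries.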
Since both approaches require the distance threshold τ as input for compacting the training set, we tried different values of τ. We observed that this method of generating clusters implicitly assumes that the clusters are rectangular with fixed boundaries. Thus, regarding the size of the reduced set, it favors the greedy approach, since irrespective of the initial number of points the cluster boundary is clearly defined and only a few points are needed to cover the entire cluster. This is explained while discussing the results.
Since only the distance information is used in all our experiments, i.e. the actual distribution of instances in the instance space is not needed, we can expect the performance of the approaches to be similar on real datasets.
A sample reduced set obtained using both the greedy and the GA approach is shown in Fig. 5. As indicated earlier, the greedy approach retains fewer points than the genetic algorithm approach.
Fig 5. Visualizing the reduced sets
5.2 Results
The issues which we wanted to study were the following. How do the greedy approach and the GA approach trade off the size and the accuracy of the reduced set? What is the effect of using different distance thresholds for creating and for validating the reduced set? In other words, what happens when a smaller threshold is used for validation than the one given as input to the algorithms? The reason to study this is that fixing the distance threshold is one of the most complex tasks, even for experts in the particular domain. If it is too low, then all the instances will be retained, and if it is too high, then only very few instances will be retained. Thus the size of the reduced set obtained depends very much on the distance threshold value. So we wanted to study what happens when we test the reduced set with a threshold value smaller than the value for which it is actually tuned. Since the GA is based on probabilities, the output need not be the same for different runs with the same input parameters. For each input condition we ran the GA ten times and averaged the results.
The greedy algorithm obtains a smaller reduced set than the GA. As mentioned earlier, this is mainly due to the fact that we assumed that the cluster boundaries are fixed, so adding more points to a cluster only makes it denser and does not affect its boundaries. Also, the genetic algorithm needs to be tuned, i.e. the genetic parameters need to be altered for this particular problem. But the GA approach scores over greedy when the accuracy of the reduced instance set is determined. We discuss the results in the following sections. All the graphs given correspond to the initial set with five clusters.
5.2.1 Effect of size of initial unedited instance set
The graph in Fig 6 plots the size of the reduced set obtained using both the greedy and the genetic algorithm based approaches. The X-axis is the size of the initial full instance set, synthetically generated as described above. The Y-axis is the size of the reduced set obtained. These results were obtained by averaging the results for 10 different initial datasets.
Fig 6. Effect of size of initial unedited instance set on the size of the reduced set
As shown in the plot, the initial size has little effect on the outcome of the greedy algorithm, but the size of the reduced set from the genetic algorithm increases. This confirms that our initial method of building clusters with fixed boundaries favors the greedy approach. But compared with the actual set, the reduced set is much smaller even for the genetic algorithm approach.
5.2.2 Effect of threshold used in validation
The important factor that we wanted to analyze was the sensitivity of the two approaches to the distance threshold value given for compaction of the instance sets. In particular, the effect of using different thresholds for reduction and validation is shown in Fig 7. This particular graph is for the reduced sets obtained from an actual set containing 200 instances. Similar graphs were obtained for actual sets of different sizes. The threshold used in reduction is 0.5; that is, the reduced set is tuned for this value. This can be seen from the graph, as the accuracy of both approaches is 100% for a validation threshold of 0.5 or more. But when the result set is tested with smaller validation thresholds, the accuracy decreases.
Fig 7. Effect of varying threshold during validation
Usually, the threshold given for reduction is approximate, while during actual retrieval the threshold used is much smaller, in order to make sure that the retrieved instance is sufficiently close to the query. The reduced set obtained from the greedy algorithm degrades drastically, whereas the reduced set from the GA based approach does not degrade as much. This shows that the approaches are very sensitive to the user inputs, and that the GA approach, by having more points in the result set, does not degrade as drastically as greedy.
6 Conclusions and Future work
We first showed how the issue of instance reduction can be mapped to a set covering problem. We then demonstrated how the greedy and genetic algorithm based solutions for SCP can be modified for this purpose. We then discussed the results of experiments and observed that a tradeoff is required between size and accuracy. Also, the greedy approach is very sensitive to the distance threshold used for compaction, whereas the GA approach is robust enough that its performance is less sensitive to the threshold.
Fitness evaluation is the most important part, which determines the quality and performance of the result obtained from the GA. Currently, the fitness function takes into account only the number of instances selected. Future work includes investigating ways to obtain a better fitness function. Also, we assumed that the distance threshold is given by the user. The issue of deciding the distance threshold automatically, or of devising a distance-threshold independent strategy, requires further investigation.
References:
[1] Wilson, D. R. and Martinez, T. R., Instance Pruning Techniques, In Proceedings of the 14th International Conference on Machine Learning, pp. 404-411, 1997.
[2] Goldberg, D. E., Genetic Algorithms in Search, Optimization and Machine Learning, Addison Wesley, 1989.
[3] Beasley, D., Bull, D. R. and Martin, R. R., An Overview of Genetic Algorithms: Part I, Fundamentals, University Computing, Vol. 15, No. 2, pp. 58-69, 1993.
[4] Beasley, J. E. and Chu, P. C., A Genetic Algorithm for the Set Covering Problem, European Journal of Operational Research, pp. 392-404, 1996.
[5] Brighton, H. and Mellish, C., Advances in Instance Selection for Instance-Based Learning Algorithms, Data Mining and Knowledge Discovery, Vol. 6, No. 2, pp. 153-172, 2002.
[6] Blum, A. L. and Langley, P., Selection of Relevant Features and Examples in Machine Learning, Artificial Intelligence, 97(1-2), pp. 245-271, 1997.
[7] Kolodner, J., Case-Based Reasoning, Morgan Kaufmann Publishers, 1993.
[8] Smyth, B. and McKenna, E., Modeling the Competence of Case Bases, In Smyth, B., Cunningham, P. (eds.): Advances in Case-Based Reasoning, pp. 208-220, Springer LNAI 1488, Berlin, 1998.
[9] Smyth, B. and McKenna, E., Competence Models and the Maintenance Problem, Computational Intelligence: Special Issue on Maintaining Case-Based Reasoning Systems, Vol. 17, No. 2, pp. 235-249, 2001.