
Nov 7, 2013 (7 years and 9 months ago)


V-Detector: An Efficient Negative Selection
Algorithm with “Probably Adequate” Detector Coverage
Zhou Ji
Columbia University
New York, NY 10032
Dipankar Dasgupta
The University of Memphis
Memphis, TN 38152
This paper describes an enhanced negative selection algorithm (NSA) called
V-detector. Several key characteristics make this method a state-of-the-art
advance in the decade-old NSA. First, an individual-specific size (or matching
threshold) of the detectors is utilized to maximize the anomaly coverage at
little extra cost. Second, statistical estimation is integrated in the detector
generation algorithm so the target coverage can be achieved with a given
probability. Furthermore, this algorithm is presented in a generic form based on
the abstract concepts of data points and matching threshold. Hence it can be
extended from the current real-valued implementation to other problem spaces
with different distance measures, data/detector representation schemes, etc. By
using a one-shot process to generate the detector set, this algorithm is more
efficient than strongly evolutionary approaches. It also includes the option to
interpret the training data as a whole so the boundary between the self and
nonself areas can be detected more distinctly. The discussion is focused on the
features attributed to negative selection algorithms instead of combination
with other strategies.
keywords: negative selection algorithms, artificial immune systems, anomaly
detection, classification, computational intelligence, algorithm
1 Introduction
Negative selection algorithms are inspired by the natural immune system’s
self/nonself discrimination mechanism. They were designed by modeling the
biological process in which T-cells mature in the thymus through being censored
against self cells [14]. It is one of the earliest models of artificial immune
systems (AIS).
In a negative selection algorithm, a collection of detectors, usually called
the detector set, is generated so as not to match any self samples (training
data). The detectors are subsequently used to check whether incoming new data
items are normal (self) or not (nonself). It is typically regarded as an anomaly
detection or a one-class classification method because the training data are
from normal cases only. The common engineering goals in various negative
selection algorithms are: (1) to limit the number of detectors that need to be
generated; (2) to make the detector set cover as many of the anomalies as
possible (ideally all the anomalies); (3) to generate the detector set
efficiently. To a certain extent, all these concerns revolve around the
so-called detector coverage, namely the proportion of the nonself space that is
covered or recognized by the detector set. Two intertwined questions arise at
this point. First, how do we estimate the coverage? It is desirable to know how
effective the detector set is before using it to detect anomalies. Second, how
do we achieve enough coverage with a relatively small number of detectors?
Because the number of detectors is the main factor that decides the performance
of the algorithms, especially during the detection phase, it is desirable to
use as few detectors as possible.
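As an illustration of the generate-then-detect scheme just described, the following Python sketch implements a plain generation-elimination negative selection with constant-sized detectors in the unit square. It is a minimal sketch under stated assumptions, not V-detector itself: the function names, the Euclidean matching rule, the fixed radius and the toy self samples are all illustrative choices.

```python
import math
import random


def generate_detectors(self_samples, num_detectors, radius, dim=2, seed=0,
                       max_tries=100_000):
    """Generation-elimination: keep random candidates that match no self sample."""
    rng = random.Random(seed)
    detectors = []
    for _ in range(max_tries):
        if len(detectors) == num_detectors:
            break
        candidate = tuple(rng.random() for _ in range(dim))
        # Assumed matching rule: a candidate matches a self sample when it
        # lies within `radius` of it (Euclidean distance).
        if all(math.dist(candidate, s) > radius for s in self_samples):
            detectors.append(candidate)
    return detectors


def is_nonself(point, detectors, radius):
    """A point is flagged as nonself when at least one detector matches it."""
    return any(math.dist(point, d) <= radius for d in detectors)


# Hypothetical toy data: three self samples near the center of the square.
self_samples = [(0.5, 0.5), (0.52, 0.48), (0.47, 0.51)]
detectors = generate_detectors(self_samples, num_detectors=200, radius=0.1)
print(len(detectors), is_nonself((0.5, 0.5), detectors, radius=0.1))
```

Detection cost grows with the number of detectors because every incoming point is compared against the whole set, which is why the number of detectors matters most in the detection phase.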
From both ends, V-detector [23,24], as introduced in this paper, handles
the issues with innovative and efficient techniques. Furthermore, this method
is potentially very useful when near-perfect coverage is not necessary and
alternative estimation is inaccurate. The rest of this paper is organized as
follows. In section 2, we briefly review previous research efforts to address
the issue of detector coverage and the related works that V-detector is based
on. In section 3, we describe the algorithm in detail. Section 4 uses
experimental results to illustrate the properties of V-detector and to discuss
the applicability of this method. Lastly, the conclusion is summarized in
section 5.
2 Related Works
2.1 Knowledge of Detection Coverage in Negative Selection Algorithms
Ever since the negative selection algorithm was first proposed, detector
coverage has been a major concern. It is desirable but never trivial to
determine the coverage quantitatively for a specific negative selection
algorithm, or to decide the necessary number and distribution of detectors for
a given coverage. Statistical methods were used in several works. D’haeseleer
et al [10] did a thorough probability analysis of the relation between the
number of detectors and the probability that a random anomaly can be detected.
It used matching probability or failure probability to decide or evaluate the
number of detectors. In some later works on negative selection algorithms,
similar analysis was included as part of the theoretical support [1,11,12].
Some other works focused on different aspects, e.g. a lower bound for the fault
probability [31], but used statistics from the same point of view.

Implementation and more information about V-detector can be found at http://
For binary or finite-alphabet string representation, such analysis is relatively
easy to carry out. When the negative selection algorithm was extended to
different data representations, probability could no longer always be computed
by straightforward combinatorics. Real-valued representation plays a unique role
in many applications that cannot be represented effectively in binary form. For
problems with a natural real-valued representation, it is easier to interpret
the output, and maintaining affinity in the representation space usually results
in a more stable algorithm. Coverage in such cases is relatively less explored
because the search space is continuous and hard to analyze by enumerative
combinatorics.
In real-valued negative selection algorithms, the detectors are usually
represented as hyperspheres, hyper-rectangles, or hyperellipsoids. They are
generated by either the original generation-elimination method or various other
methods. Gonzalez et al [17] used a random process to generate and redistribute
detectors so the total overlap can generally be minimized. In another work by
Dasgupta et al [6], the detectors are represented as rules, in fact rectangular
areas in a multi-dimensional real space, and generated by a genetic algorithm.
A framework of a multi-level learning algorithm (MILA) [8] consists mainly of
negative selections in real-valued representation using matching rules defined
in lower dimensional sub-spaces. Another multi-level AIS compared binary and
real-valued representation [2]. Some algorithms involve resizing and
redistribution of the initial detectors [7].
In real-valued representation, Gonzalez et al [17] successfully used a Monte
Carlo method to estimate the volume of the self region and then decided the
number of detectors based on the estimate. That method is more sophisticated
than the analysis in binary representation in the sense that (1) the proportion
of possible self samples is evaluated in a probabilistic way instead of
deterministically and (2) the geometry and assumed distribution of the
detectors have to be taken into consideration to carry out the analysis.
Nevertheless, these seemingly different analyses tackled the problem with a
similar approach, namely, to determine the number of detectors that is
considered enough before the detector set is generated. They were oriented to a
general analysis of the relation between the number of detectors and the
coverage; a given detector set was not in question. A drawback is that
eventually we have no direct knowledge of the detector coverage of the actual
detectors that are generated. Moreover, the link from the volume of the self or
nonself region to the number of detectors is not only an intuitive estimate,
but also depends on an assumption about the detector distribution that may be
very different from the actual detector set. For binary representation, the
distribution is usually assumed to be uniform, which is unrealistic considering
the specific application and the detector generation algorithm. For the
real-valued case [17], the distribution is assumed to follow a simple geometric
pattern so the overlap can barely be estimated, though still
V-detector [24] approaches the issue from a different angle. It estimates the
coverage of the actual set of detectors, which means (1) whatever information
we obtain reflects the actual coverage instead of an estimate based purely on
the number of detectors; (2) no assumption is needed about the distribution of
the detectors. As will be shown in the algorithm details described in section 3,
it has the further advantage that the geometry of the detectors does not
matter, so there is no difficulty in using it with different representations of
detectors. While the original experiments were done in real-valued space, it is
possible to implement a similar mechanism in other representations, e.g. the
popular binary or other finite-alphabet string representation. More generally,
it can be used in any detection mechanism as long as an individual point can be
verified to be recognized or not, even though there is no explicit algorithm to
evaluate the rate of detection. The same methodology also applies to different
algorithms in which a similar issue of proportion estimation exists.
2.2 Detectors with Individual Properties
In a negative selection algorithm, detectors are essentially artificial nonself
samples with a match threshold. In the original form, detectors differ from one
another only in their location in the search space. The rationale is simply
that we assume a uniform matching criterion should be used throughout the
space.
Later on, some works took up detectors with variable properties, especially
variable size, for various reasons. Detectors of more than one size are allowed
in Branco et al’s work [3] as a secondary complement to improve the detection
performance. The rectangular detectors generated by a genetic algorithm [6] are
naturally variable in size. The method by Dasgupta et al [7] resizes and moves
the detectors during the training stage to increase coverage.
V-detector extends the idea of variable-sized detectors to be more general
and simpler. Although the ‘V’ in V-detector came from the word ‘variable’,
‘individual-specific size’ is a more accurate description. The scheme simply
assigns each detector its own size or matching rule. The sizes are determined
individually when the detectors are generated. An important fact is that this
size property adds very little cost to the implementation but provides a
significant advantage in coverage compared with most of the earlier NSAs
reported. The purpose is clearly to maximize the detector coverage, or
equivalently to avoid any unnecessary detectors. It also eliminates the need to
redistribute detectors for the purpose of minimal overlap. In the current
version, only variable size is considered, but the strategy can be extended to
other variable properties. Even hybrid matching rules can be introduced as a
form of variable property to maximize the coverage. Generally, the additional
cost of allowing a variable property is limited to the representation of each
detector.
2.3 Statistical Inference
Although V-detector estimates a different parameter, namely the detector
coverage, from Gonzalez et al’s work [17], which estimates the volume of the
self region, they are based on the same statistical principle. The percentage
of a certain category in the sample can be used to estimate the proportion in
the entire population. Considering a collection of random points in the nonself
region, the V-detector algorithm uses the percentage of the points that are
covered by the detectors to estimate the covered proportion of the entire
nonself region. However, an estimate of the proportion value itself does not
tell us how much or how likely the estimate may differ from the real population
proportion, in our case the actual detector coverage. To draw a more meaningful
conclusion, we should construct a confidence interval based on the central
limit theorem [21,13,20]. In other words, although we do not get the same
percentage every time if we repeat sampling of a fixed size, the distribution
of these percentages is close to a normal distribution.
The central limit theorem justifies using a normal distribution as an
approximation for the distribution of $\bar{x}$ when n is sufficiently large.
There are two apparent sources of error in using the normal distribution as an
approximation of the binomial distribution:
1. The normal distribution is always symmetrical; the binomial distribution
is symmetric only if the probability of one outcome, p, is 0.5.
2. The normal distribution is continuous; the binomial distribution is
discrete.
A rule of thumb taking into account both the problems of asymmetry and
discreteness is to use the normal distribution approximation only if np > 5,
n(1 − p) > 5 and n > 10.
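The rule of thumb can be written as a one-line check. The sketch below is only an illustration; the function name and the example values of n and p are assumptions chosen to show how a large p, such as the 99% coverage considered later, forces a larger sample.

```python
def normal_approximation_ok(n, p):
    """Rule of thumb from the text: np > 5, n(1 - p) > 5 and n > 10."""
    return n * p > 5 and n * (1 - p) > 5 and n > 10


# Illustrative sample sizes and proportions.
for n, p in [(100, 0.5), (100, 0.99), (1000, 0.99)]:
    print(n, p, normal_approximation_ok(n, p))
```

With p = 0.99 the binding condition is n(1 − p) > 5, so a sample of 100 points is too small while 1000 points is enough.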
There are alternative distributions that can be used to deal with the
asymmetry problem, and there are mathematically strict corrections for
discontinuity too. Even though these are not the main concerns in the
application in question, the issue of asymmetry is in fact not negligible.
Because the proportion of covered nonself points, p, is the variable to be
considered, it is very likely that we need to consider a large p, for example,
90% or 99%. Fortunately, we can circumvent the issue with a proper strategy in
our proposed algorithm, considering the fact that we care more about having
enough coverage than about its exact value.
2.3.1 Confidence interval
The confidence level is the probability that the population parameter falls
within the range, called the confidence interval, around the sample parameter
[13,21]. For a normal distribution, we have

$$\hat{p} - E < p < \hat{p} + E, \tag{1}$$

where p is the population parameter, $\hat{p}$ is the sample statistic, and

$$E = z_{\alpha/2}\,\sigma \tag{2}$$

is the margin of error. σ is the standard error of the sample statistic. For a
proportion, $\sigma = \sqrt{pq/n}$ with $q = 1 - p$, which can be estimated as

$$\sigma = \sqrt{\frac{\hat{p}\hat{q}}{n}}, \tag{3}$$

where $\hat{q} = 1 - \hat{p}$ and n is the sample size. In the case of
estimating detector coverage, we are more interested in making a conclusion
about the lower limit of coverage, $p > p_{min}$, where $p_{min}$ is the minimum
coverage we can presume with some certainty. So we can use a one-sided
confidence interval

$$p > \hat{p} - E, \tag{4}$$

$$E = z_{\alpha}\,\sigma. \tag{5}$$

To ensure the assumption that the binomial random variable is approximately
normally distributed with mean $\mu = np$ and standard deviation
$\sigma = \sqrt{npq}$, we should have np ≥ 5, nq ≥ 5.
In Equation (2), $z_{\alpha/2}$ is the z score for a two-sided confidence level
of 1 − α: the positive standard z value that separates an area of α/2 in the
right tail of the standard normal distribution curve. For a standard normal
distribution, the probability of $-z_{\alpha/2} \le x \le z_{\alpha/2}$ is
$P(-z_{\alpha/2} \le x \le z_{\alpha/2}) = 1 - \alpha$. The probability of
$x \le z_{\alpha/2}$ is $P(x \le z_{\alpha/2}) = 1 - \alpha/2$. Similarly,
$z_{\alpha}$ in Equation (5) is where $P(x \le z_{\alpha}) = 1 - \alpha$.
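To make the one-sided interval of Equations (4) and (5) concrete, the sketch below computes the lower limit of coverage from a sample count. It is a minimal illustration that assumes the standard error of a proportion is estimated as sqrt(p̂ q̂ / n); the function name and the example numbers (950 covered points out of 1000) are hypothetical.

```python
import math
from statistics import NormalDist


def coverage_lower_bound(covered, n, alpha=0.05):
    """One-sided lower limit p_hat - z_alpha * sigma for a sample proportion."""
    p_hat = covered / n
    sigma = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # P(x <= z_alpha) = 1 - alpha
    return p_hat - z_alpha * sigma


# Hypothetical sample: 950 of 1000 random nonself points are covered.
print(coverage_lower_bound(950, 1000, alpha=0.05))
```

With these numbers the point estimate is 0.95, and the lower limit at α = 0.05 falls slightly below 0.94, which shows how the margin of error discounts the raw percentage.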
2.3.2 Hypothesis Testing
Hypothesis testing is another way of statistical inference, also based on
Equation (4). It fits our purpose better because the goal here is to decide
when to stop generating or including more detectors. In conducting a
statistical hypothesis test, we need to identify the null hypothesis. We assume
that a Type I Error (rejecting a true null hypothesis) is more costly than a
Type II Error (accepting a false null hypothesis).
The normal procedure of hypothesis testing involves the following steps:
1. State the null hypothesis and the alternative hypothesis. The null
hypothesis is the statement that we would rather take as true if there is not
strong enough evidence showing otherwise.
2. Determine the cost associated with the two types of decision-making errors.
3. Choose the significance level, α. That is the maximum probability we are
willing to accept of making a Type I Error. Typical values are 0.05 or 0.01.
4. Collect the data and compute the sample statistic. To test based on
proportion we can use the z score
$$z = \frac{\hat{p} - p}{\sqrt{pq/n}}$$
5. Reject or accept the null hypothesis. The traditional method is to check
whether the test statistic is in the critical region ($z > z_{\alpha}$) or not.
If $z > z_{\alpha}$, we reject the null hypothesis. An alternative way is to
use a p-value test, which is easier [21].
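Steps 4 and 5 can be sketched in a few lines of Python. This is an illustration only: it assumes the usual one-sided z test for a proportion, with z = (p̂ − p)/sqrt(pq/n), and the function name, the hypothesized proportion `p0` and the sample counts are hypothetical.

```python
import math
from statistics import NormalDist


def proportion_z_test(covered, n, p0, alpha=0.05):
    """One-sided z test of the null hypothesis "the proportion is below p0".

    Returns the z score, the one-sided p-value and whether the null
    hypothesis is rejected at significance level alpha.
    """
    p_hat = covered / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value
    p_value = 1 - NormalDist().cdf(z)          # area to the right of z
    return z, p_value, z > z_alpha


# Hypothetical data: 96 of 100 sample points fall in the category of interest.
print(proportion_z_test(96, 100, p0=0.9))
```

The critical-region check (z > z_α) and the p-value check (p-value < α) always lead to the same decision; the p-value simply reports how far inside or outside the critical region the statistic lies.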
2.4 Inspiration from Learning Theory
Statistical inference does not tell us the exact value of the detector
coverage. As we will see later, it tells the probability that the approximation
of coverage is good enough. This idea of “Probably Adequate” becomes more
comprehensible when we look into similar concepts in machine learning. One of
the major models of computational learning theory is Probably Approximately
Correct learning (PAC learning) [5,19]. In terms of PAC learning, successful
learning of an unknown target concept should entail obtaining, with high
probability, a hypothesis that is a good approximation of it. Accuracy, or how
good the approximation is, is described by ε: the hypothesis returned, h, should
satisfy error(h) ≤ ε. Confidence, or the chance that we correctly obtain the
hypothesis h, is described by σ: the probability of returning h is at least
1 − σ. h is parallel to the target coverage.
Learning theory such as PAC learning could provide guidance for the
development of negative selection algorithms. Although analyzing a specific
negative selection algorithm, e.g. V-detector, may not be a straightforward
task, it should be potentially helpful to analyze whether the problem that is
solved by a negative selection algorithm, or more particularly by V-detector,
is PAC learnable or not. As the first step leading to formalization of the
problem, regardless of the algorithm that may be used to solve it, we should
clarify the basic assumptions about the training data our algorithms take. Some
of the previous works in this area, especially those which were not based on
binary representation and did not assume all self features are present in the
training data, are hard to compare with one another side by side due to the
lack of equivalent assumptions.
In the following analysis, we assume
• Both self and nonself points appear in some bounded n-dimensional real
space. For simplicity, let us assume it is $[0,1]^n$.
• Some finite number of self samples are provided as input. They are
randomly distributed over the self region.
• The training data is noise free, meaning all the self samples are real self
points. This is not necessary in principle, but is used to simplify the
discussion.
• To evaluate the detection performance, the testing data are a finite number
of random points over the entire space in question described above. Each
of those points can be verified to be self or nonself.
3 Algorithm
3.1 Coverage - Proportion - Probability
Definition 1 The detector coverage of a given detector set is defined as the
ratio of the volume of the nonself region that can be recognized by any detector
in the detector set to the volume of the entire nonself region.
Generally, it can be written as
$$p = \frac{\int_{x \in D} dx}{\int_{x \in S} dx},$$
where S is the set of nonself points and D is the set of nonself points that
are recognized by the detectors. In the case of 2-dimensional continuous space,
it reduces to the ratio of the area covered to the area of the entire nonself
region:
$$p = \frac{\iint_{D} dx\,dy}{\iint_{S} dx\,dy}.$$
If the space in question is discrete and finite, it can be rewritten as
$$p = \frac{|D|}{|S|},$$
where |A| denotes the cardinality of a set A.
Figure 1 illustrates the three regions in question: the self region, the
covered nonself region, and the uncovered nonself region in a 2-D diagram. The
area without hatched shade on the right side of the diagram is S and the dotted
area covered by circular detectors is D.
Figure 1: Negative selection algorithm using real-valued representation:
Different regions (labels: self region, nonself region, covered nonself region)
In statistical terms, the points of the nonself region are our population.
Generally speaking, the population size is infinite. Whether a point is covered
by the detectors is a binary outcome, so the number of covered points in a
random sample follows a binomial distribution. The detector coverage is the same
as the proportion of the covered points, which equals the probability that a
random point in the nonself region is a covered point. Assuming all the points
from the entire nonself region are equally likely to be chosen in a random
sampling process, the probability of a sample point being recognized by the
detectors is thus equal to p. For a sample of fixed size, the proportion of
covered points is
$$\hat{p} = \frac{|\hat{D}|}{|\hat{S}|},$$
where $\hat{S}$ is the sample and $\hat{D}$ is the set of sample points that
are recognized by the detectors. $|\hat{S}|$ is thus the sample size. $\hat{p}$
is the sample statistic that is the point estimate of the population
proportion.
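The point estimate can be obtained by direct sampling. The following sketch assumes hyperspherical detectors stored as (center, radius) pairs in the unit square and a caller-supplied self test; these representations, the function names and the toy self region are illustrative assumptions.

```python
import math
import random


def estimate_coverage(detectors, is_self, n_samples=10_000, dim=2, seed=0):
    """Point estimate: covered nonself sample points / nonself sample points."""
    rng = random.Random(seed)
    nonself = covered = 0
    while nonself < n_samples:
        x = tuple(rng.random() for _ in range(dim))
        if is_self(x):
            continue  # the population consists of nonself points only
        nonself += 1
        if any(math.dist(x, center) <= radius for center, radius in detectors):
            covered += 1
    return covered / nonself


# Hypothetical self region: a disc of radius 0.2 at the center of the square.
def in_self_disc(x):
    return math.dist(x, (0.5, 0.5)) <= 0.2


# A single hypothetical detector in one corner of the square.
print(estimate_coverage([((0.0, 0.0), 0.3)], in_self_disc))
```

Only the ability to test whether an individual point is covered is needed here; the shape of the detectors never enters the estimate.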
3.2 Integration of Hypothesis Testing and Detector Generation
The main idea of this method is to finish generating detectors when the coverage
is close enough to the target value. This contrasts with other works that rely
on the number of detectors to provide enough coverage.
The original V-detector [23] has a simple estimate to stop the detector
generation procedure. Random points are generated to be detector candidates. If
a candidate is a nonself point but not covered, a new detector is generated on
it. If it is a covered nonself point, it is discarded as a candidate but the
attempt is recorded in a counter which will be used to estimate the coverage.
If the counter of consecutive attempts that fall on covered points reaches a
limit m, the generation stage finishes with the belief that the coverage is
large enough. m is not preset; it is decided by the target coverage:
$$m = \frac{1}{1 - \alpha}, \tag{6}$$
where α is the target coverage, a control parameter. Equation (6) is explained
as follows. If there is 1 uncovered point in a sample of size $m'$, the point
estimate of the proportion of the uncovered region is $1/m'$, and the estimate
of coverage is
$$\alpha' = 1 - \frac{1}{m'}.$$
If in fact there are 0 uncovered points in a sample of size $m'$, we have a
better than average chance that the actual coverage is larger than $\alpha'$.
Because m is decided by Equation (6), when we see m consecutive points that are
all covered, we can estimate that the actual coverage is more likely to be at
least α. As mentioned before, that is based on point estimation without a
confidence interval. Compared with the new algorithm, we call that method the
“naïve estimate”.
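The stopping rule based on Equation (6) can be sketched as follows. This is a hedged reading of the description above, not the original implementation: the Euclidean distance, the self radius, the rule for choosing each radius (distance to the nearest self sample minus the self radius), the handling of self points (skipped without touching the counter) and all names are assumptions.

```python
import math
import random


def consecutive_limit(target_coverage):
    """m = 1 / (1 - alpha) from Equation (6), rounded to an integer."""
    return round(1 / (1 - target_coverage))


def naive_v_detector(self_samples, self_radius, target_coverage=0.99,
                     dim=2, seed=0, max_detectors=1000):
    """Generate (center, radius) detectors until m consecutive covered hits."""
    rng = random.Random(seed)
    m = consecutive_limit(target_coverage)
    detectors = []
    consecutive_covered = 0
    while consecutive_covered < m and len(detectors) < max_detectors:
        x = tuple(rng.random() for _ in range(dim))
        nearest = min(math.dist(x, s) for s in self_samples)
        if nearest <= self_radius:
            continue  # self point: not a detector candidate (assumption)
        if any(math.dist(x, c) <= r for c, r in detectors):
            consecutive_covered += 1  # covered nonself point: count it
            continue
        consecutive_covered = 0       # uncovered point: place a detector here
        detectors.append((x, nearest - self_radius))
    return detectors


# Hypothetical self samples and self radius.
samples = [(0.5, 0.5), (0.55, 0.5)]
print(len(naive_v_detector(samples, self_radius=0.05, target_coverage=0.9)))
```

For a target coverage of 0.99 the limit is m = 100, so generation stops after 100 consecutive random nonself points are found to be covered.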
To extend to stricter statistical inference, estimating with a confidence
interval directly does not fit the problem as well as hypothesis testing because
our goal is to decide whether or not to add more detectors. What makes this
paper’s method different from traditional statistical inference is that the
testing can be done as part of the detector generation algorithm. Although it
may be implemented as a relatively independent module, we still have to face
a dilemma: the detector coverage, the proportion to be estimated, is actually
changing during the detector generation. So we need to design a process in
which the hypothesis testing happens only when we temporarily stop adding new
detectors. Otherwise, the testing will be meaningless. At the same time, we also
try to reuse the random samples we use in hypothesis testing as the candidate
detectors. This doubles the advantage of integrating hypothesis testing in
V-detector.
In the case of estimating coverage, the null hypothesis would be “The coverage
of the nonself region by all the existing detectors is below percentage
$p_{min}$.” If we accept the null hypothesis, we would include more detectors.
If the null hypothesis is actually false, the cost of a Type II Error would be
more unnecessary detectors. On the other hand, if we reject the null hypothesis
by mistake, we would end up with coverage lower than claimed. The latter, the
so-called Type I Error, is exactly our concern. The significance level α is the
maximum acceptable probability that we make a Type I Error, that is, end up
with fewer detectors than needed. We need a fixed sample size to do the
hypothesis testing. If the conclusion is that we need more detectors, we take
all the uncovered sample points to make new detectors. This largely saves the
cost of the entire algorithm.
Figure 2 shows the diagram of the modified V-detector that uses hypothesis
testing to estimate the detector coverage.
To guarantee that the assumption np ≥ 5 and nq ≡ n(1 − p) ≥ 5 is valid, we can
choose the sample size by
n > max(5/p, 5/(1 − p)).
If x points are covered, $\hat{p} = x/n$, where n is the sample size, and we
have
$$z = \frac{\hat{p} - p}{\sqrt{pq/n}} = \frac{x - np}{\sqrt{npq}}.$$
During the procedure of testing more points, x will either increase (when the
point is covered) or stay unchanged (when the point is uncovered). So does z.
Before the procedure finishes for all n points, if z based on the tested points
is larger than $z_{\alpha}$, it is enough to reject the null hypothesis and
claim enough coverage. At that point, the test can be stopped. Because the
ultimate conclusion from the procedure is either rejection or acceptance of the
null hypothesis, not the estimate of p and a confidence interval, it is not
necessary to finish all n points to get a “better” answer.
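A small worked example shows when the test can stop early. The sketch assumes the z score takes the form z = (x − np)/sqrt(npq) with the full sample size n, which follows from the z score of Section 2.3.2 with p̂ = x/n; the function names and the values n = 100, p = 0.9 are illustrative.

```python
import math
from statistics import NormalDist


def z_score(x, n, p):
    """z = (x - n p) / sqrt(n p q) for x covered points, q = 1 - p."""
    return (x - n * p) / math.sqrt(n * p * (1 - p))


def min_covered_to_reject(n, p, alpha):
    """Smallest count of covered points at which z exceeds z_alpha."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    for x in range(n + 1):
        if z_score(x, n, p) > z_alpha:
            return x
    return None  # rejection is impossible with this n, p and alpha


print(min_covered_to_reject(100, 0.9, alpha=0.05))  # 95
print(min_covered_to_reject(100, 0.9, alpha=0.01))  # 97
```

With n = 100 and p = 0.9, the standard deviation is 3, so z = (x − 90)/3. At α = 0.05 the test can stop as soon as 95 covered points have been seen, even if fewer than 100 points have been tested.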
If the assumption nq > 5 is in fact invalid because the real p is larger
than the p we used, then the actual coverage is more than what we want to test.
Our confidence in the coverage is not compromised in this case. If the
assumption np > 5 is in fact invalid because p is so small, the hypothesis test
will pass only when it could pass a test using the actual non-normal
distribution. Because the probability curve skews to the left side (origin
side), $z_{\alpha}$ of such a distribution would be smaller than $z_{\alpha}$
of the normal distribution. If z does not pass this skewed critical value,
written $z'_{\alpha}$ here, it will not pass the normal distribution’s
$z_{\alpha}$ either: $z \le z'_{\alpha} \le z_{\alpha}$.
Figure 2: V-detector generation algorithm with statistical estimate of coverage
(flowchart steps: choose p and α; choose sample size n > max(5/p, 5/(1 − p));
N = 0, x = 0; sample a point; N = N + 1; save the candidate; x = x + 1; compute
z; z > z_α?; N = n?; end: enough coverage; accept all saved as new detectors)
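One possible reading of the procedure in Figure 2 is sketched below in Python. It is an illustration under stated assumptions rather than the authors' implementation: self points are skipped without being counted, each detector radius is the distance to the nearest self sample minus a self radius, distances are Euclidean in the unit square, and every name and default value is hypothetical.

```python
import math
import random
from statistics import NormalDist


def choose_sample_size(p):
    """An integer strictly above max(5/p, 5/(1-p)); may exceed the minimum by one."""
    return math.ceil(max(5 / p, 5 / (1 - p))) + 1


def v_detector(self_samples, self_radius, p=0.9, alpha=0.05, dim=2,
               seed=0, max_rounds=1000):
    """Add detectors round by round until the coverage test rejects H0."""
    rng = random.Random(seed)
    n = choose_sample_size(p)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    sd = math.sqrt(n * p * (1 - p))
    detectors = []
    for _ in range(max_rounds):
        tested = covered = 0
        candidates = []  # uncovered sample points saved during this round
        while tested < n:
            point = tuple(rng.random() for _ in range(dim))
            nearest = min(math.dist(point, s) for s in self_samples)
            if nearest <= self_radius:
                continue  # self point: not part of the nonself sample
            tested += 1
            if any(math.dist(point, c) <= r for c, r in detectors):
                covered += 1
                if (covered - n * p) / sd > z_alpha:
                    return detectors  # reject H0: enough coverage
            else:
                candidates.append((point, nearest - self_radius))
        # H0 accepted: the detector set changes only between tests.
        detectors.extend(candidates)
    return detectors


# Hypothetical self samples.
samples = [(0.5, 0.5), (0.52, 0.47), (0.45, 0.55)]
print(len(v_detector(samples, self_radius=0.05)))
```

Because z is at most sqrt(n(1 − p)/p), reached when all n points are covered, a very small α combined with the smallest admissible n can make rejection unattainable; `max_rounds` bounds the loop in that case, and a larger n avoids it.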
3.3 Using Detectors of Variable Size
In V-detector, the sizes of detectors are not pre-defined by the matching
threshold as in most real-valued negative selection algorithms. Instead, the
radius of each detector is decided individually, simply as the maximum value
that does not match any self points. In fact, this makes use of the key idea of
the original negative selection algorithm and allows each detector to achieve
the maximum coverage. Because each detector is implemented as a center point
and its radius, there is no difference in computational cost between a large
detector and a small detector. Therefore, we are not concerned about the
overlap of the detectors. As long as a detector makes a contribution to the
total coverage, namely the part that is not overlapped, it is as useful as any
other detector.
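The radius rule can be illustrated with a short sketch. It assumes Euclidean distance and an optional self radius around each self sample; both assumptions and the function names are illustrative.

```python
import math


def detector_radius(center, self_samples, self_radius=0.0):
    """Largest radius at which the detector still matches no self sample.

    A non-positive result means `center` lies inside the self region.
    """
    nearest = min(math.dist(center, s) for s in self_samples)
    return nearest - self_radius


def matches(point, center, radius):
    """Detection costs one distance comparison whatever the radius is."""
    return math.dist(point, center) <= radius


# Hypothetical self samples; the nearest one is at distance 0.5 from the origin.
r = detector_radius((0.0, 0.0), [(0.3, 0.4), (1.0, 1.0)], self_radius=0.1)
print(r, matches((0.1, 0.1), (0.0, 0.0), r))
```

Storing one extra number per detector is the whole cost of the variable size, which is why a large detector is no more expensive to evaluate than a small one.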
Figure 3 illustrates the core idea of variable detectors in 2-dimensional space.
The dark grey area represents the actual self region, which is usually given
through the training data (self samples). The light grey circles are the
possible detectors covering the nonself region. Figure 3(a) shows the case where
the detectors are of constant size. In this case, a large number of detectors
are needed to cover the large area of nonself space. The well-known issue of
“holes” is illustrated in black. In Figure 3(b), using variable-sized detectors,
the larger area of nonself space can be covered by fewer detectors, and at the
same time, smaller detectors can cover the holes. Since the total number of
detectors is controlled by using the large detectors, it becomes more feasible
to use smaller detectors when necessary.
Figure 3: Main concept of detectors with variable properties. (a) Constant-sized
detectors (b) Variable-sized detectors
Another advantage of this new method is that it facilitates the use of the
statistical inference described above.
It can be further extended to variable matching rules or at least different
distance measures. That would be an easy way to realize detectors of different
geometric shapes pursued by many other works.
3.4 Boundary-Aware Algorithm
What exactly does a self sample or a collection of self samples mean in terms of
presenting or defining the self region? This needs to be answered first before
we judge any soft computing algorithm for anomaly detection. Regardless of the
specific algorithm or scheme of negative selection, what a self sample means
eventually comes down to the matching rules.
Fig. 4 shows how we may interpret a self sample in different ways. In
Fig. 4(a), we assume that a circular area around the self (normal) sample is
entirely normal. It is a straightforward way to achieve generalization, similar
in principle to partial matching in binary representation. We call an
interpretation that allows much variability a “conservative interpretation”,
referring to the “conservative” attitude toward claiming a new sample to be
abnormal. In Fig. 4(b), we only consider the exact point of the self sample to
be normal. In reality, we may still allow some small deviation because we have
to compare floating-point numbers, but basically we only regard the samples we
have already seen as normal. We call an interpretation that doesn’t allow much
variability an “aggressive interpretation”.
Figure 4: Possible interpretations of a single self sample: (a) Conservative
(b) Aggressive (labels: self sample, “self radius”, abnormal region)
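The two interpretations differ only in the self radius used by the matching rule. The sketch below assumes a Euclidean self test; the function name, the sample point and the two radii are illustrative values.

```python
import math


def is_self(point, self_samples, self_radius):
    """A point is taken as normal when it lies within self_radius of a sample."""
    return any(math.dist(point, s) <= self_radius for s in self_samples)


samples = [(0.5, 0.5)]
new_point = (0.55, 0.5)

# Conservative: a sizeable neighborhood of each sample counts as normal.
print(is_self(new_point, samples, self_radius=0.1))
# Aggressive: only (numerically) the sample itself counts as normal.
print(is_self(new_point, samples, self_radius=1e-9))
```

The tiny radius in the aggressive case stands in for the small tolerance needed when comparing floating-point numbers.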
At first glance, it seems that in an extremely aggressive interpretation like
Fig. 4(b), no generalization could happen. That doesn’t have to be the case.
Fig. 5 shows a group of three self sample points. Even if we do not take any
circular surrounding area of a single self sample as normal, we can still
generalize to a self region by considering the neighboring self points together,
as shown in Fig. 5(c). Compared with Fig. 5(a) or (b), this is more aggressive
in detecting anomalies, but only outside the perceived “self region”.
Figure 5: Possible interpretations of a group of self samples: (a) Large
threshold (b) Small threshold (c) As a collection (labels: self sample, abnormal
region)
Naturally, each self sample point can be interpreted as evidence that its
vicinity is self region. On the other hand, we can fairly assume that the self
samples can be drawn anywhere over the entire self region. There is no reason
to exclude the points that are close to the boundary between self and nonself
regions no matter what kind of matching rule or distance measure is used.
Fig. 6 illustrates the “boundary dilemma”, the scenario in which the self
samples close to the boundary inevitably extend the actual self region due to
the variability allowed by the algorithm. In this figure, the shaded area is the
“real” self region; the dots are the self samples and the circles are their
generalization. If the self threshold is too small, the space between self
samples cannot be represented. In other words, more samples are needed to train
the system properly. On the other hand, if the self threshold is large, the
false self region represented by the boundary samples may be too large to
accept.
Figure 6: “Boundary Dilemma”
In the case that the over-covered area is too large compared with the real
nonself region, the error would be large. When the nonself region is a thin
stripe between two self regions, it may not be representable at all. In those
cases, the issue of boundary dilemma will be more significant.
The issue described above is tackled by an ingeniously simple strategy that
uses a negative selection algorithm to achieve the interpretation illustrated
in Fig. 5(c).
The above discussion concerns the general interpretation of self samples. From
the viewpoint of a negative selection algorithm, the difference in
interpretation is shown in Fig. 7. Fig. 7(a) shows the coverage of a detector
set using a conservative interpretation; Fig. 7(b) shows one using an extremely
aggressive interpretation. Similar to the conceptual discussion, it is possible
to generalize the self samples to a finite self region even if detection is
extremely aggressive outside the self region. This is illustrated in Fig. 7(c).
Figure 7: Detectors enclosing the perceived “self region”: (a) Conservative
(b) Aggressive (c) Boundary-aware (labels: self sample, “self radius”)
Therefore, we end up with two versions of the V-detector algorithm. One treats
each training data point (self sample) individually [23]. We call it point-wise
V-detector. The other brings out a new advantage of the negative selection
algorithm: it is able to detect the boundary of the self region. We call it
boundary-aware V-detector [22].
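As a rough illustration of the point-wise interpretation, the sketch below generates variable-sized detectors in the unit square. It is a simplified sketch and not the authors' implementation: the function names, the fixed candidate budget, and the rule that a detector's radius equals its distance to the nearest self sample minus the self radius are assumptions made for illustration, and the coverage-based termination is omitted.

```python
import math
import random


def nearest_self_distance(point, self_samples):
    """Euclidean distance from point to the closest self sample."""
    return min(math.dist(point, s) for s in self_samples)


def generate_pointwise_detectors(self_samples, self_radius, n_candidates, rng):
    """Return a list of (center, radius) detectors.

    Assumption: each self sample is treated individually as a disc of radius
    self_radius, and a detector grows until it touches the nearest such disc.
    """
    detectors = []
    for _ in range(n_candidates):
        candidate = (rng.random(), rng.random())
        # Skip candidates already covered by an existing detector.
        if any(math.dist(candidate, c) <= r for c, r in detectors):
            continue
        radius = nearest_self_distance(candidate, self_samples) - self_radius
        if radius > 0:  # the candidate lies outside every self disc
            detectors.append((candidate, radius))
    return detectors


def is_nonself(point, detectors):
    """A point is flagged as nonself if any detector covers it."""
    return any(math.dist(point, c) <= r for c, r in detectors)
```

A call such as `generate_pointwise_detectors([(0.5, 0.5)], 0.1, 200, random.Random(0))` returns detectors of different sizes, none of which covers the self sample itself.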
4 Experiments and Discussion
To understand the behavior of the algorithm described in the previous section,
experiments were carried out using 2-dimensional synthetic data. Over the unit
square $[0,1]^2$, various shapes are used as the ‘real’ self regions in these
experiments. They belong to one of the six types listed in Table 1, which also
shows the geometric parameters that extend each type to different sizes or
variations. Figure 8 shows the basic shapes of the six types of self region.
Table 1: Shapes of self area
Type of Shape / Geometric Parameters
thickness and location of the cross
size (radius of circumscribed circle)
outer and inner radius
cross size and location, circle radius
size (radius of circumscribed circle)
Figure 8: Different types of shape
A fixed number of random points from the self region are used as the self
samples to generate the detector set. A separate set of random points, some
self and some nonself, is used to test the detection performance of the
detector set. Figure 9 shows examples of training data (self samples) and test
data: 9(a) is a self sample of 100 points; 9(b) is a self sample of 1000
points; 9(c) is 1000 test data points including both self and nonself points.
It can be predicted from this figure that the number of training data points
will have an obvious influence on the detection results. Figure 10 shows the
detector-covered area using these two different numbers of training points
(boundary-aware algorithm, hypothesis testing, 99% target coverage): 10(a) the
area trained with 100 points; 10(b) the area trained with 1000 points. When
other control parameters are different, e.g., when the point-wise algorithm is
used, the covered area will not be the same as in Figure 10, but the number of
training points still plays an important role.
Figure 9: Self samples and test data: (a) 100 points of self sample, (b) 1000
points of self sample, (c) 1000 points of test data
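The data setup described above can be sketched as follows. This is a minimal sketch under stated assumptions: the ring-shaped self region, its centre and radii, and the helper names are hypothetical choices for illustration, not the parameters used in the reported experiments.

```python
import math
import random

# Hypothetical ring-shaped self region inside the unit square.
CENTER = (0.5, 0.5)
INNER_RADIUS = 0.1
OUTER_RADIUS = 0.3


def in_self_region(point):
    """True if the point lies in the (assumed) ring-shaped self region."""
    d = math.dist(point, CENTER)
    return INNER_RADIUS <= d <= OUTER_RADIUS


def sample_self_points(n, rng):
    """Draw n training points from the self region by rejection sampling."""
    points = []
    while len(points) < n:
        p = (rng.random(), rng.random())
        if in_self_region(p):
            points.append(p)
    return points


def sample_test_points(n, rng):
    """Draw n uniform test points, each paired with True if it is self."""
    points = [(rng.random(), rng.random()) for _ in range(n)]
    return [(p, in_self_region(p)) for p in points]


rng = random.Random(42)
train = sample_self_points(100, rng)   # self samples only
test = sample_test_points(1000, rng)   # mixture of self and nonself points
```

Only the training set is restricted to the self region; the test set is drawn from the whole unit square so that it contains both classes.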
Figure 10: Detector-covered area: (a) trained with 100 points, (b) trained with
1000 points
The influence of the control parameters and the differences between strategies
were explored with more experiments. On the data side, the difference in
results may come from the number of sample points or the different shapes
(including their specific geometric parameters) of the self region. On the
algorithm side, the difference may come from: target coverage, significance
level of hypothesis testing, method of estimation (naïve estimate or hypothesis
testing), self threshold, and V-detector strategy (point-wise or
boundary-aware). The performance measures we want to compare include detection
rate, false alarm rate, and the number of detectors. The significance level α
is set to 0.1 in the results reported in this paper.
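For concreteness, the two rates compared in these experiments can be computed as below. Detection rate is taken as the fraction of nonself test points that are flagged; the definition of false alarm rate as the fraction of self test points that are flagged is the usual one and is assumed here. The function name is hypothetical.

```python
def detection_and_false_alarm_rate(is_self, flagged):
    """Compute (detection rate, false alarm rate).

    is_self[i] -- True if test point i truly belongs to the self region
    flagged[i] -- True if the detector set reported point i as nonself
    """
    nonself_total = sum(1 for s in is_self if not s)
    self_total = sum(1 for s in is_self if s)
    detected = sum(1 for s, f in zip(is_self, flagged) if not s and f)
    false_alarms = sum(1 for s, f in zip(is_self, flagged) if s and f)
    detection_rate = detected / nonself_total if nonself_total else 0.0
    false_alarm_rate = false_alarms / self_total if self_total else 0.0
    return detection_rate, false_alarm_rate


# Two self points (one wrongly flagged) and four nonself points (three flagged).
is_self = [True, True, False, False, False, False]
flagged = [False, True, True, True, True, False]
rates = detection_and_false_alarm_rate(is_self, flagged)
```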
Figure 11 compares some results of detection rate using naïve estimate and
hypothesis testing for target coverages from 90% through 99%. The number of
sample points is 1000. The boundary-aware algorithm was used. The self region
in Figure 11(a) is an ‘intersection’ shape, which is basically four separated
regions. The one in Figure 11(b) is a pentagram whose circumscribed circle has
radius 1/3. The plot shows the mean of 100 repeated tests; the standard
deviation is shown as error bars on the graph. Results obtained with naïve
estimate and hypothesis testing are plotted together for comparison. Hypothesis
testing has a small but consistent advantage over the naïve method.
Figure 11: Influence of target coverage: (a) Intersection shape, (b) Pentagram
shape. Each panel plots detection rate against target coverage for hypothesis
testing and naïve estimate.
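To make the two estimation methods easier to compare, the sketch below shows one plausible form of each stopping test. It is an illustration under assumptions rather than the exact procedure of the paper: the naïve rule is taken to require m = 1/(1 − p) consecutive covered samples for target coverage p, the hypothesis test is taken to be a one-sided z-test of a proportion with the common sample-size rule np > 5 and n(1 − p) > 5, and all function names are hypothetical.

```python
import math
from statistics import NormalDist


def naive_tries_needed(target_coverage):
    """Consecutive covered samples required by the assumed naive rule."""
    return round(1.0 / (1.0 - target_coverage))


def min_sample_size(target_coverage):
    """Smallest n with n*p > 5 and n*(1-p) > 5 (normal approximation)."""
    p, q = target_coverage, 1.0 - target_coverage
    return math.floor(max(5.0 / p, 5.0 / q)) + 1


def coverage_is_adequate(covered, n, target_coverage, alpha):
    """One-sided z-test: reject 'coverage <= target' at level alpha."""
    p, q = target_coverage, 1.0 - target_coverage
    z = (covered - n * p) / math.sqrt(n * p * q)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    return z > z_alpha
```

With a target coverage of 90% and α = 0.1, `min_sample_size(0.9)` gives a group of 51 random points, and `coverage_is_adequate(49, 51, 0.9, 0.1)` accepts the detector set while 48 covered points out of 51 would not be enough.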
Table 2 again highlights the difference between naïve estimate and hypothesis
testing. The results were obtained with the following setting: boundary-aware
strategy, 1000 self sample points, target coverage 90%, and self threshold
0.05. The numbers are the mean of 100 repeated tests, and the standard
deviation σ is tabulated with the corresponding variables. Results for two
different shapes of self region (‘intersection’ and pentagram) are shown.
Table 2: Performance difference between naïve estimate and hypothesis testing
Columns: detection rate/σ, false alarm rate/σ, number of detectors/σ
Rows: naïve estimate (‘intersection’), hypothesis testing (‘intersection’),
naïve estimate (pentagram), hypothesis testing (pentagram)
Figure 12 shows the difference between the point-wise and boundary-aware
V-detector algorithms when all the other settings are the same. The target
coverage is 99% and 1000 training points are used. The actual self region is
ring-shaped. Figure 12(a) shows the detection rate; (b) shows the false alarm
rate; (c) shows the number of detectors. The boundary-aware algorithm has an
obviously better detection rate under this setting. Although it has a higher
false alarm rate than the point-wise interpretation, especially for very low
self thresholds, this is generally not an issue. The difference in the two
strategies’ performance is related to the fact that the concept of ‘self’ here
is defined by the discrete self points. The improvement brought by the
boundary-aware V-detector is more obvious when it is important to detect the
boundary more accurately. At least two characteristics are noteworthy in
Figure 12(c). First, the number of detectors is nearly constant as long as the
self threshold is larger than 0.05; second, the boundary-aware algorithm
resulted in slightly fewer detectors. The reason for the first tendency is that
the detector candidates were processed in groups of the sample size required by
hypothesis testing. That disadvantage is limited and will not scale with the
number of training points or other parameters. It can be avoided by not using
all the uncovered random points to make new detectors.
Figure 13 shows the detection rate for different shapes of self region over a
range of self thresholds using the boundary-aware algorithm. In total, 10
different shapes are shown in this figure, comprising five types in Figure 8
plus their complementary shapes. The results are consistent, without major
differences. 100 self points were used for training in those results. When 1000
points were used, the differences were even smaller.
Figure 14 shows the detection rate and false alarm rate, respectively,
comparing 100 points and 1000 points of self samples on a pentagram region. The
boundary-aware algorithm with hypothesis testing was used. The advantage of
more training points in detection rate seems small, but the false alarm rate
using 100 points is significantly higher. On the other hand, if the point-wise
algorithm is used, the false alarm rate can be controlled over a range of self
thresholds, but the detection rate with 100 points will be much lower. It is
not surprising that the number of self sample points has a major effect on
detection performance. Improvements in the detector generation and detection
process can hardly eliminate the false alarms, which mainly come from a
definition of self that is based entirely on the discrete samples.
Figure 12: Comparison of two strategies in V-detector: (a) Detection rate,
(b) False alarm rate, (c) Number of detectors, each plotted against self
threshold
Figure 13: Detection rate for various shapes of self region, plotted against
self threshold (legend entries include inverted cross, inverted ring, inverted
intersection, inverted stripe, inverted pentagram)
Figure 14: Performance for different training sizes (1000 points and 100
points): (a) Detection rate, (b) False alarm rate, each plotted against self
threshold
As discussed in the previous sections, this method is not limited to a specific
problem space, detector representation, matching rule, etc. For example,
Euclidean distance, or 2-norm distance, widely used in real-valued
representation and in the earlier experiments, can be generalized to Minkowski
distance of order m, or $L_m$ distance, for any arbitrary m. For a point
$(x_1, \cdots, x_n)$ and a point $(y_1, \cdots, y_n)$ in n-dimensional space,
the 1-norm distance is the Manhattan distance:
$$D_1 = \sum_{i=1}^{n} |x_i - y_i|$$
The m-norm distance is defined as
$$D_m = \left( \sum_{i=1}^{n} |x_i - y_i|^m \right)^{1/m}$$
The infinity norm distance is defined as
$$D_\infty = \max(|x_i - y_i|,\ i = 1, 2, \cdots, n)$$
For different norms, the detector (or recognition region) takes different
geometric shapes and covers a different area. Fig. 15 illustrates the different
shapes in 2-dimensional space. They are shown with the same radius. If we use
radius r to indicate the size, r can be interpreted as the radius of the circle
in the case of 2-norm distance. For Manhattan distance, the detector is a
square rotated by 45° whose edge is $\sqrt{2}r$; for infinity norm, the
detector is a square whose edge is 2r; for any norm between 2 and ∞, the shape
lies between the circle of radius r and the square of edge 2r.
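The effect of the norm on the matching decision can be illustrated with a short sketch. The function names and the sample point are hypothetical; only the standard definition of the Minkowski distance is assumed.

```python
import math


def minkowski_distance(x, y, m):
    """L_m distance between two points; m may be math.inf (infinity norm)."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if m == math.inf:
        return max(diffs)
    return sum(d ** m for d in diffs) ** (1.0 / m)


def detector_matches(center, radius, point, m=2):
    """A point is matched when it lies within the radius under L_m."""
    return minkowski_distance(center, point, m) <= radius


# The same detector (center and radius) evaluated under four norms.
point = (0.56, 0.56)
matches = {
    m: detector_matches((0.5, 0.5), 0.1, point, m)
    for m in (1, 2, 3, math.inf)
}
```

The chosen point lies on the diagonal, where the shapes differ most: it falls outside the Manhattan detector but inside the detectors for the other three norms. In two dimensions the corresponding areas are $2r^2$ for the 1-norm, $\pi r^2$ for the 2-norm, and $4r^2$ for the infinity norm.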
Figure 15: Various geometric shapes of detector (recognition region)
corresponding to different m-norm distances: 1-norm distance (Manhattan),
2-norm distance (Euclidean), 3-norm distance, ∞-norm distance
Tables 3 and 4 show the results obtained using different distance measures for
the “intersection” self region and the “5-circles” self region, respectively.
There are two different implementations of Euclidean distance. One is the
default setting of V-detector, in which the distance measure and matching
process are actually implemented using the square of Euclidean distance for
better speed. The other is implemented as $L_2$ in the general way. In terms of
detection results, there seems to be little difference between the distance
measures for these two examples, except that the Manhattan distance is slightly
more aggressive in raising anomaly alarms. However, the running time of the
algorithm differs noticeably between distance measures. The ∞-norm distance is
the fastest. For general $L_m$ distance, the algorithm runs slower for higher
m.
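The squared-distance shortcut mentioned for the default Euclidean setting can be sketched as follows. The function names are hypothetical and the code only illustrates why the two forms give the same matching decision; it is not taken from the V-detector implementation.

```python
import math


def matches_sqrt(center, radius, point):
    """General form: compute the Euclidean distance, then compare."""
    dist = math.sqrt(sum((c - p) ** 2 for c, p in zip(center, point)))
    return dist <= radius


def matches_squared(center, radius, point):
    """Faster form: compare squared distance with squared radius."""
    squared = sum((c - p) ** 2 for c, p in zip(center, point))
    return squared <= radius * radius
```

Because both sides of the comparison are non-negative, squaring preserves the ordering, so the square root can be skipped for every candidate point.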
Table 3: Effects of different distance measure: ‘Intersection’ shape
Columns: distance measure, detection rate, false alarm rate
Rows include Euclidean (default efficient implementation) and infinity norm
Table 4: Effects of different distance measure: ‘Five circles’ shape
Columns: distance measure, detection rate, false alarm rate
Rows include Euclidean (default efficient implementation) and infinity norm
Although NSAs have been widely used in various applications and have developed
many variations, there is still some skepticism [28], and in some cases
confusion, about whether and how they can be used [25]. For example, detector
coverage and detection rate are two terms that may lead to misunderstanding
when we discuss how well the detector set works. Failure to distinguish them
clearly may muddle an otherwise clear analysis. Coverage is the proportion of
nonself space that is covered by detectors. For a given instance, we usually do
not know the actual value because the nonself space is the unknown we are
seeking. If we discuss coverage in terms of a number, we are making an
assumption about how the nonself space (or self space) can be induced from the
self sample points we have, at least conceptually. Detection rate, on the other
hand, refers to the percentage of nonself sample points that are detected by
the detector set in a particular experiment. Thus, the difference is twofold:
• Coverage depends on how we interpret the training data set. Even for a
defined set of detectors, the value of coverage must be based on some
assumption that cannot be verified. For example, Stibor et al. [29] showed
examples of coverage provided by V-detector [23] at termination. Nine self
points were used to train the system. The discussion of coverage was based on
the assumption that the real self region consists of the perfect circles around
the training points. V-detector’s termination is decided by estimated coverage.
Detection rate is influenced by the coverage as well as by the validity of the
assumption or interpretation we make about the training data.
• Coverage is the ratio of covered nonself space to the entire nonself space.
The probability distribution is usually not considered when evaluating the
coverage. Detection rate, on the other hand, depends on the actual frequency
distribution of test data. The distribution is usually reflected in the real
data. This exposes a weakness of V-detector’s termination criterion. The
statistical estimate of coverage using random sampling does not take into
consideration the probability distribution of the data to be detected. Thus the
conclusion about whether coverage is sufficient is always biased, depending on
how different the actual distribution is from the uniform distribution.
Logically, this cannot be totally solved because the self training data can at
best provide the distribution of the self space.
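The gap between coverage and detection rate can be seen in a toy Monte Carlo experiment. Everything in this sketch is hypothetical and chosen only for illustration: the self region is the left half of the unit square, a single circular detector sits in the upper right, and the nonself test data are concentrated in the lower right.

```python
import math
import random

# Hypothetical single detector: (center, radius).
DETECTOR = ((0.75, 0.75), 0.25)


def is_self(p):
    """Assumed self region: the left half of the unit square."""
    return p[0] < 0.5


def covered(p):
    """True if the detector covers point p."""
    center, radius = DETECTOR
    return math.dist(p, center) <= radius


def fraction_covered(points):
    """Fraction of the given nonself points that the detector covers."""
    return sum(covered(p) for p in points) / len(points)


rng = random.Random(1)

# Coverage: nonself points drawn uniformly from the nonself space.
uniform_points = [(rng.random(), rng.random()) for _ in range(40000)]
uniform_nonself = [p for p in uniform_points if not is_self(p)]
coverage = fraction_covered(uniform_nonself)

# Detection rate: nonself test data concentrated in the lower right corner.
skewed_nonself = [
    (rng.uniform(0.5, 1.0), rng.uniform(0.0, 0.4)) for _ in range(20000)
]
detection_rate = fraction_covered(skewed_nonself)
```

The uniform estimate reports that roughly 39% of the nonself space is covered (the exact ratio is π/8), yet the detection rate on the skewed test data is zero because none of those points falls inside the detector.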
In fact, both NSA in general and a specific flavor like V-detector have their
limitations.
• Limitation of specific matching rules
Following the above discussion, we notice that the matching rule, which usually
takes the form of a distance measure plus a matching threshold, plays a very
important role. Visually, the same concept can be expressed as the geometric
shape of the detectors. Hart [18] noted the importance of choosing the proper
recognition region, which refers to a similar idea as the shape of detectors.
It should be pointed out that this is as important in any other AIS or any
learning paradigm as in negative selection algorithms.
Sometimes, the apparent limitation of a negative selection algorithm is in fact
the limitation of a specific matching rule or detector shape. For real-valued
negative selection algorithms, Euclidean distance and therefore hyperspherical
detectors are commonly used, but they are not the only possibility. The
limitation of Euclidean distance or hyperspherical detectors is not a problem
of real-valued negative selection algorithms as such. In fact, other matching
rules and detector shapes were used in several works, for example, rectangular
detectors [6], hyper-ellipsoid detectors [27], etc.
Negative selection algorithms are methods with great flexibility. First, the
concept of negative selection can be realized in very different flavors of
so-called negative selection algorithms. Second, even for a specific negative
selection algorithm, e.g., V-detector, there are many elements in the model
that are not as inherently limited as they appear. For example, for real-valued
representation, which is not necessarily the only choice in the first place, we
could use very different distance measures or matching rules. We see that
Euclidean distance can be easily extended to $L_m$ distance, which could result
in different shapes of detectors.
• Limitation of one-class classification
Generally speaking, the performance of a classification algorithm or a learning
method depends on the probability distribution of the data. No serious analysis
can be done without taking that distribution into consideration. One-class
classification, however, is an effort to learn when no information about the
second class is available [30]. That means that the probability distribution of
the abnormal data (or nonself data) is never known, according to the basic
assumption. That is the main reason that Freitas et al. [15] cast doubt on
negative selection algorithms. On the other hand, one-class learning is a valid
need and has been studied from various aspects and used in many applications
[26]. In summary, the limitation does exist, but it is not specific to negative
selection algorithms or V-detector. It is noteworthy that only the probability
distribution of the self space can be taken into account in one-class
classification, including negative selection algorithms.
Nevertheless, when used in suitable problems, V-detector can show its strength
compared with more time-tested methods. SVM (Support Vector Machine) is a
popular statistical learning algorithm and has done very well in many
experiments. However, such good results do not guarantee that it can replace
alternative methods like V-detector under all conditions. As a simple example,
let us consider a scenario in which V-detector is much easier to use than SVM.
Two cases were designed so that the self region is disconnected. (1) Fig. 16(a)
is a self region that is a circle partially cut by a cross, which we call
“intersection”. This is one of the synthetic data sets tested in earlier work
[24]. (2) Fig. 16(b) is a self region made of five small circles. Both are
defined over the unit-square 2-dimensional search space.
Figure 16: Two shapes of disconnected self region: (a) ‘Intersection’-shaped,
(b) ‘5 circles’-shaped
Tables 5 and 6 show clearly that SVM does not work as well as the negative
selection algorithm when the default kernel function is used, as in previous
experiments. That means that, at the least, we need to choose a proper kernel
function to make SVM work. The correct choice depends on extra knowledge of the
problem. V-detector obtained significantly better results without the need to
refine the control parameters.
The choice of the kernel function is a well-known major limitation of SVM [4].
In SVM, when the decision function is not a linear function of the data, the
data needs to be mapped to a higher-dimensional space in which a linear
separation can be done. The kernel function plays the key role in the mapping.
The best choice is still a research issue even with prior knowledge. V-detector
and other approaches that do not use a decision function have an obvious
advantage in this respect. The disconnected self regions in the examples
mentioned above are designed to make the possible mapping complicated, so that
a simple kernel function is unlikely to work. Nonlinear problems are very
common in real-world applications, where V-detector can be used more easily.
Furthermore, SVM also has difficulties with very large training datasets and
discrete data [4]. Both are cases where V-detector shows its advantage.
Table 5: Results over Intersection self region
Columns: detection rate, false alarm rate
Rows: SVM ν = 0.05, V-detector r = 0.05, SVM ν = 0.1, V-detector r = 0.1
Table 6: Results over 5-circles self region
Columns: detection rate, false alarm rate
Rows: SVM ν = 0.05, V-detector r = 0.05, SVM ν = 0.1, V-detector r = 0.1
5 Conclusions
A novel strategy of negative selection algorithm called V-detector was
introduced. Its unique features give negative selection algorithms a better
chance of being applied successfully in more applications.
• A statistical approach is integrated to analyze the detector coverage in a
negative selection algorithm. It makes the algorithm more reliable. An
effective strategy was developed for implementation.
• Variable-sized detectors achieve maximum coverage with a limited number of
detectors.
• The boundary-aware algorithm interprets the training points as a collection
instead of independently. Thus, the boundary of the group of training points
can be detected.
• The simple generation process makes this method highly efficient.
The detector generation process in V-detector makes it a perfect platform for
integrating hypothesis testing as a component. Furthermore, the test can be
implemented partly as a byproduct of the generation process without adding much
extra computational cost.
Another advantage of this method is that it applies to any detector scheme and
detection mechanism as long as it is verifiable whether a sample point is
covered or not. For example, extension to other representations will make this
method applicable to a much larger variety of applications.
Many issues in the performance of negative selection algorithms are related to
the properties of the training data. For the comparison and analysis of
negative selection algorithms to be more meaningful, it is important to develop
a framework concerning the fundamental assumptions and to categorize the types
of data to be processed.
References
[1] M. Ayara, J. Timmis, R. de Lemos, L. de Castro, and R. Duncan. Negative
selection: How to generate detectors. In J. Timmis and P. J. Bentley, editors,
Proceedings of the 1st International Conference on Artificial Immune Systems
(ICARIS), volume 1, pages 89–98, University of Kent at Canterbury, September
2002. University of Kent at Canterbury Printing
[2] M. Bereta and T. Burczyński. Comparing binary and real-valued coding in
hybrid immune algorithm for feature selection and classification of ecg
signals. Eng. Appl. Artif. Intell., 20(5):571–585, August 2007.
[3] P. J. C. Branco, J. A. Dente, and R. V. Mendes. Using immunology principle
for fault detection. IEEE Transactions on Industrial Electronics,
50(2):362–373, April 2003.
[4] C. J. C. Burges. A tutorial on support vector machines for pattern
recognition. Data Mining and Knowledge Discovery, 2:121–167, 1998.
[5] F. Cucker and S. Smale. On the mathematical foundations of learning.
Bulletin (New Series) of the American Mathematical Society, 39(1):1–49,
October 2001.
[6] D. Dasgupta and F. Gonzalez. An immunity-based technique to characterize
intrusion in computer networks. IEEE Transactions on Evolutionary Computation,
6(3):1081–1088, June 2002.
[7] D. Dasgupta, K. KrishnaKumar, D. Wong, and M. Berry. Negative selection
algorithm for aircraft fault detection. In Proceedings of Third International
Conference on Artificial Immune Systems (ICARIS 2004), pages 1–13, 2004.
[8] D. Dasgupta, S. Yu, and N. S. Majumdar. MILA - multilevel immune learning
algorithm. In Proceedings of the Genetic and Evolutionary Computation
Conference (GECCO 2003), LNCS 2723, pages 183–194, Chicago, IL, July 12-16
2003. Springer.
[9] L. N. de Castro and J. Timmis. Artificial Immune System: A New
Computational Intelligence Approach. Springer, 2002.
[10] P. D’haeseleer, S. Forrest, and P. Helman. An immunological approach to
change detection: Algorithms, analysis, and implications. In Proceedings of the
1996 IEEE Symposium on Computer Security and Privacy, pages 110–119,
Washington, DC, USA, 1996. IEEE Computer Society.
[11] F. Esponda, E. S. Ackley, S. Forrest, and P. Helman. Online negative
databases. In G. N. et al., editor, Proceedings of Third International
Conference on Artificial Immune Systems (ICARIS 2004), pages 175–188, September
2004.
[12] F. Esponda, S. Forrest, and P. Helman. A formal framework for positive and
negative detection schemes. IEEE Transactions on System, Man, and Cybernetics,
34:357–373, February 2004.
[13] J. L. Fleiss. Statistical Methods for Rates and Proportions. John Wiley &
[14] S. Forrest, A. Perelson, L. Allen, and R. Cherukuri. Self-nonself
discrimination in a computer. In Proceedings of the 1994 IEEE Symposium on
Research in Security and Privacy, pages 202–212, Los Alamitos, CA, 1994. IEEE
Computer Society Press.
[15] A. A. Freitas and J. Timmis. Revisiting the foundation of artificial
immune systems: A problem-oriented perspective. In Proceedings of Second
International Conference on Artificial Immune System (ICARIS 2003), pages
[16] S. M. Garrett. How do we evaluate artificial immune systems? Evolutionary
[17] F. Gonzalez, D. Dasgupta, and L. F. Nino. A randomized real-value negative
selection algorithm. In Proceedings of Second International Conference on
Artificial Immune System (ICARIS 2003), pages 261–272, September
[18] E. Hart. Not all balls are round: An investigation of alternative
recognition-region shapes. In ICARIS, pages 29–42, 2005.
[19] D. Haussler. Probably approximately correct learning. In National
Conference on Artificial Intelligence, pages 1101–1108, citeseer.ist.psu.edu/
[20] C. A. Hawkins and J. E. Weber. Statistical Analysis - Applications to
Business and Economics. Harper & Row, Publishers, New York, 1980.
[21] R. V. Hogg and E. A. Tanis. Probability and Statistical Inference.
Prentice Hall, 6th edition, 2001.
[22] Z. Ji. A boundary-aware negative selection algorithm. In Proceedings of
IASTED International Conference of Artificial Intelligence and Soft Computing
(ASC 2005), pages 379–384, Spain, September 2005.
[23] Z. Ji and D. Dasgupta. Real-valued negative selection algorithm with
variable-sized detectors. In LNCS 3102, Proceedings of GECCO, pages
[24] Z. Ji and D. Dasgupta. Estimating the detector coverage in a negative
selection algorithm. In H.-G. Beyer and et al, editors, GECCO 2005: Proceedings
of the 2005 conference on Genetic and evolutionary computation, volume 1, pages
281–288, Washington DC, USA, 25-29 June 2005. ACM
[25] Z. Ji and D. Dasgupta. Applicability issues of the real-valued negative
selection algorithms. In Genetic and Evolutionary Computation Conference (GECCO
2006), pages 111–118, Seattle, Washington, 8-12 July 2006.
[26] R. E. Sanchez-Yanez, E. V. Kurmyshev, and A. Fernandez. One-class texture
classifier in the CCR feature space. Pattern Recognition Letters,
[27] J. M. Shapiro, G. B. Lamont, and G. L. Peterson. An evolutionary algorithm
to generate hyper-ellipsoid detectors for negative selection. In
Bonabeau, E. Cantu-Paz, D. Dasgupta, K. Deb, J. A. Foster, E. D. de Jong,
Tyrrell, J.-P. Watson, and E. Zitzler, editors, GECCO 2005: Proceedings of the
2005 conference on Genetic and evolutionary computation, volume 1, pages
337–344, Washington DC, USA, 25-29 June 2005. ACM Press.
[28] T. Stibor, P. Mohr, J. Timmis, and C. Eckert. Is negative selection
appropriate for anomaly detection? In H.-G. Beyer, U.-M. O’Reilly, D. V.
gupta, K. Deb, J. A. Foster, E. D. de Jong, H. Lipson, X. Llora, S. Man-
E. Zitzler, editors, GECCO 2005: Proceedings of the 2005 conference on Genetic
and evolutionary computation, volume 1, pages 321–328, Washington DC, USA,
25-29 June 2005. ACM Press.
[29] T. Stibor, J. Timmis, and C. Eckert. A comparative study of real-valued
negative selection to statistical anomaly detection techniques. In ICARIS,
pages 262–275, 2005.
[30] D. M. J. Tax. One-class classification. PhD thesis, Technische
Universiteit
[31] S. Wierzchon. Discriminative power of the receptors activated by
k-contiguous bits rule. Journal of Computer Science and Technology, 1(3):1–