Proceedings of 2nd International Conference on Intelligent Knowledge Systems (IKS-2005), 06-08 July 2005

CLASSIFICATION MODELS FOR INTRUSION DETECTION SYSTEMS

Srinivas Mukkamala, Andrew H. Sung, Rajeev Veeraghattam
email: srinivas@cs.nmt.edu, sung@cs.nmt.edu, rajeev@nmt.edu
Department of Computer Science, New Mexico Tech, Socorro, NM 87801, USA
Institute of Complex Additive Systems Analysis, New Mexico Tech, Socorro, NM 87801, USA

Key words: Machine learning, Intrusion detection systems, CART, MARS, TreeNet
ABSTRACT
This paper describes results concerning the classification capability of supervised machine learning techniques in detecting intrusions using network audit trails. We investigate three well-known machine learning techniques: classification and regression trees (CART), multivariate adaptive regression splines (MARS), and TreeNet. The best model is chosen based on classification accuracy (ROC curve analysis). The results show that high classification accuracies can be achieved in a fraction of the time required by the well-known support vector machines and artificial neural networks. TreeNet performs best for normal traffic, probe, and denial of service (DoS) attacks. CART performs best for user to super user (U2Su) and remote to local (R2L) attacks.
I. INTRODUCTION
Since the ability of an Intrusion Detection System (IDS) to identify a large variety of intrusions in real time with high accuracy is of primary concern, in this paper we consider the performance of machine learning-based IDSs with respect to classification accuracy and false alarm rates. AI techniques have been used to automate the intrusion detection process; they include neural networks, fuzzy inference systems, evolutionary computation, machine learning, support vector machines, etc. [1-6]. Model selection using SVMs and other popular machine learning methods often requires extensive resources and long execution times [7,8]. In this paper, we present a few machine learning methods (MARS, CART, TreeNet) that can perform model selection with higher or comparable accuracies in a fraction of the time required by SVMs.
MARS is a nonparametric regression procedure that is based on the “divide and conquer” strategy, which partitions the input space into regions, each with its own regression equation [9]. CART is a tree-building algorithm that determines a set of if-then logical (split) conditions that permit accurate prediction or classification of classes [10]. TreeNet is a tree-building algorithm that uses stochastic gradient boosting to combine trees via a weighted voting scheme, to achieve accuracy without the drawback of a tendency to be misled by bad data [11,12].
We performed experiments using MARS, CART, and TreeNet for classifying each of the five classes (normal, probe, denial of service, user to super-user, and remote to local) of network traffic patterns in the DARPA data. A brief introduction to MARS and model selection is given in section II. CART and a tree generated for classifying normal vs. intrusions in DARPA data are explained in section III. TreeNet is briefly described in section IV. The intrusion detection data used for the experiments is explained in section V. In section VI, we analyze the classification accuracies of MARS, CART, and TreeNet using ROC curves. Conclusions of our work are given in section VII.
II. MARS
Multivariate Adaptive Regression Splines (MARS) is a nonparametric regression procedure that makes no assumption about the underlying functional relationship between the dependent and independent variables. Instead, MARS constructs this relation from a set of coefficients and basis functions that are entirely “driven” from the data. The method is based on the “divide and conquer” strategy, which partitions the input space into regions, each with its own regression equation. This makes MARS particularly suitable for problems with higher input dimensions, where the curse of dimensionality would likely create problems for other techniques.
Basis functions: MARS uses two-sided truncated functions as basis functions for linear or nonlinear expansion, which approximates the relationships between the response and predictor variables. A simple example is the pair of basis functions (t - x)+ and (x - t)+ [9,11]. The parameter t is the knot of the basis functions (defining the "pieces" of the piecewise linear regression); these knots (parameters) are also determined from the data. The "+" signs next to the terms (t - x) and (x - t) simply denote that only positive results of the respective expressions are considered; otherwise the respective functions evaluate to zero.
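The pair of truncated basis functions can be illustrated with a minimal Python sketch. The function name `hinge_pair` and the knot value 5.0 are hypothetical choices made for illustration only; in MARS the knot is estimated from the data.

```python
def hinge_pair(x, t):
    """Return the two truncated basis functions (t - x)+ and (x - t)+ at knot t."""
    left = max(t - x, 0.0)   # (t - x)+ : nonzero only when x lies below the knot
    right = max(x - t, 0.0)  # (x - t)+ : nonzero only when x lies above the knot
    return left, right


knot = 5.0  # hypothetical knot, chosen by hand for this example
for x in (2.0, 5.0, 7.5):
    print(x, hinge_pair(x, knot))
```

At most one member of the pair is nonzero for any input, which is what produces the piecewise linear behaviour on either side of the knot.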
The MARS Model
The basis functions, together with the model parameters (estimated via least squares estimation), are combined to produce the predictions given the inputs. The general MARS model equation is

$$y = f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X)$$

where the summation is over the M nonconstant terms in the model. y is predicted as a function of the predictor variables X (and their interactions); this function consists of an intercept parameter ($\beta_0$) and the sum of one or more basis functions $h_m(X)$, each weighted by a coefficient ($\beta_m$).
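As a rough illustration of how a fitted MARS model produces a prediction, the sketch below evaluates an intercept plus a weighted sum of basis functions. The coefficients, the knot at 3.0, and the names `hinge` and `mars_predict` are made up for the example and are not taken from the experiments in this paper.

```python
def hinge(z):
    """Truncation (z)+ : keep positive values, map the rest to zero."""
    return max(z, 0.0)


def mars_predict(x, intercept, terms):
    """Evaluate intercept + sum of coefficient * basis_function(x).

    terms is a list of (coefficient, basis_function) pairs.
    """
    return intercept + sum(beta * h(x) for beta, h in terms)


# Hypothetical one-variable model: y = 1.0 + 2.0*(x - 3)+ - 0.5*(3 - x)+
terms = [
    (2.0, lambda x: hinge(x - 3.0)),
    (-0.5, lambda x: hinge(3.0 - x)),
]

print(mars_predict(5.0, 1.0, terms))  # right of the knot: 1.0 + 2.0*2.0
print(mars_predict(1.0, 1.0, terms))  # left of the knot: 1.0 - 0.5*2.0
```

An interaction term would simply be a basis function that multiplies hinges of two different predictors; it is omitted here to keep the example to one variable.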
Model Selection
After the forward stepwise selection of basis functions, a backward procedure is applied in which the model is pruned by removing those basis functions that contribute the smallest increase in the (least squares) goodness-of-fit. A least squares error function (inverse of goodness-of-fit) is computed. The so-called Generalized Cross Validation (GCV) error is a measure of the goodness of fit that takes into account not only the residual error but also the model complexity. It is given by

$$GCV = \frac{\frac{1}{N}\sum_{i=1}^{N} \left(y_i - f(x_i)\right)^2}{\left(1 - C/N\right)^2}$$

with

$$C = 1 + c\,d$$

where N is the number of cases in the data set and d is the effective degrees of freedom, which is equal to the number of independent basis functions. The quantity c is the penalty for adding a basis function. Experiments have shown that the best value for c can be found somewhere in the range 2 < c < 3 [9].
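The GCV criterion can be sketched in a few lines of Python. This assumes the commonly used form GCV = (RSS / N) / (1 - C/N)^2 with C = 1 + c*d; the function name `gcv`, the default penalty of 3.0, and the toy data are illustrative assumptions.

```python
def gcv(y_true, y_pred, n_basis, penalty=3.0):
    """Generalized Cross Validation error for a model with n_basis basis functions.

    Assumes GCV = (RSS / N) / (1 - C / N)**2 with C = 1 + penalty * n_basis.
    """
    n = len(y_true)
    rss = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    c_eff = 1.0 + penalty * n_basis  # effective number of parameters C
    if c_eff >= n:
        raise ValueError("model is too complex for the number of cases")
    return (rss / n) / (1.0 - c_eff / n) ** 2


# Toy data: ten cases, every prediction off by 0.5 (RSS = 2.5).
y_true = [float(i) for i in range(1, 11)]
y_pred = [y + 0.5 for y in y_true]

print(gcv(y_true, y_pred, n_basis=1))
print(gcv(y_true, y_pred, n_basis=2))
```

With the residual error held fixed, the second call reports a larger GCV value, which is the complexity penalty that drives the backward pruning step.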
III. CART
CART builds classification and regression trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification) [10,11]. CART analysis consists of four basic steps [12]¹:
The first step consists of tree building, during which a tree is built using recursive splitting of nodes. Each resulting node is assigned a predicted class, based on the distribution of classes in the learning dataset which would occur in that node and the decision cost matrix.
The second step consists of stopping the tree building process. At this point a “maximal” tree has been produced, which probably greatly overfits the information contained within the learning dataset.
The third step consists of tree “pruning,” which results in the creation of a sequence of simpler and simpler trees, through the cutting off of increasingly important nodes.
The fourth step consists of optimal tree selection, during which the tree which fits the information in the learning dataset, but does not overfit the information, is selected from among the sequence of pruned trees.
¹ Reference [12] was accidentally omitted during the editing process of the original manuscript. The complete reference is: R. J. Lewis, An Introduction to Classification and Regression Tree (CART) Analysis. Annual Meeting of the Society for Academic Emergency Medicine, 2000.
The decision tree begins with a root node t derived from whichever variable in the feature space minimizes a measure of the impurity of the two sibling nodes. The measure of the impurity or entropy at node t, denoted by i(t), is given by [11]:

$$i(t) = -\sum_{j=1}^{k} p(w_j \mid t) \log p(w_j \mid t)$$

where k is the number of classes and $p(w_j \mid t)$ is the proportion of patterns $x_i$ allocated to class $w_j$ at node t. Each non-terminal node is then divided into two further nodes, $t_L$ and $t_R$, such that $p_L$, $p_R$ are the proportions of entities passed to the new nodes $t_L$, $t_R$ respectively. The best division s is that which maximizes the difference given in [11]:

$$\Delta i(s, t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R)$$
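The impurity measure and the split criterion can be illustrated with a short Python sketch. It assumes the entropy form of i(t) and the impurity decrease i(t) - pL*i(tL) - pR*i(tR); the function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def impurity(labels):
    """Entropy impurity i(t) of a node holding the given class labels."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Decrease in impurity when the parent node is split into left and right."""
    p_left = len(left) / len(parent)    # proportion of cases sent to tL
    p_right = len(right) / len(parent)  # proportion of cases sent to tR
    return impurity(parent) - p_left * impurity(left) - p_right * impurity(right)


parent = ["normal"] * 4 + ["attack"] * 4

# A split that separates the classes perfectly removes all impurity.
print(impurity_decrease(parent, ["normal"] * 4, ["attack"] * 4))

# A split that leaves both children as mixed as the parent gains nothing.
mixed = ["normal", "normal", "attack", "attack"]
print(impurity_decrease(parent, mixed, mixed))
```

A tree builder would evaluate this quantity for every candidate split at a node and keep the split with the largest value.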
The decision tree grows by means of successive sub-divisions until a stage is reached in which there is no significant decrease in the measure of impurity when a further division is implemented. When this stage is reached, the node t is not sub-divided further, and automatically becomes a terminal node. The class $w_j$ associated with the terminal node t is that which maximizes the conditional probability $p(w_j \mid t)$.
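The labelling rule for a terminal node can be sketched as follows. This is an illustrative assumption about how the rule might be coded, using the class proportions in the node as the estimate of the conditional probability; the function name `terminal_class` and the toy node are hypothetical, and the decision cost matrix is ignored.

```python
from collections import Counter


def terminal_class(labels):
    """Return the class with the highest proportion in a terminal node,
    together with that proportion."""
    n = len(labels)
    proportions = {cls: k / n for cls, k in Counter(labels).items()}
    best = max(proportions, key=proportions.get)
    return best, proportions[best]


# Hypothetical terminal node holding ten training records.
node = ["probe"] * 7 + ["normal"] * 2 + ["dos"]
print(terminal_class(node))
```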
The number of nodes generated and the terminal node values for each class, for the DARPA data set described in section V, are presented in Table 1.
Figure 1. Tree for classifying normal vs. intrusions
Figure 1 represents a classification tree generated for the DARPA data described in section V for classifying normal activity vs. intrusive activity. Each terminal node describes a data value; each record is classified into one of the terminal nodes through the decisions made at the non-terminal nodes that lead from the root to that leaf.
Table 1. Summary of tree splitters for all five classes.
Class     No. of Nodes   Terminal Node Value
Normal    23             0.016
Probe     22             0.019
DoS       16             0.004
U2Su      7              0.113
R2L       10             0.025
IV. TREENET
In a TreeNet model, classification and regression models are built up gradually through a potentially large collection of small trees. Models typically consist of a few dozen to several hundred trees, each normally no larger than two to eight terminal nodes. The model is similar to a long series expansion (such as a Fourier or Taylor series): a sum of factors that becomes progressively more accurate as the expansion continues. The expansion can be written as [11,13]:

$$F(X) = F_0 + \beta_1 T_1(X) + \beta_2 T_2(X) + \cdots + \beta_M T_M(X)$$

where each $T_i$ is a small tree. Each tree improves on its predecessors through an error-correcting strategy. Individual trees may be as small as one split, but the final models can be accurate and are resistant to overfitting.
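The error-correcting expansion can be illustrated with a small gradient boosting sketch in plain Python. This is not the TreeNet implementation: it assumes squared-error loss, one-split trees on a single input, a fixed weight (`learning_rate`) for every tree, and no random subsampling, so it is deterministic rather than stochastic. All names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (two terminal nodes) to the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every split leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Build F(x) = F0 + learning_rate * (T1(x) + ... + TM(x))."""
    f0 = sum(ys) / len(ys)  # initial constant model F0
    trees = []
    preds = [f0] * len(xs)
    for _ in range(n_trees):
        # Each new tree is fitted to the errors left by its predecessors.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + sum(learning_rate * t(x) for t in trees)


xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
model = boost(xs, ys)
print(model(2.0), model(8.0))
```

Each stump alone is a very weak model, but the training error of the sum decreases as trees are added, which mirrors the series-expansion description above.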
V. DATA USED FOR ANALYSIS
A subset of the DARPA intrusion detection data set is used for offline analysis. In the DARPA intrusion detection evaluation program, an environment was set up to acquire raw TCP/IP dump data for a network by simulating a typical U.S. Air Force LAN. The LAN was operated like a real environment, but was blasted with multiple attacks [14,15]. For each TCP/IP connection, 41 various quantitative and qualitative features were extracted [16] for intrusion analysis.
The 41 features extracted fall into three categories: “intrinsic” features, which describe the individual TCP/IP connections and can be obtained from network audit trails; “content-based” features, which describe the payload of the network packet and can be obtained from the data portion of the network packet; and “traffic-based” features, which are computed using a specific window (connection time or number of connections). DoS and Probe attacks involve several connections in a short time frame, whereas R2L and U2Su attacks are embedded in the data portions of the connection and often involve just a single connection; “traffic-based” features therefore play an important role in deciding whether a particular network activity is engaged in probing or not.
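A “traffic-based” feature computed over a time window can be sketched as below. The record layout (timestamp, destination host), the two-second window, and the function name `count_same_host` are assumptions made for illustration; they are not the exact feature definitions used to build the data set.

```python
def count_same_host(connections, window_seconds=2.0):
    """For each connection, count the earlier connections to the same
    destination host that started within the preceding time window.

    connections is a list of (timestamp, destination) pairs sorted by time.
    """
    counts = []
    for i, (t, dst) in enumerate(connections):
        n = sum(1 for t_prev, dst_prev in connections[:i]
                if dst_prev == dst and t - t_prev <= window_seconds)
        counts.append(n)
    return counts


# Hypothetical connection records: (seconds since start, destination host).
connections = [
    (0.0, "10.0.0.5"),
    (0.5, "10.0.0.5"),
    (1.0, "10.0.0.9"),
    (1.5, "10.0.0.5"),
    (4.0, "10.0.0.5"),
]
print(count_same_host(connections))
```

A burst of connections to one host pushes the count up, which is the kind of signal that distinguishes multi-connection DoS or probing activity from single-connection attacks.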
Attack types fall into four main categories:
Denial of Service (DoS) Attacks: A denial of service attack is a class of attacks in which an attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine. Examples are Apache2, Back, Land, Mail bomb, SYN Flood, Ping of death, Process table, Smurf, Syslogd, Teardrop, Udpstorm.
User to Superuser or Root Attacks (U2Su): User to root exploits are a class of attacks in which an attacker starts out with access to a normal user account on the system and is able to exploit a vulnerability to gain root access to the system. Examples are Eject, Ffbconfig, Fdformat, Loadmodule, Perl, Ps, Xterm.
Remote to User Attacks (R2L): A remote to user attack is a class of attacks in which an attacker who does not have an account on a machine sends packets to that machine over a network and exploits some vulnerability to gain local access as a user of that machine. Examples are Dictionary, Ftp_write, Guest, Imap, Named, Phf, Sendmail, Xlock, Xsnoop.
Probing (Probe): Probing is a class of attacks in which an attacker scans a network of computers to gather information or find known vulnerabilities. An attacker with a map of machines and services that are available on a network can use this information to look for exploits. Examples are Ipsweep, Mscan, Nmap, Saint, Satan.
In our experiments, we perform 5-class classification. The (training and testing) data set contains 11982 randomly generated points from the data set representing the five classes, with the number of data from each class proportional to its size, except that the smallest class is completely included. The set of 5092 training data and 6890 testing data is divided into five classes: normal, probe, denial of service attacks, user to super user, and remote to local attacks. The attack data is a collection of 22 different types of instances that belong to the four attack classes described in this section; the remainder is normal data. Note that two separate, randomly generated data sets of sizes 5092 and 6890 are used for training and testing, respectively, for MARS, CART, and TreeNet. Section VI summarizes the classifier accuracies.
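The sampling scheme described above (class sizes proportional to the original data, with the smallest class included in full) might be sketched as follows. The function name `sample_by_class`, the toy class sizes, and the rounding rule are illustrative assumptions; the actual procedure used to draw the 11982 points is not specified beyond the description in the text.

```python
import random


def sample_by_class(records, total, seed=0):
    """Draw about `total` items: the smallest class is kept whole and the
    other classes are sampled in proportion to their sizes.

    records maps a class label to the list of items in that class.
    """
    rng = random.Random(seed)
    smallest = min(records, key=lambda label: len(records[label]))
    sample = {smallest: list(records[smallest])}
    remaining = total - len(records[smallest])
    others = {label: items for label, items in records.items() if label != smallest}
    pool = sum(len(items) for items in others.values())
    for label, items in others.items():
        # Rounding can leave the overall total off by a few items.
        k = min(len(items), round(remaining * len(items) / pool))
        sample[label] = rng.sample(items, k)
    return sample


# Toy class sizes, not the real DARPA counts.
records = {
    "normal": list(range(1000)),
    "dos": list(range(3000)),
    "u2su": list(range(10)),
}
sample = sample_by_class(records, total=410)
print({label: len(items) for label, items in sample.items()})
```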
VI. ROC CURVES
Detection rates and false alarms are evaluated for the five-class pattern in the DARPA data set, and the obtained results are used to form the ROC curves. The point (0,1) is the perfect classifier, since it classifies all positive cases and negative cases correctly. Thus an ideal system will begin by identifying all the positive examples, so the curve will rise to (0,1) immediately, having a zero rate of false positives, and then continue along to (1,1).
Figures 2 to 6 show the ROC curves of the detection models by attack categories as well as on all intrusions. In each of these ROC plots, the x-axis is the false positive rate, calculated as the percentage of normal connections considered as intrusions; the y-axis is the detection rate, calculated as the percentage of intrusions detected. A data point in the upper left corner corresponds to optimal high performance, i.e., high detection rate with low false alarm rate. The area under the ROC curves and the numbers of false positives and false negatives are presented in Tables 2 to 6.
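How an ROC curve and its area are obtained from classifier scores can be sketched in plain Python. The scores and labels below are invented for illustration and are unrelated to the results in Tables 2 to 6; the function names are hypothetical, and the area is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (false positive rate, detection rate) points, sweeping the
    decision threshold from the highest score down.

    labels: 1 = intrusion, 0 = normal.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last record of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points)
print(auc(points))
```

A classifier that ranks every intrusion above every normal connection passes through the point (0, 1) and has an area of 1.0.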
Table 2. Summary of classification accuracy for normal.
Curve     Area    False Positives   False Negatives
MARS      0.993   56                4
CART      0.991   75                5
TreeNet   0.997   18                0
Figure 2. Classification accuracy for normal
Table 3. Summary of classification accuracy for probe.
Curve     Area    False Positives   False Negatives
MARS      0.777   64                305
CART      0.998   24                0
TreeNet   0.999   14                0
Figure 3. Classification accuracy for probe
Table 4. Summary of classification accuracy for DoS.
Curve     Area    False Positives   False Negatives
MARS      0.945   185               169
CART      0.998   1                 16
TreeNet   0.998   3                 9
Figure 4. Classification accuracy for DoS
Table 5. Summary of classification accuracy for U2Su.
Curve     Area    False Positives   False Negatives
MARS      0.700   3                 15
CART      0.720   3                 14
TreeNet   0.699   7                 16
Figure 5. Classification accuracy for U2Su
Table 6. Summary of classification accuracy for R2L.
Curve     Area    False Positives   False Negatives
MARS      0.992   17                7
CART      0.993   15                6
TreeNet   0.992   19                7
Figure 6. Classification accuracy for R2L
VII. CONCLUSIONS
A number of observations and conclusions are drawn from the results reported in this paper:
TreeNet easily achieves high detection accuracy (higher than 99%) for each of the 5 classes of DARPA data. TreeNet performed the best for normal with 18 false positives (FP) and 0 false negatives (FN), probe with 14 FP and 0 FN, and denial of service attacks (DoS) with 3 FP and 9 FN.
CART performed the best for user to super user (U2Su) with 3 FP and 14 FN and remote to local (R2L) with 15 FP and 6 FN.
We demonstrate that using these fast-executing machine learning methods we can achieve high classification accuracies in a fraction of the time required by the well-known support vector machines and artificial neural networks.
We note, however, that the differences in accuracy figures tend to be small and may not be statistically significant, especially in view of the fact that the 5 classes of patterns differ tremendously in their sizes. More definitive conclusions can perhaps only be drawn after analyzing more comprehensive sets of network data.
ACKNOWLEDGEMENTS
Partial support for this research received from ICASA (Institute for Complex Additive Systems Analysis, a division of New Mexico Tech), a DoD IASP grant, and an NSF SFS Capacity Building grant is gratefully acknowledged.
REFERENCES
1. S. Mukkamala, G. Janowski, A. H. Sung, Intrusion Detection Using Neural Networks and Support Vector Machines. Proceedings of IEEE International Joint Conference on Neural Networks 2002, IEEE Press, pp. 1702-1707, 2002.
2. M. Fugate, J. R. Gattiker, Computer Intrusion Detection with Classification and Anomaly Detection, Using SVMs. International Journal of Pattern Recognition and Artificial Intelligence, Vol. 17(3), pp. 441-458, 2003.
3. W. Hu, Y. Liao, V. R. Vemuri, Robust Support Vector Machines for Anomaly Detection in Computer Security. International Conference on Machine Learning, pp. 168-174, 2003.
4. K. A. Heller, K. M. Svore, A. D. Keromytis, S. J. Stolfo, One Class Support Vector Machines for Detecting Anomalous Windows Registry Accesses. Proceedings of IEEE Conference Data Mining Workshop on Data Mining for Computer Security, 2003.
5. A. Lazarevic, L. Ertoz, A. Ozgur, J. Srivastava, V. Kumar, A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection. Proceedings of Third SIAM Conference on Data Mining, 2003.
6. S. Mukkamala, A. H. Sung, Feature Selection for Intrusion Detection Using Neural Networks and Support Vector Machines. Journal of the Transportation Research Board of the National Academies, Transportation Research Record No 1822: 33-39, 2003.
7. S. J. Stolfo, W. Fan, W. Lee, A. Prodromidis, P. K. Chan, Cost-based Modeling and Evaluation for Data Mining with Application to Fraud and Intrusion Detection: Results from the JAM Project, 1999.
8. S. Mukkamala, B. Ribeiro, A. H. Sung, Model Selection for Kernel Based Intrusion Detection Systems. Proceedings of International Conference on Adaptive and Natural Computing Algorithms (ICANNGA), Springer-Verlag, pp. 458-461, 2005.
9. T. Hastie, R. Tibshirani, J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
10. L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees. Wadsworth and Brooks/Cole Advanced Books and Software, 1986.
11. Salford Systems. TreeNet, CART, MARS Manual.
12. R. J. Lewis, An Introduction to Classification and Regression Tree (CART) Analysis. Annual Meeting of the Society for Academic Emergency Medicine, 2000.
13. J. H. Friedman, Stochastic Gradient Boosting. Journal of Computational Statistics and Data Analysis, Elsevier Science, Vol. 38, pp. 367-378, 2002.
14. K. Kendall, A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems. Master's Thesis, Massachusetts Institute of Technology (MIT), 1998.
15. S. E. Webster, The Development and Analysis of Intrusion Detection Algorithms. Master's Thesis, MIT, 1998.
16. W. Lee, S. J. Stolfo, A Framework for Constructing Features and Models for Intrusion Detection Systems. ACM Transactions on Information and System Security, Vol. 3, pp. 227-261, 2000.