CLASSIFICATION MODELS FOR INTRUSION DETECTION SYSTEMS

Proceedings of 2nd International Conference on Intelligent Knowledge Systems (IKS-2005), 06-08 July 2005




Srinivas Mukkamala (email: srinivas@cs.nmt.edu)
Andrew H. Sung (email: sung@cs.nmt.edu)
Rajeev Veeraghattam (email: rajeev@nmt.edu)

Department of Computer Science, New Mexico Tech, Socorro, NM 87801, USA
Institute of Complex Additive Systems Analysis, New Mexico Tech, Socorro, NM 87801, USA

Key words: Machine learning, Intrusion detection systems, CART, MARS, TreeNet


ABSTRACT

This paper describes results concerning the classification capability of supervised machine learning techniques in detecting intrusions using network audit trails. We investigate three well-known machine learning techniques: classification and regression trees (CART), multivariate adaptive regression splines (MARS), and TreeNet. The best model is chosen based on classification accuracy (ROC curve analysis). The results show that high classification accuracies can be achieved in a fraction of the time required by the well-known support vector machines and artificial neural networks. TreeNet performs the best for normal, probe, and denial of service attacks (DoS); CART performs the best for user to super-user (U2Su) and remote to local (R2L) attacks.


I. INTRODUCTION


Since the ability of an Intrusion Detection System (IDS) to identify a large variety of intrusions in real time with high accuracy is of primary concern, we will in this paper consider the performance of machine learning-based IDSs with respect to classification accuracy and false alarm rates.


AI techniques have been used to automate the intrusion detection process; they include neural networks, fuzzy inference systems, evolutionary computation, machine learning, support vector machines, etc. [1-6]. Often, model selection using SVMs and other popular machine learning methods requires extensive resources and long execution times [7,8]. In this paper, we present a few machine learning methods (MARS, CART, TreeNet) that can perform model selection with higher or comparable accuracies in a fraction of the time required by the SVMs.


MARS is a nonparametric regression procedure based on the "divide and conquer" strategy, which partitions the input space into regions, each with its own regression equation [9]. CART is a tree-building algorithm that determines a set of if-then logical (split) conditions that permit accurate prediction or classification of classes [10]. TreeNet is a tree-building algorithm that uses stochastic gradient boosting to combine trees via a weighted voting scheme, to achieve accuracy without the drawback of a tendency to be misled by bad data [11,12].

We performed experiments using MARS, CART, and TreeNet for classifying each of the five classes (normal, probe, denial of service, user to super-user, and remote to local) of network traffic patterns in the DARPA data.


A brief introduction to MARS and model selection is given in Section II. CART, and a tree generated for classifying normal vs. intrusions in the DARPA data, is explained in Section III. TreeNet is briefly described in Section IV. The intrusion detection data used for the experiments is explained in Section V. In Section VI, we analyze the classification accuracies of MARS, CART, and TreeNet using ROC curves. Conclusions of our work are given in Section VII.


II. MARS


Multivariate Adaptive Regression Splines (MARS) is a nonparametric regression procedure that makes no assumption about the underlying functional relationship between the dependent and independent variables. Instead, MARS constructs this relation from a set of coefficients and basis functions that are entirely "driven" from the data.

The method is based on the "divide and conquer" strategy, which partitions the input space into regions, each with its own regression equation. This makes MARS particularly suitable for problems with higher input dimensions, where the curse of dimensionality would likely create problems for other techniques.


Basis functions: MARS uses two-sided truncated functions as basis functions for linear or nonlinear expansion, which approximates the relationships between the response and predictor variables. A simple example is the pair of basis functions (t-x)+ and (x-t)+ [9,11]. Parameter t is the knot of the basis functions (defining the "pieces" of the piecewise linear regression); these knots are also determined from the data. The "+" signs next to the terms (t-x) and (x-t) simply denote that only positive results of the respective equations are considered; otherwise the respective functions evaluate to zero.
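The pair of truncated (hinge) functions just described can be sketched in a few lines; the knot t = 5.0 and the input values below are illustrative, not taken from the paper:

```python
def hinge_left(x, t):
    """(t - x)+ : the positive part of (t - x); zero whenever x >= t."""
    return max(t - x, 0.0)

def hinge_right(x, t):
    """(x - t)+ : the positive part of (x - t); zero whenever x <= t."""
    return max(x - t, 0.0)

# Away from the knot exactly one of the pair is active; at the knot both vanish.
print(hinge_left(2.0, 5.0))   # (5 - 2)+ = 3.0
print(hinge_right(2.0, 5.0))  # (2 - 5)+ = 0.0
```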



The MARS Model

The basis functions, together with the model parameters (estimated via least squares estimation), are combined to produce the predictions given the inputs. The general MARS model can be written as

    y = b0 + sum_{m=1..M} bm * Hm(X)

where the summation is over the M nonconstant terms in the model; y is predicted as a function of the predictor variables X (and their interactions). This function consists of an intercept parameter (b0) and the sum of one or more basis functions Hm(X), each weighted by a coefficient (bm).

Model Selection

After implementing the forward stepwise selection of basis functions, a backward procedure is applied in which the model is pruned by removing those basis functions that are associated with the smallest increase in the (least squares) goodness-of-fit. A least squares error function (inverse of goodness-of-fit) is computed. The so-called Generalized Cross Validation (GCV) error is a measure of the goodness of fit that takes into account not only the residual error but also the model complexity. It is given by

    GCV = (1/N) * sum_{i=1..N} [yi - f(xi)]^2 / [1 - C(M)/N]^2

with

    C(M) = 1 + c*d

where N is the number of cases in the data set and d is the effective degrees of freedom, which is equal to the number of independent basis functions. The quantity c is the penalty for adding a basis function. Experiments have shown that the best value for c can be found somewhere in the range 2 < c < 3 [9].
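As a rough sketch of the GCV criterion described above (not the Salford implementation), the penalty value c = 2.5 and the toy data below are illustrative assumptions:

```python
def gcv(y, y_hat, d, c=2.5):
    """Generalized Cross Validation error: the mean squared residual,
    inflated by the complexity term C(M) = 1 + c*d, where d is the
    effective degrees of freedom (number of independent basis
    functions) and c is the penalty per added basis function."""
    n = len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    complexity = 1.0 + c * d
    return (sse / n) / (1.0 - complexity / n) ** 2

# A model with more basis functions must cut the residual error enough
# to offset its larger complexity term, or its GCV score worsens.
y     = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
fit_a = [1.1, 2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9]   # rough fit, d = 2
print(gcv(y, fit_a, d=2) > gcv(y, y, d=2))          # perfect fit has GCV 0
```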


III. CART


CART builds classification and regression trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification) [10,11].


CART analysis consists of four basic steps¹ [12]:

- The first step consists of tree building, during which a tree is built using recursive splitting of nodes. Each resulting node is assigned a predicted class, based on the distribution of classes in the learning dataset which would occur in that node and the decision cost matrix.

- The second step consists of stopping the tree building process. At this point a "maximal" tree has been produced, which probably greatly overfits the information contained within the learning dataset.

- The third step consists of tree "pruning," which results in the creation of a sequence of simpler and simpler trees, through the cutting off of increasingly important nodes.

- The fourth step consists of optimal tree selection, during which the tree which fits the information in the learning dataset, but does not overfit the information, is selected from among the sequence of pruned trees.

¹ Reference [12] was accidentally omitted during the editing process of the original manuscript. The complete reference is: R. J. Lewis. An Introduction to Classification and Regression Tree (CART) Analysis. Annual Meeting of the Society for Academic Emergency Medicine, 2000.


The decision tree begins with a root node t derived from whichever variable in the feature space minimizes a measure of the impurity of the two sibling nodes. The measure of the impurity or entropy at node t, denoted by i(t), is given by [11]:

    i(t) = - sum_j p(wj|t) * log p(wj|t)

where p(wj|t) is the proportion of patterns xi allocated to class wj at node t. Each non-terminal node is then divided into two further nodes, tL and tR, such that pL, pR are the proportions of entities passed to the new nodes tL, tR respectively. The best division is that which maximizes the difference [11]:

    delta_i(t) = i(t) - pL * i(tL) - pR * i(tR)




The decision tree grows by means of successive sub-divisions until a stage is reached in which there is no significant decrease in the measure of impurity when a further division is implemented. When this stage is reached, the node t is not sub-divided further, and automatically becomes a terminal node. The class wj associated with the terminal node t is that which maximizes the conditional probability p(wj|t).

The number of nodes generated and the terminal node values for each class, for the DARPA data set described in Section V, are presented in Table 1.
















Figure 1. Tree for classifying normal vs. intrusions




Figure 1 represents a classification tree generated from the DARPA data described in Section V for classifying normal activity vs. intrusive activity. Each terminal node describes a data value; each record is classified into one of the terminal nodes through the decisions made at the non-terminal nodes that lead from the root to that leaf.


Table 1. Summary of tree splitters for all five classes.

Class     No. of Nodes    Terminal Node Value
Normal    23              0.016
Probe     22              0.019
DoS       16              0.004
U2Su       7              0.113
R2L       10              0.025


IV. TREENET


In a TreeNet model, classification and regression models are built up gradually through a potentially large collection of small trees. A model typically consists of from a few dozen to several hundred trees, each normally no larger than two to eight terminal nodes. The model is similar to a long series expansion (such as a Fourier or Taylor series): a sum of factors that becomes progressively more accurate as the expansion continues. The expansion can be written as [11,13]:

    F(x) = F0 + b1*T1(x) + b2*T2(x) + ... + bM*TM(x)

where each Ti is a small tree.

Each tree improves on its predecessors through an error-correcting strategy. Individual trees may be as small as one split, but the final models can be accurate and are resistant to overfitting.
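A minimal sketch of this additive idea: one-split stumps fitted to the current residuals and added with a small learning rate. This illustrates the structure of gradient boosting, not the TreeNet product itself; the toy 1-D data is invented:

```python
def fit_stump(x, residual):
    """One-split regression tree: pick the threshold minimizing squared error."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residual) if xi <= t]
        right = [r for xi, r in zip(x, residual) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi, t=t, lm=lm, rm=rm: lm if xi <= t else rm

def boost(x, y, n_trees=50, rate=0.1):
    """F(x) = F0 + sum_i rate * T_i(x): start from the mean prediction,
    then repeatedly fit a stump to the residuals and add it, shrunk by
    the learning rate (the error-correcting strategy described above)."""
    f0 = sum(y) / len(y)
    pred = [f0] * len(x)
    trees = []
    for _ in range(n_trees):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residual)
        trees.append(stump)
        pred = [pi + rate * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: f0 + rate * sum(t(xi) for t in trees)

# Toy regression: a step function is recovered almost exactly.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = boost(x, y)
print(round(model(0.5)), round(model(4.5)))  # 0 1
```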


V. DATA USED FOR ANALYSIS


A subset of the DARPA intrusion detection data set is used for offline analysis. In the DARPA intrusion detection evaluation program, an environment was set up to acquire raw TCP/IP dump data for a network by simulating a typical U.S. Air Force LAN. The LAN was operated like a real environment, but was blasted with multiple attacks [14,15]. For each TCP/IP connection, 41 various quantitative and qualitative features were extracted [16] for intrusion analysis.

The 41 extracted features fall into three categories: "intrinsic" features, which describe the individual TCP/IP connections and can be obtained from network audit trails; "content-based" features, which describe the payload of the network packet and can be obtained from the data portion of the packet; and "traffic-based" features, which are computed using a specific window (connection time or number of connections). Since DoS and Probe attacks involve several connections in a short time frame, whereas R2L and U2Su attacks are embedded in the data portions of the connection and often involve just a single connection, "traffic-based" features play an important role in deciding whether a particular network activity is engaged in probing or not.

Attack types fall into four main categories:

- Denial of Service (DoS) Attacks: A denial of service attack is a class of attacks in which an attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine. Examples are Apache2, Back, Land, Mail bomb, SYN Flood, Ping of death, Process table, Smurf, Syslogd, Teardrop, Udpstorm.

- User to Superuser or Root Attacks (U2Su): User to root exploits are a class of attacks in which an attacker starts out with access to a normal user account on the system and is able to exploit a vulnerability to gain root access to the system. Examples are Eject, Ffbconfig, Fdformat, Loadmodule, Perl, Ps, Xterm.

- Remote to User Attacks (R2L): A remote to user attack is a class of attacks in which an attacker who does not have an account on a machine sends packets to that machine over a network and exploits some vulnerability to gain local access as a user of that machine. Examples are Dictionary, Ftp_write, Guest, Imap, Named, Phf, Sendmail, Xlock, Xsnoop.

- Probing (Probe): Probing is a class of attacks in which an attacker scans a network of computers to gather information or find known vulnerabilities. An attacker with a map of machines and services that are available on a network can use this information to look for exploits. Examples are Ipsweep, Mscan, Nmap, Saint, Satan.


In our experiments, we perform 5-class classification. The (training and testing) data set contains 11982 randomly generated points from the data set representing the five classes, with the number of data points from each class proportional to its size, except that the smallest class is completely included. The set of 5092 training data and 6890 testing data is divided into five classes: normal, probe, denial of service attacks, user to super-user, and remote to local attacks. The attack data is a collection of 22 different types of instances that belong to the four attack categories described above, and the rest is the normal data. Note that two randomly generated, separate data sets of sizes 5092 and 6890 are used for training and testing MARS, CART, and TreeNet respectively. Section VI summarizes the classifier accuracies.
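The sampling scheme just described (draws proportional to class size, with the smallest class kept whole) can be sketched as follows; the class names, pool sizes, and target total are invented for illustration, not the DARPA counts:

```python
import random

def sample_by_class(records_by_class, total):
    """Draw `total` points, proportionally to class size, except that the
    smallest class is included completely."""
    random.seed(0)  # reproducible illustration
    smallest = min(records_by_class, key=lambda c: len(records_by_class[c]))
    sample = {smallest: list(records_by_class[smallest])}
    remaining = total - len(sample[smallest])
    rest = {c: r for c, r in records_by_class.items() if c != smallest}
    rest_size = sum(len(r) for r in rest.values())
    for cls, recs in rest.items():
        k = round(remaining * len(recs) / rest_size)  # proportional share
        sample[cls] = random.sample(recs, min(k, len(recs)))
    return sample

# Toy pool: "u2su" is tiny, so all 10 of its records survive the draw.
pool = {"normal": list(range(600)), "dos": list(range(300)),
        "u2su": list(range(10))}
picked = sample_by_class(pool, total=200)
print(len(picked["u2su"]))  # 10
```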


VI. ROC CURVES


Detection rates and false alarms are evaluated for the five-class patterns in the DARPA data set, and the obtained results are used to form the ROC curves. The point (0,1) is the perfect classifier, since it classifies all positive and negative cases correctly. Thus an ideal system will initiate by identifying all the positive examples, and so the curve will rise to (0,1) immediately, having a zero rate of false positives, and then continue along to (1,1).

Figures 2 to 6 show the ROC curves of the detection models by attack categories as well as on all intrusions. In each of these ROC plots, the x-axis is the false positive rate, calculated as the percentage of normal connections considered as intrusions; the y-axis is the detection rate, calculated as the percentage of intrusions detected. A data point in the upper left corner corresponds to optimal high performance, i.e., a high detection rate with a low false alarm rate. The areas under the ROC curves, and the numbers of false positives and false negatives, are presented in Tables 2 to 6.
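The quantities plotted in these figures (false positive rate on x, detection rate on y, area under the curve) can be computed from raw classifier scores as in this sketch; the scores and labels are invented:

```python
def roc_points(labels, scores):
    """One (FPR, detection-rate) point per threshold, sweeping high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Perfectly separated scores pass through the ideal (0,1) corner: area 1.0.
labels = [1, 1, 1, 0, 0, 0]           # 1 = intrusion, 0 = normal
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
pts = roc_points(labels, scores)
print(round(auc(pts), 6))  # 1.0
```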


Table 2. Summary of classification accuracy for normal.

Curve     Area     False Positives    False Negatives
MARS      0.993    56                 4
CART      0.991    75                 5
TreeNet   0.997    18                 0



Figure 2. Classification accuracy for normal


Table 3. Summary of classification accuracy for probe.

Curve     Area     False Positives    False Negatives
MARS      0.777    64                 305
CART      0.998    24                 0
TreeNet   0.999    14                 0



Figure 3. Classification accuracy for probe


Table 4. Summary of classification accuracy for DoS.

Curve     Area     False Positives    False Negatives
MARS      0.945    185                169
CART      0.998    1                  16
TreeNet   0.998    3                  9



Figure 4. Classification accuracy for DoS



Table 5. Summary of classification accuracy for U2Su.

Curve     Area     False Positives    False Negatives
MARS      0.700    3                  15
CART      0.720    3                  14
TreeNet   0.699    7                  16





Figure 5. Classification accuracy for U2Su



Table 6. Summary of classification accuracy for R2L.

Curve     Area     False Positives    False Negatives
MARS      0.992    17                 7
CART      0.993    15                 6
TreeNet   0.992    19                 7


Figure 6. Classification accuracy for R2L



VII. CONCLUSIONS


A number of observations and conclusions are drawn from the results reported in this paper:

- TreeNet easily achieves high detection accuracy (higher than 99%) for each of the 5 classes of DARPA data. TreeNet performed the best for normal with 18 false positives (FP) and 0 false negatives (FN), probe with 14 FP and 0 FN, and denial of service attacks (DoS) with 3 FP and 9 FN.

- CART performed the best for user to super-user (U2Su) with 3 FP and 14 FN, and remote to local (R2L) with 15 FP and 6 FN.


We demonstrate that using these fast-execution machine learning methods, we can achieve high classification accuracies in a fraction of the time required by the well-known support vector machines and artificial neural networks.


We note, however, that the differences in accuracy figures tend to be small and may not be statistically significant, especially in view of the fact that the 5 classes of patterns differ tremendously in their sizes. More definitive conclusions can perhaps only be drawn after analyzing more comprehensive sets of network data.

ACKNOWLEDGEMENTS


Partial support for this research received from ICASA (Institute for Complex Additive Systems Analysis, a division of New Mexico Tech), a DoD IASP grant, and an NSF SFS Capacity Building grant is gratefully acknowledged.


REFERENCES


1. S. Mukkamala, G. Janowski, A. H. Sung, Intrusion Detection Using Neural Networks and Support Vector Machines. Proceedings of IEEE International Joint Conference on Neural Networks 2002, IEEE Press, pp. 1702-1707, 2002.

2. M. Fugate, J. R. Gattiker, Computer Intrusion Detection with Classification and Anomaly Detection, Using SVMs. International Journal of Pattern Recognition and Artificial Intelligence, Vol. 17(3), pp. 441-458, 2003.

3. W. Hu, Y. Liao, V. R. Vemuri, Robust Support Vector Machines for Anomaly Detection in Computer Security. International Conference on Machine Learning, pp. 168-174, 2003.

4. K. A. Heller, K. M. Svore, A. D. Keromytis, S. J. Stolfo, One Class Support Vector Machines for Detecting Anomalous Windows Registry Accesses. Proceedings of IEEE Conference Data Mining Workshop on Data Mining for Computer Security, 2003.

5. A. Lazarevic, L. Ertoz, A. Ozgur, J. Srivastava, V. Kumar, A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection. Proceedings of Third SIAM Conference on Data Mining, 2003.

6. S. Mukkamala, A. H. Sung, Feature Selection for Intrusion Detection Using Neural Networks and Support Vector Machines. Journal of the Transportation Research Board of the National Academies, Transportation Research Record No. 1822, pp. 33-39, 2003.

7. S. J. Stolfo, F. Wei, W. Lee, A. Prodromidis, P. K. Chan, Cost-based Modeling and Evaluation for Data Mining with Application to Fraud and Intrusion Detection: Results from the JAM Project, 1999.

8. S. Mukkamala, B. Ribeiro, A. H. Sung, Model Selection for Kernel Based Intrusion Detection Systems. Proceedings of International Conference on Adaptive and Natural Computing Algorithms (ICANNGA), Springer-Verlag, pp. 458-461, 2005.

9. T. Hastie, R. Tibshirani, J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

10. L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees. Wadsworth and Brooks/Cole Advanced Books and Software, 1986.

11. Salford Systems. TreeNet, CART, MARS Manual.

12. R. J. Lewis. An Introduction to Classification and Regression Tree (CART) Analysis. Annual Meeting of the Society for Academic Emergency Medicine, 2000.

13. J. H. Friedman, Stochastic Gradient Boosting. Journal of Computational Statistics and Data Analysis, Elsevier Science, Vol. 38, pp. 367-378, 2002.

14. K. Kendall, A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems. Master's Thesis, Massachusetts Institute of Technology (MIT), 1998.

15. S. E. Webster, The Development and Analysis of Intrusion Detection Algorithms. Master's Thesis, MIT, 1998.

16. W. Lee, S. J. Stolfo, A Framework for Constructing Features and Models for Intrusion Detection Systems. ACM Transactions on Information and System Security, Vol. 3, pp. 227-261, 2000.