Identification of Network Applications based on Machine Learning Techniques


Centre de Comunicacions Avançades de Banda Ampla (CCABA)
Universitat Politècnica de Catalunya (UPC)




COST-TMA Meeting, Samos 2008


Valentín Carela-Español, Pere Barlet-Ros, Josep Solé-Pareta


{vcarela, pbarlet, pareta}@ac.upc.edu

Outline


• Scenario and objectives
• Existing solutions
  • Well-known ports
  • Payload based (pattern matching)
  • Machine Learning
    • Supervised
    • Unsupervised
• Proposed method
• Results
• Conclusions and Future work

Scenario and objectives



• Scenario: SMARTxAC, the Traffic Monitoring and Analysis System for the Anella Científica
  • Real-time classification
  • Independent from packet contents
  • High-speed link
• Objectives:
  • Develop an ML technique to identify applications in SMARTxAC
  • Automate the ML training phase
  • Adapt our solution to Netflow
  • Study how sampling affects the identification


Existing Solutions


• Well-known ports
  + Computationally lightweight
  - Very low accuracy
• Payload based (pattern matching)
  + High accuracy
  - Packet contents are required
  - Computationally expensive
  - Content encryption
  - Privacy legislation
• Consequence: neither is a feasible solution
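
The trade-off of the well-known ports approach can be illustrated with a minimal sketch. The port table and the function name `classify_by_port` are hypothetical and chosen only for illustration; a real deployment would use a far larger list. The lookup is a constant-time dictionary access, which is why the method is lightweight, but any application running on a port missing from the table, or reusing a registered port, is missed or mislabeled.

```python
# Hypothetical port-to-application table, for illustration only.
WELL_KNOWN_PORTS = {
    20: "FTP", 21: "FTP",
    25: "Email", 110: "Email", 143: "Email",
    53: "DNS",
    80: "HTTP",
}

def classify_by_port(sport: int, dport: int) -> str:
    """Return the application registered for either port, else 'Unknown'."""
    for port in (dport, sport):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "Unknown"

print(classify_by_port(51234, 80))    # matched through the destination port
print(classify_by_port(51234, 6881))  # port not in the table: stays Unknown
print(classify_by_port(53, 40000))    # matched through the source port
```

Note that the first call returns the same label whatever the payload actually is, which is one plausible source of the low accuracy listed above.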



Existing Solutions


• Machine Learning Techniques
  - Difficult training phase
  + Packet contents are not required
  + High accuracy
  + Computationally viable
• Two main possibilities:
  • Supervised methods:
    + Better accuracy for expected classes
    - Need a complete pre-labeled dataset
    - Difficult to detect when retraining is needed
    - No detection of new classes
  • Unsupervised methods:
    + Do not need a fully labeled dataset
    + Automatic detection of new classes
    + Better accuracy for new classes

Proposed method


• Supervised identification based on the C4.5 algorithm
  • Developed by Ross Quinlan as an extension of ID3
  • Based on the construction of a classification tree
• Training set
  • Actual traffic flows
  • Pairs <flow features, application>
    • Feature vector contains relevant characteristics of traffic flows
    • Application is identified using L7-filter
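
C4.5 picks the attribute to split on by gain ratio, that is, information gain divided by the split information. The following is a minimal sketch of that criterion for discrete attributes, not the authors' implementation. The toy flows, the function names, and the `push_seen` attribute are made up for illustration; C4.5 also handles numeric attributes through thresholds, which is omitted here.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())

def gain_ratio(values, labels):
    """Gain ratio obtained by splitting `labels` on the discrete attribute `values`."""
    total = len(labels)
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the partition sizes
    return info_gain / split_info if split_info > 0 else 0.0

# Toy training pairs <flow features, application> (made-up values).
dport = [80, 80, 53, 53, 80, 53]
push_seen = ["yes", "no", "no", "no", "yes", "yes"]
labels = ["HTTP", "HTTP", "DNS", "DNS", "HTTP", "DNS"]

print("dport:", gain_ratio(dport, labels))
print("push_seen:", gain_ratio(push_seen, labels))
```

In this toy set `dport` separates the two classes perfectly, so it gets the higher gain ratio and would be chosen as the root of the tree.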

Machine Learning process

1) Collection of the training set
   • Representative flows of the environment to be monitored
2) Automatic flow classification by application class
   • Pattern matching using L7-filter
   • It can be simplified if an artificial training set is used in 1)
3) Feature extraction from the training flows
4) Construction of a C4.5 classification tree
   • E.g. using Weka
5) Deployment of the tree obtained in 4) in the monitoring system
6) Retraining of the system
   • Starting from phase 1)
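
Steps 2) and 3) produce the <flow features, application> pairs that step 4) learns from. Here is a minimal sketch of the feature extraction for one flow. The feature names come from the slides, but their exact definitions are assumptions (for example, "out" is taken to mean packets sent by the flow initiator). The `Packet` record is hypothetical, and the label is hard-coded where the payload-based labeling of step 2) would supply it.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Packet:
    outgoing: bool  # True if sent by the flow initiator (assumed convention)
    size: int       # packet size in bytes
    push: bool      # TCP PSH flag set

def extract_features(sport: int, dport: int, packets: List[Packet]) -> dict:
    """Build a feature vector for one flow from its packets."""
    out_sizes = [p.size for p in packets if p.outgoing]
    in_sizes = [p.size for p in packets if not p.outgoing]
    return {
        "sport": sport,
        "dport": dport,
        "bytes_out": sum(out_sizes),
        "avg_out_size": sum(out_sizes) / len(out_sizes) if out_sizes else 0.0,
        "avg_in_size": sum(in_sizes) / len(in_sizes) if in_sizes else 0.0,
        "push_in": sum(1 for p in packets if p.push and not p.outgoing),
    }

packets = [
    Packet(outgoing=True, size=100, push=True),
    Packet(outgoing=False, size=1500, push=False),
    Packet(outgoing=False, size=500, push=True),
    Packet(outgoing=True, size=60, push=False),
]
features = extract_features(51234, 80, packets)
training_pair = (features, "HTTP")  # label would come from step 2)
print(training_pair)
```

A list of such pairs is what a tree learner, for instance Weka's C4.5 implementation mentioned in step 4), would take as its training set.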


Accuracy

[Figure: "Flow Accuracy" chart, axis ticks 0 to 100, by application group: Unknown, P2P, HTTP, VoIP, Network, DNS, FTP, Email, Streaming, Others]

Netflow Accuracy

[Figure: "Flow Accuracy" chart, axis ticks 0 to 100, by application group: Unknown, P2P, HTTP, VoIP, Network, DNS, FTP, Email, Streaming, Others]

Accuracy

[Figures: "Bytes Accuracy" and "Pkts Accuracy" charts, axis ticks 0 to 100]

[Figures: "Netflow Pkts Accuracy" and "Netflow Bytes Accuracy" charts, axis ticks 0 to 100]
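
The charts report accuracy per flow, per packet, and per byte. The slides do not define these metrics, so the sketch below assumes the usual reading: the share of flows, packets, or bytes that belong to correctly classified flows. The flow records and the helper `accuracy_by` are made up for illustration.

```python
def accuracy_by(flows, weight):
    """Percentage of the total `weight` carried by correctly classified flows."""
    total = sum(weight(f) for f in flows)
    correct = sum(weight(f) for f in flows if f["predicted"] == f["actual"])
    return 100.0 * correct / total if total else 0.0

# Made-up classified flows: one large P2P flow is mislabeled as HTTP.
flows = [
    {"actual": "HTTP", "predicted": "HTTP", "pkts": 10, "bytes": 8000},
    {"actual": "P2P", "predicted": "HTTP", "pkts": 100, "bytes": 150000},
    {"actual": "DNS", "predicted": "DNS", "pkts": 2, "bytes": 200},
    {"actual": "DNS", "predicted": "DNS", "pkts": 2, "bytes": 180},
]

flow_acc = accuracy_by(flows, lambda f: 1)
pkt_acc = accuracy_by(flows, lambda f: f["pkts"])
byte_acc = accuracy_by(flows, lambda f: f["bytes"])
print(flow_acc, pkt_acc, byte_acc)
```

Under this assumption a single misclassified large flow barely changes flow accuracy but dominates packet and byte accuracy, which is why the three metrics are worth reporting separately.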

Features Accuracy



• Best Normal Feature Subset: dport, bytes_out, avg_out_size, sport, avg_in_size, push_in
• Best Netflow Feature Subset: dport, bytes, push

[Figure: "Accuracy vs # Features". Y axis: Accuracy (0% to 100%); X axis: # Features (25, 15, 6, 5, 4, 3, 2, 1); series: Normal, Netflow, Well-known ports]
How does sampling affect accuracy?

[Figure: "Accuracy vs Sampling". Y axis: % Accuracy (0 to 100); X axis: sampling rate (1, 0.9, 0.75, 0.5, 0.25, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001); series: Flows, Netflow Flows, Pkts, Netflow Pkts, Bytes, Netflow Bytes]
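
One way to see why sampling hurts is to look at what it does to the features of a single flow. The sketch below assumes independent random packet sampling, where each packet is kept with probability equal to the sampling rate; the slides do not say which sampling scheme was used. The flow and the helper names are made up for illustration.

```python
import random

def sample_packets(packet_sizes, rate, rng):
    """Keep each packet independently with probability `rate`."""
    return [size for size in packet_sizes if rng.random() < rate]

def flow_features(packet_sizes):
    """Packet count, byte count, and average packet size of a flow."""
    count = len(packet_sizes)
    total = sum(packet_sizes)
    return {"pkts": count, "bytes": total, "avg_size": total / count if count else 0.0}

rng = random.Random(42)             # fixed seed so runs are repeatable
flow = [1500] * 50 + [40] * 50      # made-up flow: 50 large and 50 small packets

for rate in (1.0, 0.5, 0.1, 0.01):
    sampled = sample_packets(flow, rate, rng)
    print(rate, flow_features(sampled))
```

At low rates the observed packet and byte counts shrink and many flows are not seen at all, so a tree trained on unsampled features receives inputs outside the ranges it learned.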

Conclusions and Future Work



• Machine learning techniques are a good solution for identifying applications
• Identification in sampled scenarios is still a very open problem
• Future work:
  • Find a more accurate automatic system to label the dataset
  • Build early decision trees to identify the flow as soon as possible
  • Find features that achieve higher accuracy and are more resilient to sampling
  • Test with traces from other networks to check the generality of the solution


Thank you for your attention




Questions?