An Effective Similarity Metric for Application Traffic Classification


DPNM, POSTECH










NOMS 2010

Jae Yoon Chung¹, Byungchul Park¹, Young J. Won¹, John Strassner², and James W. Hong¹,²



{dejavu94, fates, yjwon, johns, jwkhong}@postech.ac.kr


April 20, 2010


¹ Dept. of Computer Science and Engineering, POSTECH, Korea

² Division of IT Convergence Engineering, POSTECH, Korea



Contents


Introduction



Related Work



Research Goal



Proposed Methodology



Evaluation



Conclusion and Future Work


Introduction


Traffic classification for network management

Network planning

QoS management

Security

Etc.



Diversity of today’s Internet traffic

New types of network applications

Increase of P2P traffic

Various techniques for avoiding detection



Document classification


Traffic classification

Document classification in natural language processing

Comparing packet payload vectors is analogous to document classification




Related Work


Well-known port-based classification

Low complexity

Low accuracy (approximately 50~70%)



Signature-based classification

High reliability

Searching for signatures is an exhaustive task

E.g., Snort, LASER



Behavior-based classification

Focuses on traffic patterns and connection behaviors

Questionable accuracy

E.g., BLINC



Machine learning-based classification

Utilizes statistical information

Consumes a large amount of computing resources

E.g., SVM, Bayesian network



Similarity-based classification

Utilizes a document classification approach

Questionable scalability

E.g., flow similarity calculation [IPOM '09]


Summary of IPOM 2009


Proposed a new traffic classification approach

Utilizes a document classification approach based on Cosine similarity calculation

Proposes a new packet representation using the Vector Space Model

Proposes a flow similarity calculation that compares the packets in a flow sequentially

Validated the methodology using real-world traffic on our campus backbone network

Cannot classify flows in an asymmetric routing environment

No comparison between Cosine similarity and other similarity metrics

Cosine similarity is a common similarity metric for classifying human-written documents

The similarity value varies widely with term frequency




Research Goals


Propose a new traffic classification algorithm

Automation of the signature generation step

Generate application vectors, which serve as alternative signatures, using simple vector operations

Group traffic within a single application according to traffic type and operation

Accurate and feasible traffic classification algorithm

Classify application traffic using similarity calculation

Solve the asymmetric routing classification problem

Validation using real-world network traffic to compare similarity metrics

Complexity analysis

Compare three similarity metrics for traffic classification

Jaccard similarity: counts fragments of the signature

Cosine similarity: applies a high weighting scheme to the signature

RBF similarity: uses the Euclidean distance between packets







Proposed Methodology


Vector Space Modeling


Vector Space Modeling

An algebraic model representing text documents as vectors

Widely used for document classification

Categorizes electronic documents based on their content (e.g., e-mail spam filtering)

Document classification vs. traffic classification

Document classification

Find documents among stored text documents that satisfy certain information queries

Traffic classification

Classify network traffic by application type based on traffic information





Payload Vector Conversion (1/2)


Definition of word in payload

Payload data within an i-byte sliding window

$|\text{Word set}| = 2^{8 \times \text{sliding window size}}$

Definition of payload vector

A term-frequency vector, as in NLP

$\text{Payload Vector} = [w_1\ w_2\ \cdots\ w_n]^T$


Payload Vector Conversion (2/2)

[Figure: words extracted from a payload]

The word size is 2 bytes and the word set size is $2^{16}$

The simplest case for representing the order of content in payloads
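
The conversion described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `payload_vector`, the sparse `Counter` representation, and the one-byte step of the sliding window are assumptions made for the example.

```python
from collections import Counter


def payload_vector(payload: bytes, window: int = 2) -> Counter:
    """Return a sparse term-frequency vector for one packet payload.

    Each "word" is the byte string seen through a `window`-byte sliding
    window. The window advances one byte at a time (an assumption), so
    consecutive words overlap and keep some of the byte order.
    """
    return Counter(
        payload[i:i + window] for i in range(len(payload) - window + 1)
    )


# Number of possible words for a 2-byte window: 2^(8*2) = 65536.
word_set_size = 2 ** (8 * 2)

# Hypothetical payload used only for illustration.
vec = payload_vector(b"abab")
```

Storing only the words that actually occur avoids allocating all 2^16 dimensions per packet; a payload shorter than the window simply yields an empty vector.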


Similarity Metrics for Traffic Classification


Jaccard similarity

The size of the intersection of the sample sets X and Y divided by the size of the union of the sample sets X and Y

Cosine similarity

Compares two n-dimensional vectors X and Y by finding the cosine of the angle between them

RBF similarity

Radial basis function of the Euclidean distance between two vectors X and Y
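
The three metrics can be sketched over sparse term-frequency vectors as shown below. The formulas from the original slide are not preserved in this text, so the code uses the standard textbook definitions. Treating the Jaccard sample sets as the sets of words present in each vector, and using a Gaussian RBF with a width parameter `sigma`, are assumptions of this sketch.

```python
import math
from collections import Counter


def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y| over the sets of words present in each vector."""
    a, b = set(x), set(y)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(x, y):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(
        sum(v * v for v in y.values())
    )
    return dot / norm if norm else 0.0


def rbf(x, y, sigma=1.0):
    """Gaussian radial basis function of the Euclidean distance.

    `sigma` is an assumed width parameter; the slides only state that
    the RBF parameter is hard to choose.
    """
    sq_dist = sum(
        (x.get(w, 0) - y.get(w, 0)) ** 2 for w in x.keys() | y.keys()
    )
    return math.exp(-sq_dist / (2 * sigma ** 2))


# Hypothetical vectors used only for illustration.
x = Counter({b"ab": 2, b"ba": 1})
y = Counter({b"ab": 1, b"cd": 1})
scores = {"jaccard": jaccard(x, y), "cosine": cosine(x, y), "rbf": rbf(x, y)}
```

For non-negative term-frequency vectors all three functions return a value between 0 and 1, and identical vectors score 1 (up to floating-point rounding).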


Application Vector Heuristics


Application vector

Represents the typical packets generated by a target application as the center (basis) of each cluster

Application vector generator

Reads packets from the target application trace

Divides the packets into several types of clusters without any pre-processing

[Figure: Application trace, Application vector generator, Application vectors 1-3, Traffic clusters 1-2]


Application Vector Generation


Unsupervised grouping within single-application traffic

Provides fine-grained classification

Classifies single-application traffic according to traffic types

[Figure: packets 1-6 of the application traffic grouped into Cluster 1 and Cluster 2, represented by Application vector 1 and Application vector 2]
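
The slides do not spell out the grouping algorithm, so the following is only one plausible sketch of an application vector generator. It assumes a simple leader-style clustering: each packet joins the most similar existing cluster if the similarity reaches a threshold, otherwise it starts a new cluster, and each application vector is the mean of its cluster. The function names, the use of Jaccard similarity, and the `threshold` value are all hypothetical.

```python
from collections import Counter


def jaccard(x, y):
    """Jaccard similarity over the sets of words present in each vector."""
    a, b = set(x), set(y)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def generate_application_vectors(payload_vectors, threshold=0.5):
    """Group payload vectors and return one centroid per cluster.

    Assumed procedure (not taken from the slides): greedy, single-pass
    clustering with a fixed similarity threshold.
    """
    clusters = []  # each cluster keeps a running word-count sum and a size
    for vec in payload_vectors:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(vec, cluster["sum"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"sum": Counter(vec), "count": 1})
        else:
            best["sum"].update(vec)  # Counter.update adds the counts
            best["count"] += 1
    # The application vector is the cluster center (mean term frequency).
    return [
        {word: total / c["count"] for word, total in c["sum"].items()}
        for c in clusters
    ]


# Hypothetical payload vectors for a single application.
packets = [
    Counter({b"ab": 1, b"bc": 1}),
    Counter({b"ab": 1, b"bc": 1, b"cd": 1}),
    Counter({b"xy": 1, b"yz": 1}),
]
app_vectors = generate_application_vectors(packets)
```

Here the first two packets fall into one cluster and the third starts a second one, which mirrors the idea of one application producing several traffic types.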


Two-stage Traffic Classification

Packet level clustering

Classify single packets regardless of flow information

Compare payload vectors with application vectors by calculating similarity values

Mark each packet with its application and priority

Allow permutation of the packet sequence

Flow level classification

Rearrange packets according to flow information

Ignore mis-clustered packets that are caused by protocol ambiguities

HTTP for Web

HTTP for P2P




Two-stage Traffic Classification

[Figure: two-stage classification example. Stage 1 assigns the packets of Flow 1 and Flow 2 (F1 P1-P4, F2 P1-P4) in backbone traffic to Clusters 1-3 by comparing them with Application Vectors 1-3 (BitTorrent, FileGuri, Melon). Stage 2 regroups the packets by flow into BitTorrent traffic and FileGuri traffic; some packets are marked as mis-clustered.]
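
The two stages can be sketched as follows. This is an illustrative reading of the slides, not the authors' code: the function names, the `threshold`, the use of Jaccard similarity, and the rule that a flow takes the application with the largest summed similarity (standing in for the "priority" mark) are assumptions.

```python
from collections import Counter, defaultdict


def jaccard(x, y):
    """Jaccard similarity over the sets of words present in each vector."""
    a, b = set(x), set(y)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_packets(packets, app_vectors, threshold=0.3):
    """Stage 1: label every packet on its own, ignoring flow membership.

    `packets` is a list of (flow_id, payload_vector) pairs and
    `app_vectors` is a non-empty dict mapping application name to vector.
    """
    labelled = []
    for flow_id, vec in packets:
        app, sim = max(
            ((name, jaccard(vec, av)) for name, av in app_vectors.items()),
            key=lambda pair: pair[1],
        )
        labelled.append((flow_id, app if sim >= threshold else None, sim))
    return labelled


def classify_flows(labelled):
    """Stage 2: regroup packets by flow and pick the dominant application.

    Packets that matched a minority application are outvoted, which is
    one way to ignore mis-clustered packets.
    """
    scores = defaultdict(Counter)
    for flow_id, app, sim in labelled:
        if app is not None:
            scores[flow_id][app] += sim
    return {fid: s.most_common(1)[0][0] for fid, s in scores.items()}


# Hypothetical application vectors and packets for illustration only.
app_vectors = {
    "BitTorrent": Counter({b"Bi": 1, b"it": 1}),
    "FileGuri": Counter({b"Fi": 1, b"il": 1}),
}
packets = [
    ("f1", Counter({b"Bi": 1, b"it": 1})),
    ("f1", Counter({b"Bi": 1, b"zz": 1})),
    ("f1", Counter({b"Fi": 1, b"il": 1})),  # ambiguous packet in flow f1
    ("f2", Counter({b"Fi": 1, b"il": 1})),
]
flow_labels = classify_flows(classify_packets(packets, app_vectors))
```

Because stage 1 never looks at packet order or direction, the result does not depend on seeing both halves of a connection, which is the property the slides rely on for asymmetric routing.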






Evaluation


Classifying Real-world Traffic

Fixed-port Applications

Traffic trace captured at one of the two Internet junctions at POSTECH using an optical tap

Ground-truth traffic

Some active flows among the application traffic, distinguished by their use of an active port number

Target Applications

FileGuri, ClubBox, Melon, BigFile

Untraceable-port Applications

Traffic Measurement Agent (TMA)

Monitors the network interface of the host

Records log data (flow five-tuples, process name, packet count, etc.)

Target Applications

eMule, BitTorrent



[Figure: Backbone Traffic, Target Application Traffic, Ground-truth Traffic]


Classification Accuracy


Classification accuracy comparison

Fixed-port applications

FileGuri, ClubBox, Melon, BigFile

Untraceable-port applications

eMule, BitTorrent

Jaccard similarity

Reliable; counts common segments

Cosine similarity

Emphasizes common segments; cannot distinguish ambiguous packets

RBF similarity

Parameter is difficult to set; a guideline for setting it is needed

BitTorrent traffic on the backbone network

Traffic over-classification by Cosine similarity

High false positive rate of Cosine similarity
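
A small worked example may help show why Cosine similarity can over-classify. It is an illustration under assumed, hypothetical payloads, not data from the evaluation: two packets share only a run of zero padding, yet the high term frequency of the padding word dominates the cosine value, while Jaccard similarity counts that word once.

```python
import math
from collections import Counter


def payload_vector(payload, window=2):
    """Sparse term-frequency vector over a one-byte-step sliding window."""
    return Counter(
        payload[i:i + window] for i in range(len(payload) - window + 1)
    )


def jaccard(x, y):
    a, b = set(x), set(y)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(x, y):
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(
        sum(v * v for v in y.values())
    )
    return dot / norm if norm else 0.0


# Two hypothetical packets: identical padding, different trailing content.
a = payload_vector(b"\x00" * 10 + b"AB")
b = payload_vector(b"\x00" * 10 + b"CD")

cos_ab = cosine(a, b)   # 81/83, about 0.976: dominated by the padding word
jac_ab = jaccard(a, b)  # 1/5 = 0.2: only one of five distinct words is shared
```

With any reasonable threshold the cosine score would mark these packets as the same application, which is consistent with the false positives reported above.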


Histogram of Similarity Values


CDF of Distance among Payload Vectors


Complexity Analysis


Conclusion and Future Work


Developed a new traffic classification approach

Applies a document classification approach to traffic classification

Unsupervised classification to form clusters within single-application traffic

Two-stage classification algorithm to solve the asymmetric routing classification problem

Linear time complexity

Compared three similarity metrics

Provides a guideline for selecting similarity metrics for traffic classification

Provides soft classification that represents similarity as a numerical value ranging from 0 to 1

Future Work

Enhance the unsupervised classification methodology for automated signature generation

Extract orthogonal application vectors to improve scalability
