Exploiting Clustering Techniques

savagelizardAI and Robotics

Nov 25, 2013 (4 years and 1 month ago)

50 views

Exploiting Clustering Techniques

for Web Session Inference

A.Bianco, G. Mardente, M. Mellia, M.Munafò, L. Muscariello

(Politecnico di Torino)

Outline


Web Session Model


Clustering techniques


The proposed algorithm


Performance of the algorithm


Session statistics

Web session definition



A single web client generates a succession of TCP flows and
think times

think time T
off

think time T
off



A session here is defined as the set of TCP flows arriving
close enough one to each other



For example a threshold can be used to discriminate between
think times and inter arrivals of TCP flows


Algorithms


A threshold based approach needs a priori
knowledge of the source


An adaptive algorithm should be capable to
catch traffic variations


This is supposed to be less sensitive to
traffic characteristics


Clustering is the chosen approach

Proposed algorithm


Three steps


A
K
-
means

is used on all samples to obtain a
first clustering, K is chosen very large


A
hierarchical clustering

is used only on
representatives of each cluster, K is reduced


A
K
-
means

is used on all samples again


To test the algorithm we need a priori
known traffic, that is artificially generated

First Step: K
-
means


K is chosen large enough but significantly smaller than the
number of samples


The K farthest flows determine the first partition


K
-
means is performed 1000 iterations on all samples


Each cluster is then represented using a subset of samples,
one or two in our algorithm


The mean value (Centroid method)


The gth and (100
-
g)th percentiles (Single linkage method if g=0)


g
-
th

percentile


(100
-
g
)
-
th

percentile

Second step: a
hierarchical

method


A hiera
r
chical method is used on only representatives


This method merges clusters until a quality function
determines that the optimal number of cluster
s

Nc has
been found


Gamma function typical behaviour

-
10


0


10


20


30


40


50


60


70


0


200


400


600


800


1000


1200


1400

gamma

Step

Third Step: K
-
means


A K
-
means
is

performed

on all samples


This last step is not critical but rearranges
samples


positions within cluster
s

that is
flows within sessions


It is not CPU time consuming
, than it is not
critical to use it

Performance evaluation


Artificial traffic is generated according to an
ON/OFF process


During ON periods a succession of flows is
generated using i.i.d. inter
-
arrivals


In this model inferring is to recognize if an inter
arrival is an OFF period or an inter arrival between
flows within an ON period


Every time the algorithm does not guess correctly, an
error is counted


Suppose all variables are exponentially distributed

First step sensitivity (1/2)


If the initial number of clusters is chosen large
enough the method is less error prone


The algorithm is much more sensitive to the value
of the idle period


0.01


0.1


1


10


0


200


400


600


800


1000


1200


1400


1600


1800


2000

Percentage of errors

T_{off}

K=1000

K=1500

K=2000

K=2500

First step sensitivity (2/2)


Performance is sensitive to the choice of the percentile g


When clusters are represented through flows at the border of

the
session the method

is less sensitive to

traffic
, i.e. g=1


This is due to the fact


that cluster has a long


and narrow

shape and


those representatives


well model this fact



0.01


0.1


1


10


0


200


400


600


800


1000


1200


1400


1600


1800


2000

Percentage of errors

T_{off}

Single linkage

Centroid Method

g=1

g=5


0.01


0.1


1


10


0


200


400


600


800


1000


1200


1400


1600


1800


2000

T_{off}

g=15

g=25

g=35

g=45

Comparison with threshold based
algorithms


exponential case


0.1


1


10


0


200


400


600


800


1000


1200


1400


1600


1800


2000

Percentage of errors

T_{off}

clustering

etha=T_{off}/2

etha=T_{off}/4

etha=T_{off}/8


0.1


1


10


0


200


400


600


800


1000


1200


1400


1600


1800


2000

T_{off}

etha=T_{off}/16

etha=T_{off}/32

etha=T_{off}/64

etha=T_{off}/128


Threshold based algorithms work well if traffic characteristics are
known


But they are very sensitive to the threshold value


If

sessions are already


well clustered because


idle periods are large


enough compared to


flow’s inter arrivals
,


our algorithm is very


good

Comparison with threshold based
algorithms


Pareto

case


0.1


1


10


0


200


400


600


800


1000


1200


1400


1600


1800


2000

Percentage of errors

T_{off}

clustering

etha=T_{off}/2

etha=T_{off}/4

etha=T_{off}/8


0.1


1


10


0


200


400


600


800


1000


1200


1400


1600


1800


2000

T_{off}

etha=T_{off}/16

etha=T_{off}/32

etha=T_{off}/64

etha=T_{off}/128


Threshold based algorithms work well if traffic characteristics are
known


But they are very sensitive to the threshold value


If

sessions are already


well clustered because


idle periods are large


enough compared to


flow’s inter arrivals
,


our algorithm is very


good

Some statistics on
aggregated
sessions


0


0.05


0.1


0.15


0.2


0.25


0.3


1


10


100


1000


10000

PDF

Number of TCP connections per session


1e
-
005


0.0001


0.001


0.01


0.1


1


100


1000


10000

Number of TCP connections per session

Compl. CDF


0


0.01


0.02


0.03


0.04


0.05


0.06


1


10


100

PDF

Session Length [s]

First SYN
-
> Last TCP Tear
-
Down

First SYN
-
> Last Data Segment


0.0001


0.001


0.01


0.1


1


100


1000


10000

Session Length [s]

Compl. CDF


The session sizes are heavy tailed (broadly)


Usually each session is made of a few TCP flows


Flow termination definition is not that important

Some statistics on
aggregated
sessions


0


0.005


0.01


0.015


0.02


0.025


0.03


100


1000


10000


100000


1e+006

PDF

Session data [bytes]

Server
-
> Client

Client
-
> Server


1e
-
005


0.0001


0.001


0.01


0.1


1


10000


100000


1e+006


1e+007

Session data [bytes]

Compl. CDF


Similar results concerning server to client and
client to server data


Similar distribution law, asymetries on volume
only



Flow’s and session’s inter
-
arrivals


0


0.1


0.2


0.3


0.4


0.5


0.6


0.7


0.8


0.9


1


0.1


1


10


100


1000


10000

CDF

Time [s]

Apr.04 T_{off}

Oct.02 T_{off}

Apr.04 T_{arr}

Oct.02 T_{arr}


The method infers session which are similar
even when considering very different traces


Tarr and Toff are well identified

Conclusions


Clustering techniques could be easily used
to infer web
-
session


The p
ro
posed algorithm is a mix a known
clustering approaches


It is able to deal with huge amount of data


Sessions seems to be very well recognized