# Exploiting Clustering Techniques

Nov 25, 2013

Exploiting Clustering Techniques for Web Session Inference

A. Bianco, G. Mardente, M. Mellia, M. Munafò, L. Muscariello

(Politecnico di Torino)

Outline

Web Session Model

Clustering techniques

The proposed algorithm

Performance of the algorithm

Session statistics

Web session definition

A single web client generates a succession of TCP flows and
think times

[Figure: timeline of TCP flows separated by think times T_off]

A session is defined here as the set of TCP flows that arrive
close enough to one another

For example, a threshold can be used to discriminate between
think times and inter-arrivals of TCP flows
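The threshold idea above can be sketched in a few lines. This is only an illustration: the function name `split_sessions`, the sample arrival times and the 5-second threshold are assumptions, not values from the slides.

```python
def split_sessions(arrival_times, threshold):
    """Group flow arrival times into sessions with a fixed threshold.

    A gap larger than `threshold` between consecutive flows is treated
    as a think time and starts a new session.
    """
    sessions = []
    for t in sorted(arrival_times):
        if sessions and t - sessions[-1][-1] <= threshold:
            sessions[-1].append(t)   # still inside the current session
        else:
            sessions.append([t])     # gap too long: open a new session
    return sessions


# Hypothetical flow arrival times of one client, in seconds.
flows = [0.0, 0.4, 0.9, 30.0, 30.2, 95.0]
sessions = split_sessions(flows, threshold=5.0)
```

The result depends entirely on the chosen threshold, which is the weakness the following slides address.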

Algorithms

A threshold based approach needs a priori
knowledge of the source

An adaptive algorithm should be able to
track traffic variations

It is expected to be less sensitive to
traffic characteristics

Clustering is the chosen approach

Proposed algorithm

Three steps

A K-means is used on all samples to obtain a
first clustering; K is chosen very large

A hierarchical clustering is used only on
representatives of each cluster; K is reduced

A K-means is used on all samples again

To test the algorithm we need traffic whose sessions are
known a priori, so it is artificially generated

First Step: K-means

K is chosen large enough but significantly smaller than the
number of samples

The K farthest flows determine the first partition

K-means is run for 1000 iterations on all samples

Each cluster is then represented using a subset of samples,
one or two in our algorithm

The mean value (Centroid method)

The g-th and (100-g)-th percentiles (Single linkage method if g=0)

[Figure: cluster representatives at the g-th percentile and the (100-g)-th percentile]
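A minimal sketch of this first step, under several assumptions that the slides do not confirm: samples are one-dimensional flow arrival times, "the K farthest flows" is read as a greedy farthest-first choice of initial centres, and percentiles use a nearest-rank rule. All function names are hypothetical.

```python
def farthest_first(samples, k):
    """Pick k initial centres that are mutually far apart (greedy)."""
    centers = [samples[0]]
    dist = [abs(x - centers[0]) for x in samples]
    while len(centers) < k:
        i = max(range(len(samples)), key=dist.__getitem__)
        centers.append(samples[i])
        # Keep, for each sample, the distance to its nearest chosen centre.
        dist = [min(d, abs(x - samples[i])) for d, x in zip(dist, samples)]
    return centers


def kmeans_1d(samples, centers, max_iter=1000):
    """Plain K-means on scalars; stops early once centres stop moving."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for x in samples:
            j = min(range(len(centers)), key=lambda c: abs(x - centers[c]))
            clusters[j].append(x)
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    # clusters holds the assignment computed in the last iteration.
    return centers, clusters


def representatives(cluster, g=None):
    """Centroid if g is None, else the g-th and (100-g)-th percentiles."""
    if g is None:
        return [sum(cluster) / len(cluster)]
    ordered = sorted(cluster)
    n = len(ordered)
    low = ordered[round(g / 100 * (n - 1))]
    high = ordered[round((100 - g) / 100 * (n - 1))]
    return [low, high]


# Hypothetical arrival times; K is tiny here only to keep the example short.
samples = [0.0, 0.5, 1.0, 50.0, 50.5, 100.0, 100.4, 101.0]
centers, clusters = kmeans_1d(samples, farthest_first(samples, 3))
reps = [representatives(c, g=0) for c in clusters if c]
```

With g=0 the two representatives are the extreme samples of each cluster, which is why that setting corresponds to single linkage.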

Second step: a hierarchical method

A hierarchical method is used on representatives only

This method merges clusters until a quality function
determines that the optimal number of clusters Nc has
been found

Gamma function typical behaviour

[Figure: gamma versus merging step]
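The merging step can be illustrated as follows. The slides do not define the gamma quality function, so this sketch swaps in a stand-in stopping rule (cut at the largest ratio between consecutive merge distances); that rule, the function name `merge_clusters` and the sample intervals are all assumptions. Clusters are represented by their two border samples, as with g=0, and are assumed to lie on a one-dimensional time axis.

```python
def merge_clusters(intervals):
    """Merge neighbouring clusters given as (low, high) representatives.

    Single-linkage distance between neighbours is the gap between the
    high border of one cluster and the low border of the next.
    """
    intervals = sorted(intervals)
    gaps = [b[0] - a[1] for a, b in zip(intervals, intervals[1:])]
    if len(gaps) < 2:
        return intervals          # nothing to compare, leave as is
    order = sorted(gaps)          # agglomerative merge order in 1-D
    # Stand-in for gamma (an assumption): stop before the merge whose
    # distance jumps the most relative to the previous one.
    best = max(range(len(order) - 1),
               key=lambda i: order[i + 1] / max(order[i], 1e-12))
    cut = order[best]
    merged = [list(intervals[0])]
    for (low, high), gap in zip(intervals[1:], gaps):
        if gap <= cut:
            merged[-1][1] = high      # close enough: same session
        else:
            merged.append([low, high])
    return [tuple(m) for m in merged]


# Hypothetical border representatives of six first-step clusters.
intervals = [(0.0, 0.4), (0.6, 1.0), (1.3, 1.5),
             (40.0, 40.2), (40.5, 41.0), (90.0, 90.1)]
merged = merge_clusters(intervals)
```

Working on representatives only keeps this step cheap, since the number of first-step clusters is much smaller than the number of samples.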

Third Step: K-means

A K-means is performed on all samples

This last step is not critical, but it rearranges
the positions of samples within clusters, that is,
of flows within sessions

It consumes little CPU time, so running it
is not a burden

Performance evaluation

Artificial traffic is generated according to an
ON/OFF process

During ON periods a succession of flows is
generated using i.i.d. inter-arrivals

In this model, inference means recognizing whether an
inter-arrival is an OFF period or an inter-arrival between
flows within an ON period

Every time the algorithm does not guess correctly, an
error is counted

Suppose all variables are exponentially distributed
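A sketch of such a test setup, assuming exponential inter-arrivals and OFF periods as stated above. The number of flows per session is not given in the slides, so the rule used here, together with the names `generate_trace` and `error_percentage` and all numeric values, is a hypothetical choice.

```python
import random


def generate_trace(n_sessions, mean_arr, mean_off, mean_flows, rng):
    """Return (gap, is_off) pairs for an artificial ON/OFF source."""
    gaps = []
    for _ in range(n_sessions):
        # Flows per session: hypothetical rule, at least one flow.
        n_flows = 1 + int(rng.expovariate(1.0 / mean_flows))
        # Inter-arrivals between flows inside the ON period.
        gaps.extend((rng.expovariate(1.0 / mean_arr), False)
                    for _ in range(n_flows - 1))
        # The OFF period that closes the session.
        gaps.append((rng.expovariate(1.0 / mean_off), True))
    return gaps


def error_percentage(gaps, predict_off):
    """Percentage of gaps whose ON/OFF nature is guessed wrongly."""
    errors = sum(predict_off(gap) != is_off for gap, is_off in gaps)
    return 100.0 * errors / len(gaps)


rng = random.Random(0)
trace = generate_trace(1000, mean_arr=1.0, mean_off=100.0,
                       mean_flows=10, rng=rng)
# Any classifier can be scored; here a fixed 10-second threshold.
err = error_percentage(trace, lambda gap: gap > 10.0)
```

Because the generator labels every gap, the same scoring function can compare the clustering algorithm against threshold-based ones.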

First step sensitivity (1/2)

If the initial number of clusters is chosen large
enough the method is less error prone

The algorithm is much more sensitive to the value
of the idle period

[Figure: percentage of errors versus T_off, for K=1000, K=1500, K=2000, K=2500]

First step sensitivity (2/2)

Performance is sensitive to the choice of the percentile g

When clusters are represented through flows at the border of
the session, i.e. g=1, the method is less sensitive to traffic

This is because clusters have a long and narrow shape,
which those representatives model well

[Figure: percentage of errors versus T_off; first panel: Centroid Method, g=1, g=5; second panel: g=15, g=25, g=35, g=45]

Comparison with threshold based algorithms: exponential case

[Figure: percentage of errors versus T_off; first panel: clustering and threshold eta=T_off/2, T_off/4, T_off/8; second panel: threshold eta=T_off/16, T_off/32, T_off/64, T_off/128]

Threshold based algorithms work well if traffic characteristics are
known

But they are very sensitive to the threshold value

When samples cluster well, because idle periods are large
enough compared to flow inter-arrivals, our algorithm
performs very well
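The sensitivity to the threshold can be made concrete in the exponential case. For an exponential variable with mean m, the probability of exceeding eta is exp(-eta/m); the sketch below applies this to both kinds of mistake. The function name and the numeric means are assumptions, and `mean_arr` stands for the mean inter-arrival between flows inside a session.

```python
import math


def threshold_error_rates(eta, mean_arr, mean_off):
    """Error probabilities of a fixed threshold, exponential case.

    missed_off: an OFF period shorter than eta is taken for an
                in-session inter-arrival.
    false_off:  an in-session inter-arrival longer than eta is taken
                for an OFF period.
    """
    missed_off = 1.0 - math.exp(-eta / mean_off)
    false_off = math.exp(-eta / mean_arr)
    return missed_off, false_off


# Hypothetical means, in seconds; thresholds as in the plots above.
mean_arr, mean_off = 1.0, 100.0
rates = {d: threshold_error_rates(mean_off / d, mean_arr, mean_off)
         for d in (2, 4, 8, 16, 32, 64, 128)}
```

Lowering the threshold reduces missed OFF periods but splits more sessions, so the best value depends on both means, which must be known in advance.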

Comparison with threshold based algorithms: Pareto case

[Figure: percentage of errors versus T_off; first panel: clustering and threshold eta=T_off/2, T_off/4, T_off/8; second panel: threshold eta=T_off/16, T_off/32, T_off/64, T_off/128]

Threshold based algorithms work well if traffic characteristics are
known

But they are very sensitive to the threshold value

When samples cluster well, because idle periods are large
enough compared to flow inter-arrivals, our algorithm
performs very well

Some statistics on aggregated sessions

[Figure: PDF and Compl. CDF of the number of TCP connections per session]

[Figure: PDF and Compl. CDF of session length [s]; curves: First SYN -> Last TCP Tear-Down, First SYN -> Last Data Segment]

Session sizes are, broadly speaking, heavy tailed

Usually each session is made of a few TCP flows

The choice of flow termination definition has little effect
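Heavy tails are usually read off a complementary CDF on logarithmic axes. A minimal sketch of how such a curve can be computed from per-session values; the function name and the sample data are hypothetical.

```python
def complementary_cdf(values):
    """Return (v, P[X > v]) for each distinct value v, in increasing order."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, v in enumerate(ordered):
        if i + 1 < n and ordered[i + 1] == v:
            continue                      # keep only the last copy of v
        points.append((v, (n - i - 1) / n))
    return points


# Hypothetical numbers of TCP connections per session.
connections = [1, 1, 2, 5]
ccdf = complementary_cdf(connections)
```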

Some statistics on aggregated sessions

[Figure: PDF and Compl. CDF of session data [bytes]; curves: Server -> Client, Client -> Server]

Results are similar for server-to-client and
client-to-server data

Similar distribution law; asymmetries in volume
only

Flow and session inter-arrivals

[Figure: CDF versus Time [s]; curves: Apr.04 T_{off}, Oct.02 T_{off}, Apr.04 T_{arr}, Oct.02 T_{arr}]

The method infers similar sessions
even from very different traces

T_{arr} and T_{off} are well identified
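Once sessions are inferred, the two families of gaps can be separated as sketched below. This assumes each session is a sorted list of flow arrival times and measures T_off from the last flow arrival of one session to the first of the next; that convention, the function name and the sample data are assumptions.

```python
def inter_arrivals(sessions):
    """Split gaps into in-session (T_arr) and between-session (T_off)."""
    t_arr = [b - a for s in sessions for a, b in zip(s, s[1:])]
    t_off = [nxt[0] - cur[-1] for cur, nxt in zip(sessions, sessions[1:])]
    return t_arr, t_off


# Hypothetical inferred sessions of one client, times in seconds.
sessions = [[0.0, 0.5, 1.0], [30.0, 30.25], [95.0]]
t_arr, t_off = inter_arrivals(sessions)
```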

Conclusions

Clustering techniques can easily be used
to infer web sessions

The proposed algorithm is a mix of known
clustering approaches

It is able to deal with huge amounts of data

Sessions seem to be very well recognized