Intelligent Bayesian Network-Based Approaches


Nov 7, 2013 (3 years and 7 months ago)




Intelligent Bayesian Network-Based Approaches for Web Proxy Caching


Prepared by: Waleed Ali Ahmed & Siti Mariyam Shamsuddin

Soft Computing Research Group, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 Johor, Malaysia

waleedalodini@gmail.com, mariyam@utm.my

Outline

Introduction

Related Works

The Proposed Intelligent Web Proxy Caching Approaches

Implementation and Performance Evaluation

Conclusion and Future Works

Introduction





Background

Web caching is one of the most successful solutions for improving the performance of Web-based systems.

It is a well-known strategy that keeps Web objects likely to be used in the near future in a location closer to the user.




Why?

To decrease latency

To reduce Web server load

To reduce bandwidth usage

In Web proxy caching, popular Web objects that are likely to be revisited in the near future are stored on the proxy server, which plays a key role between users and Web sites by reducing the response time of user requests and saving network bandwidth.


Web proxy caching

Proxy servers play a key role between users and Web sites: they can reduce response time and save network bandwidth.

Proxy caching is the most common caching strategy. It is widely used by computer network administrators, technology providers, and businesses to reduce user delays and to alleviate Internet congestion (Kaya et al., 2009; Kumar, 2009; Kumar and Norris, 2008).

Why Web proxy caching?



Since the space apportioned to the cache is limited, it must be utilized judiciously (Romano and ElAarag, 2011).

The most common Web caching methods are not efficient enough and may suffer from the cache pollution problem (Cobb and ElAarag, 2008; Koskela et al., 2003). Cache pollution results in:

Reduction of the effective cache size

Low hit ratio

Wasted bandwidth

Overload on the origin server

So far, determining which Web objects will be revisited remains a major challenge.

Problem Statement


Motivations for using machine learning in Web caching

Availability of Web access logs and trace files: the history of accesses can be regarded as complete prior knowledge for predicting future accesses.

The need for an efficient and adaptive scheme, since the Web environment changes rapidly and continuously.

Recent studies have proposed using ANNs in Web proxy caching, although ANN training may take a long time and require extra computational overhead.

More significantly, the integration of intelligent techniques into Web cache replacement is still under research.

Intelligent Web Proxy Caching


The suggested solutions

We present new intelligent approaches that rely on the ability of a Bayesian network (BN) to learn from Web proxy log files and to predict whether objects will be revisited or not.

More significantly, the trained BN classifier is incorporated into traditional Web proxy caching algorithms to form novel intelligent Web proxy caching approaches.

Bayesian Network (BN)






A Bayesian network is one of the most popular machine learning models; it relies on probability estimates to find the class of an observed pattern.

Rationale:

A Bayesian network (BN) is defined as a directed acyclic graph over which a probability distribution is defined. Each node in the graph represents a random variable or event, while the arcs (edges) between the nodes represent association or causal relationships.







The probabilistic dependency is maintained by the conditional probability table (CPT), which is attached to the corresponding event.

In classification tasks, the classification decision is calculated simply using the formula:

$$x \in c_i \quad \text{if} \quad P(x \mid c_i)\, P(c_i) = \max_{r} \{ P(x \mid c_r)\, P(c_r) \}$$

where $P(x \mid c_r)$ is the probability of finding the pattern $x$ in class $c_r$, and $P(c_r)$ is the probability of class $c_r$.
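As a rough illustration of the decision rule above, the following minimal Python sketch picks the class with the largest product of likelihood and prior. The class names, the pattern label and all probabilities are hypothetical values chosen for the example; they are not taken from the trained classifier.

```python
def classify(pattern, priors, likelihoods):
    """Return the class c_r that maximizes P(x | c_r) * P(c_r), plus all scores."""
    scores = {
        c: likelihoods[c].get(pattern, 0.0) * priors[c]
        for c in priors
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical priors P(c_r) and likelihoods P(x | c_r) for one discretized pattern.
priors = {"revisited": 0.3, "not_revisited": 0.7}
likelihoods = {
    "revisited": {"recent_small": 0.6},
    "not_revisited": {"recent_small": 0.2},
}

label, scores = classify("recent_small", priors, likelihoods)
print(label, scores)
```

With these made-up numbers the products are 0.18 and 0.14, so the pattern is assigned to the "revisited" class even though that class has the lower prior.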
Why BN in Web Caching?



Bayesian networks are popular supervised learning algorithms that are widely used in the medical field and in other applications such as military applications, forecasting, control, modelling for human understanding, cognitive science, statistics, and philosophy.

Hence, Bayesian networks can be utilized to produce promising solutions for Web proxy caching.

Related Works



Intelligent Web Caching?


Conventional Web caching methods are not efficient enough (Cobb and ElAarag, 2008; Koskela et al., 2003).

Therefore, several researchers have proposed incorporating intelligent solutions to cope with the Web caching problem.

According to Chen (2008), intelligent approaches are more efficient and more adaptive to the Web caching environment than other approaches.

Related works on intelligent Web caching

Summary of intelligent web caching


From the previous studies, we can observe two approaches in intelligent Web caching:

An intelligent technique is employed in Web caching on its own.

An intelligent technique is employed together with the LRU algorithm.

Both approaches may predict Web objects that can be re-accessed. However:

They do not take into account the cost and size of the predicted objects in the cache replacement decision.

Some important features are ignored.

The training process requires a long time and extra computational overhead.

Proposed Approach vs. Existing Approaches

| Proposed approach | Existing approaches |
|---|---|
| Takes into consideration the most effective factors in the cache replacement decision | One or more factors are ignored in the cache replacement decision |
| Depends on BN, which can achieve much better accuracy and is faster than BPNN and ANFIS | Depend on ANN or ANFIS, whose training may take a long time and require extra computational overhead |
| Integrates the BN classifier into the GDS algorithm, which takes the cost and size of cached objects into consideration | --- |
| BN is effectively integrated with LRU | An intelligent technique is employed individually or integrated with LRU |

The Proposed Intelligent Web Proxy
Caching Approaches







The operational framework for the proposed approach

The framework consists of two functional components:

Offline component: It works only during the proxy server's idle periods and is responsible for training the BN classifier.

Online component: The intelligent caching strategies are executed in this part.




A Framework for the proposed approach



In the online component, the intelligent caching strategies are used to manage the proxy cache content.

We propose intelligent Web proxy caching approaches that integrate BN with traditional Web caching policies to provide more effective caching policies:

Bayesian Network-Greedy-Dual-Size approach (BN-GDS): the BN classifier is integrated with GDS to improve the byte hit ratio of GDS.

Bayesian Network-Least-Recently-Used approach (BN-LRU): the BN classifier is combined with LRU to form a new algorithm called BN-LRU.

Bayesian Network-Dynamic Aging approach (BN-DA): the BN classifier is combined with dynamic aging (DA) to form a new algorithm called BN-DA.



Online Component



The Greedy-Dual-Size (GDS) caching algorithm was proposed by Cao and Irani (1997). The algorithm assigns a key value $K(p)$ to each object $p$ in the cache, and the object with the lowest key value is replaced:

$$K(p) = L + \frac{C(p)}{S(p)}$$

where $C(p)$ is the cost of bringing object $p$ into the cache, $S(p)$ is the object size, and $L$ is an inflation factor that starts at 0 and is updated to the key value of the last replaced object.

If an object is accessed again, its key value is updated using the new value of $L$.
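The GDS rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the simulator used in the study: the class name, the byte-based capacity, the default cost of 1 and the object names are assumptions made for the example.

```python
class GreedyDualSize:
    """Minimal GDS sketch: evict the object with the lowest key K(p) = L + C(p)/S(p)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.L = 0.0       # inflation factor, starts at 0
        self.entries = {}  # object id -> (key, size, cost)

    def _key(self, cost, size):
        return self.L + cost / size

    def access(self, obj, size, cost=1.0):
        """Return True on a hit, False on a miss."""
        if obj in self.entries:
            # Hit: refresh the key with the current value of L.
            _, size, cost = self.entries[obj]
            self.entries[obj] = (self._key(cost, size), size, cost)
            return True
        if size > self.capacity:
            return False  # object can never fit; do not cache it
        while self.used + size > self.capacity:
            # Evict the lowest-key object and raise L to its key value.
            victim = min(self.entries, key=lambda o: self.entries[o][0])
            victim_key, victim_size, _ = self.entries.pop(victim)
            self.L = victim_key
            self.used -= victim_size
        self.entries[obj] = (self._key(cost, size), size, cost)
        self.used += size
        return False


cache = GreedyDualSize(capacity_bytes=100)
cache.access("a", size=60)
cache.access("b", size=30)
cache.access("c", size=40)  # does not fit: "a" has the lowest key and is evicted
print(sorted(cache.entries), cache.L)
```

With a uniform cost, larger objects get smaller keys, so the 60-byte object is the first victim and $L$ rises to its key value of 1/60.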


1- The intelligent BN-GDS approach


Cherkasova (1998) enhanced the GDS algorithm by incorporating a frequency count; the resulting algorithm is called Greedy-Dual-Size-Frequency (GDSF):

$$K(p) = L + \frac{F(p) \cdot C(p)}{S(p)}$$

where $F(p)$ is the access count of object $p$.

One advantage of the GDSF policy is that it performs well in terms of the hit ratio. However, the byte hit ratio of the GDSF policy is too low.

Therefore, the BN classifier is integrated with GDS to improve the byte hit ratio; the resulting approach is called BN-GDS.



In the proposed BN-GDS, GDS is enhanced by incorporating the accumulated scores (probabilities) $W(g)$ of revisiting object $g$, as given by the BN classifier:

$$K(g) = L + \frac{W(g) \cdot C(g)}{S(g)}$$

This means that the key value of object $g$ is determined not just by its past occurrence frequency, but also by the class predicted from the six factors.

The rationale behind the proposed BN-GDS approach is that we can raise the priority of cached objects that may be revisited in the near future according to the BN classifier, even if they are not accessed frequently enough.
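To make the effect of the score $W(g)$ concrete, here is a small Python sketch of the BN-GDS key. It assumes that the score is accumulated by adding the predicted revisit probability on every access; that additive update, the function names and the probabilities are illustrative assumptions rather than details given in the slides.

```python
def update_score(previous_score, p_revisit):
    """Accumulate the revisit probability predicted by the classifier (assumed additive)."""
    return previous_score + p_revisit


def bn_gds_key(L, score, cost, size):
    """Key value K(g) = L + W(g) * C(g) / S(g)."""
    return L + score * cost / size


L = 0.0

# Object A: accessed once, classifier is confident it will be revisited (hypothetical 0.9).
w_a = update_score(0.0, 0.9)

# Object B: accessed twice, classifier gives a low probability each time (hypothetical 0.1).
w_b = update_score(update_score(0.0, 0.1), 0.1)

key_a = bn_gds_key(L, w_a, cost=1.0, size=1000)
key_b = bn_gds_key(L, w_b, cost=1.0, size=1000)
print(key_a, key_b)
```

Under these assumptions object A ends up with the higher key although it was accessed less often, which is the behaviour the rationale above describes.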
2- The intelligent BN-LRU approach

LRU is the most common proxy caching policy. However, LRU suffers from cold cache pollution: a new object is inserted at the top of the cache stack, and if it is not requested again, it takes some time to move down to the bottom of the stack before it is removed.

To reduce cache pollution in LRU, the BN classifier is combined with LRU to form a new algorithm called BN-LRU.


The proposed BN-LRU works as follows:

When Web object g is requested by a user, the BN classifier predicts whether the object will be revisited or not.

If object g is classified by BN as an object that will be revisited, it is placed at the top of the cache stack.

Otherwise, object g is placed in the middle of the cache stack.

Hence, BN-LRU can efficiently remove unwanted objects early to make space for new Web objects.
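The steps above could be sketched as follows. This is a simplified illustration: capacity is counted in objects rather than bytes, and the `will_revisit` callable stands in for the trained BN classifier. Both simplifications, as well as the object names, are assumptions made for the example.

```python
class BNLRU:
    """Minimal BN-LRU sketch: predicted-revisit objects go to the top, others to the middle."""

    def __init__(self, capacity, will_revisit):
        self.capacity = capacity          # number of objects (simplification)
        self.will_revisit = will_revisit  # stand-in for the trained classifier
        self.stack = []                   # index 0 is the top; the last element is evicted first

    def request(self, obj):
        """Return True on a hit, False on a miss."""
        hit = obj in self.stack
        if hit:
            self.stack.remove(obj)
        elif len(self.stack) >= self.capacity:
            self.stack.pop()  # evict from the bottom of the stack
        position = 0 if self.will_revisit(obj) else len(self.stack) // 2
        self.stack.insert(position, obj)
        return hit


# Hypothetical classifier: objects whose name starts with "hot" are predicted to be revisited.
cache = BNLRU(capacity=4, will_revisit=lambda name: name.startswith("hot"))
for name in ["hot1", "hot2", "cold1", "hot3", "cold2"]:
    cache.request(name)
print(cache.stack)
```

Objects predicted not to be revisited start halfway down the stack, so they reach the bottom and are evicted sooner than they would under plain LRU.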



3- The intelligent BN-DA approach

In addition to frequency, several factors can contribute to predicting whether an object will be revisited in the future.

The proposed BN-DA approach combines the most significant factors, using the Bayesian network (BN) classifier to predict the probability that Web objects will be revisited later.

In the proposed BN-DA approach, when a user visits Web object $g$, the trained BN classifier predicts the probability that $g$ belongs to the class of objects that may be revisited. These probabilities are then accumulated as a score $W(g)$, which is used in the cache replacement decision:

$$K(g) = L + W(g)$$
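A minimal Python sketch of the BN-DA key is shown below. It assumes that $L$ plays the same aging role as in GDS (it is raised to the key of the evicted object), that capacity is counted in objects, and that scores are kept after eviction. These choices, the class name and the probabilities are assumptions made for illustration only.

```python
class BNDynamicAging:
    """Minimal BN-DA sketch: evict the object with the lowest key K(g) = L + W(g)."""

    def __init__(self, capacity, p_revisit):
        self.capacity = capacity    # number of objects (simplification)
        self.p_revisit = p_revisit  # stand-in for the trained classifier
        self.L = 0.0                # aging factor (assumed to behave as in GDS)
        self.scores = {}            # accumulated scores W(g)
        self.keys = {}              # key values K(g) of cached objects

    def request(self, obj):
        """Return True on a hit, False on a miss."""
        hit = obj in self.keys
        if not hit and len(self.keys) >= self.capacity:
            victim = min(self.keys, key=self.keys.get)
            self.L = self.keys.pop(victim)  # age the cache
        self.scores[obj] = self.scores.get(obj, 0.0) + self.p_revisit(obj)
        self.keys[obj] = self.L + self.scores[obj]
        return hit


# Hypothetical revisit probabilities returned by the classifier.
probabilities = {"a": 0.9, "b": 0.1, "c": 0.5}
cache = BNDynamicAging(capacity=2, p_revisit=probabilities.get)
for name in ["a", "b", "c"]:
    cache.request(name)
print(cache.keys, cache.L)
```

In this run object "b" has the lowest accumulated score and is evicted when "c" arrives, after which new keys are offset by the aging factor.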
Implementation and Performance
Evaluation




1- Data collection

We obtained proxy log files of Web objects requested over fifteen days at several proxy servers of the IRCache network located around the United States (NLANR, 2010).

In this study, the proxy log files of 21 August 2010 were used in the training phase, while the proxy log files of the following days were used in the simulation and implementation phase.

| Proxy dataset | Proxy server name | Location | Duration of collection |
|---|---|---|---|
| BO2 | bo.us.ircache.net | Boulder, Colorado | 21/8 to 4/9/2010 |
| SV | sv.us.ircache.net | Silicon Valley, California (FIX-West) | 21/8 to 4/9/2010 |
| SD | sd.us.ircache.net | San Diego, California | 21/8 to 28/8/2010 |
| NY | ny.us.ircache.net | New York, NY | 21/8 to 4/9/2010 |

2- Data Pre-processing


Data preprocessing removes irrelevant requests from the log files, since some log entries are invalid or irrelevant.

The trace preparation is carried out as follows:

Parsing: identifying the boundaries between successive fields and records in the log files.

Filtering: eliminating irrelevant entries, such as uncacheable requests and entries with unsuccessful HTTP status codes.

Finalizing: removing unnecessary fields. Moreover, each unique URL is converted to a unique integer identifier to reduce simulation time.

The final format of our data consists of the URL ID, timestamp, elapsed time, size and type of the Web object:


| URL_ID | Timestamp | Elapsed Time (milliseconds) | Size (bytes) | Type |
|---|---|---|---|---|
| 1 | 1282348905.73 | 33 | 33070 | application/octet-stream |
| 2 | 1282348907.41 | 703 | 14179 | image/jpeg |
| 3 | 1282348908.47 | 284 | 1276 | image/jpeg |
| 4 | 1282349578.75 | 154 | 24612 | text/html |
| 1 | 1282349661.61 | 31 | 33070 | application/octet-stream |
| 5 | 1282349675.35 | 203 | 5592 | text/html |
| 6 | 1282349688.90 | 231 | 34796 | text/html |
| 4 | 1282349753.72 | 375 | 24612 | text/html |
| 4 | 1282350464.01 | 133 | 24612 | text/html |
| 1 | 1282351887.76 | 135 | 33070 | application/octet-stream |
| 4 | 1282352609.09 | 55 | 24612 | text/html |
| 1 | 1282352861.56 | 111 | 33070 | application/octet-stream |
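The filtering and finalizing steps could be sketched as follows in Python. The slides do not say exactly which requests count as uncacheable or unsuccessful, so the two rules below (GET requests without a query string, and 2xx status codes) are assumptions, and the input records are made-up examples rather than real log entries.

```python
def prepare_trace(records):
    """Drop irrelevant entries and replace each unique URL by an integer ID."""
    url_ids = {}
    trace = []
    for timestamp, elapsed_ms, status, size, method, url, mime in records:
        cacheable = method == "GET" and "?" not in url  # assumed cacheability rule
        successful = 200 <= status < 300                # assumed success rule
        if not (cacheable and successful):
            continue
        # Assign IDs in order of first appearance: 1, 2, 3, ...
        url_id = url_ids.setdefault(url, len(url_ids) + 1)
        trace.append((url_id, timestamp, elapsed_ms, size, mime))
    return trace


# Hypothetical, already parsed records:
# (timestamp, elapsed_ms, status, size, method, url, mime type)
records = [
    (100.0, 33, 200, 33070, "GET", "http://example.com/a.bin", "application/octet-stream"),
    (101.5, 703, 200, 14179, "GET", "http://example.com/b.jpg", "image/jpeg"),
    (102.0, 50, 404, 300, "GET", "http://example.com/missing", "text/html"),
    (103.0, 80, 200, 900, "POST", "http://example.com/form", "text/html"),
    (104.0, 31, 200, 33070, "GET", "http://example.com/a.bin", "application/octet-stream"),
]

trace = prepare_trace(records)
for row in trace:
    print(row)
```

The output rows have the same five fields as the final format shown above, with repeated URLs mapped to the same integer ID.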



3- Training Phase



The training pattern takes the format:

| Input | Meaning |
|---|---|
| x1 | Recency of Web object based on sliding window |
| x2 | Frequency of Web object |
| x3 | Frequency of Web object based on sliding window |
| x4 | Retrieval time of Web object |
| x5 | Size of Web object |
| x6 | Type of Web object |


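Preparation of dataset for Web object classification (the first six columns are the inputs; the last column is the target):

| Recency | Frequency | SWL Frequency | Retrieval Time | Size | Type | Target |
|---|---|---|---|---|---|---|
| 1800 | 1 | 1 | 33 | 33070 | 5 | 1 |
| 1800 | 1 | 1 | 703 | 14179 | 2 | 0 |
| 1800 | 1 | 1 | 284 | 1276 | 2 | 0 |
| 1800 | 1 | 1 | 154 | 24612 | 1 | 1 |
| 1800 | 2 | 2 | 31 | 33070 | 5 | 0 |
| 1800 | 1 | 1 | 203 | 5592 | 1 | 0 |
| 1800 | 1 | 1 | 231 | 34796 | 1 | 0 |
| 1800 | 2 | 2 | 375 | 24612 | 1 | 1 |
| 1800 | 3 | 3 | 133 | 24612 | 1 | 0 |
| 2226.15 | 3 | 1 | 135 | 33070 | 5 | 1 |
| 2145.08 | 4 | 1 | 55 | 24612 | 1 | 0 |
| 1800 | 4 | 2 | 111 | 33070 | 5 | 0 |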
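The slides do not spell out how the sliding-window features and the target are computed, but the example rows are consistent with the rules sketched below: a window length (SWL) of 1800 seconds, recency equal to the larger of the SWL and the time since the previous request, a window frequency that resets when that gap exceeds the SWL, and a target of 1 when the object is requested again within the SWL. These rules are inferred from the example data and should be read as an assumption, not as the authors' stated procedure.

```python
SWL = 1800.0  # assumed sliding window length in seconds

# (url_id, timestamp, elapsed_ms, size_bytes, type_code) from the example trace.
# Type codes follow the training table: 1 = text/html, 2 = image/jpeg,
# 5 = application/octet-stream.
trace = [
    (1, 1282348905.73, 33, 33070, 5),
    (2, 1282348907.41, 703, 14179, 2),
    (3, 1282348908.47, 284, 1276, 2),
    (4, 1282349578.75, 154, 24612, 1),
    (1, 1282349661.61, 31, 33070, 5),
    (5, 1282349675.35, 203, 5592, 1),
    (6, 1282349688.90, 231, 34796, 1),
    (4, 1282349753.72, 375, 24612, 1),
    (4, 1282350464.01, 133, 24612, 1),
    (1, 1282351887.76, 135, 33070, 5),
    (4, 1282352609.09, 55, 24612, 1),
    (1, 1282352861.56, 111, 33070, 5),
]


def build_patterns(trace, swl=SWL):
    """Return rows [recency, frequency, swl_frequency, time, size, type, target]."""
    last_seen, freq, swl_freq = {}, {}, {}
    rows = []
    for url_id, ts, elapsed, size, type_code in trace:
        gap = ts - last_seen[url_id] if url_id in last_seen else None
        recency = swl if gap is None else max(swl, gap)
        freq[url_id] = freq.get(url_id, 0) + 1
        if gap is not None and gap <= swl:
            swl_freq[url_id] += 1
        else:
            swl_freq[url_id] = 1  # first request, or previous one fell outside the window
        rows.append([round(recency, 2), freq[url_id], swl_freq[url_id],
                     elapsed, size, type_code, 0])
        last_seen[url_id] = ts
    # Second pass, backwards: target is 1 if the next request comes within the window.
    next_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        url_id, ts = trace[i][0], trace[i][1]
        if url_id in next_seen and next_seen[url_id] - ts <= swl:
            rows[i][-1] = 1
        next_seen[url_id] = ts
    return rows


patterns = build_patterns(trace)
for row in patterns:
    print(row)
```

Run on the twelve example requests, these rules give the same frequency, window frequency and target columns as the training table.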




Each proxy dataset is then divided randomly into training data (70%) and testing data (30%).

Subsequently, the dataset is discretized using the MDL method suggested by Fayyad & Irani (1993), with the default setup in WEKA.

Finally, the Bayesian network (BN) is trained using WEKA as well. In WEKA, the BN algorithm is available in the Java class “weka.classifiers.bayes.BayesNet”. The default parameter values and settings predefined in WEKA are used in BN training.



4- Performance Evaluation



We have modified the WebTraff simulator (Markatchev and Williamson, 2002) to support our proposed proxy caching approaches.

The trained classifiers are integrated with the WebTraff simulator to simulate the proposed intelligent Web proxy caching approaches.




Two common measures are used to analyze efficiency:

Hit Ratio (HR)

$$\text{Hit Ratio} = \frac{\text{Number of objects acquired from the cache}}{\text{Total number of objects requested}}$$

Byte Hit Ratio (BHR)

$$\text{Byte Hit Ratio} = \frac{\text{Number of bytes acquired from the cache}}{\text{Total number of bytes requested}}$$
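As a quick worked example of the two measures, the Python sketch below applies them to the BO2 counts reported in the IRCache trace analysis of this deck. Using the cacheable requests and cacheable bytes as the denominators is an assumption; it is the choice that reproduces the maximum HR and BHR reported for BO2.

```python
def ratio_percent(from_cache, total):
    """Share of requests (or bytes) served from the cache, in percent."""
    return 100.0 * from_cache / total


# BO2 counts from the trace analysis.
bo2_hits = 64797
bo2_cacheable_requests = 594989
bo2_byte_hits = 4514836891
bo2_cacheable_bytes = 23204930341

bo2_hr = ratio_percent(bo2_hits, bo2_cacheable_requests)
bo2_bhr = ratio_percent(bo2_byte_hits, bo2_cacheable_bytes)
print(round(bo2_hr, 2), round(bo2_bhr, 2))
```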


Analysis of IRcache traces


| | BO2 | NY | UC | SV | SD |
|---|---|---|---|---|---|
| #Total requests | 1210693 | 3248452 | 8891764 | 2496001 | 29871204 |
| #Cacheable requests | 594989 | 1518232 | 2827904 | 1194098 | 6059349 |
| #Cacheable bytes | 23204930341 | 68402036319 | 469362584083 | 48043794224 | 230326816876 |
| #Unique requests | 530192 | 1144885 | 2402406 | 1012355 | 5284441 |
| Total size of unique requests (bytes) | 18690093450 | 56147903761 | 156538171752 | 38364029432 | 190539902251 |
| #Hits | 64797 | 373347 | 425498 | 181743 | 774908 |
| #Byte hits | 4514836891 | 12254132558 | 312824412331 | 9679764792 | 39786914625 |
| Max HR (%) | 10.89 | 24.59 | 15.05 | 15.22 | 12.79 |
| Max BHR (%) | 19.46 | 17.91 | 66.65 | 20.15 | 17.27 |


In terms of Hit Ratio (HR)

[Figure: Impact of cache size on HR for different proxy datasets. Panels: (a) BO2 HR, (b) NY HR]

BN-GDS achieves the best HR among all algorithms, while LRU achieves the worst HR.

BN-GDS and BN-LRU improve the HR of GDS and LRU, respectively.

Although the HR of BN-DA is worse than that of GDS and GDSF, it is better than that of NNPCR-2, BN-LRU and LRU.

In terms of Byte Hit Ratio (BHR)

[Figure: Impact of cache size on BHR for different proxy datasets. Panels: (a) BO2, (b) NY]

BN-LRU and BN-DA achieve the best BHR among all algorithms, while GDS and GDSF attain the worst BHR.

The BHR of LRU is better than that of BN-GDS, GDS and GDSF.

BN-GDS significantly improves the BHR of GDS and GDSF.

BN-LRU and BN-DA have a better BHR than LRU and NNPCR-2.

Conclusion

This study has proposed three intelligent Web proxy caching approaches, called BN-GDS, BN-LRU and BN-DA, for improving the performance of conventional Web proxy caching algorithms.

The BN classifier learns from Web proxy log files to predict whether objects will be revisited or not.

The trained classifier is integrated with conventional Web proxy caching to provide more effective proxy caching policies.

The simulation results have revealed that BN-GDS achieved the best HR, a better BHR than GDS and GDSF, and an acceptable BHR compared to BN-LRU and BN-DA, which achieved the best BHR. That means BN-GDS struck a better balance between HR and BHR than the other algorithms. On the other hand, BN-LRU and BN-DA achieved the best BHR among all algorithms, and a better HR than LRU and NNPCR-2.

Future works



In the future:

Other intelligent classifiers can be utilized to improve the performance of traditional Web caching policies.

Clustering algorithms can be used to enhance the performance of Web caching policies.


References

Kaya, C.C., Zhang, G., Tan, Y., & Mookerjee, V.S. 2009. An admission-control technique for delay reduction in proxy caching. Decision Support Systems, 46, 594-603.

Kumar, C. 2009. Performance evaluation for implementations of a network of proxy caches. Decision Support Systems, 46, 492-500.

Kumar, C., & Norris, J.B. 2008. A new approach for a proxy-level web caching mechanism. Decision Support Systems, 46, 52-60.

Romano, S., & ElAarag, H. 2011. A neural network proxy cache replacement strategy and its implementation in the Squid proxy server. Neural Computing & Applications, 20, 59-78.

Cobb, J., & ElAarag, H. 2008. Web proxy cache replacement scheme based on back-propagation neural network. Journal of Systems and Software, 81, 1539-1558.

Koskela, T., Heikkonen, J., & Kaski, K. 2003. Web cache optimization with nonlinear model using object features. Computer Networks, 43, 805-817.

Chen, H.T. 2008. Pre-fetching and Re-fetching in Web caching systems: Algorithms and Simulation. Trent University, Peterborough, Ontario, Canada.

Cao, P., & Irani, S. 1997. Cost-Aware WWW Proxy Caching Algorithms. In Proceedings of the 1997 USENIX Symposium on Internet Technologies and Systems, Monterey, CA.

Cherkasova, L. 1998. Improving WWW Proxies Performance with Greedy-Dual-Size-Frequency Caching Policy. HP Technical Report, Palo Alto.

NLANR. 2010. National Lab of Applied Network Research (NLANR). Sanitized access logs. Available at http://www.ircache.net/.

Fayyad, U.M., & Irani, K.B. 1993. Multi-interval discretization of continuous-valued attributes for classification learning. 13th International Joint Conference on Artificial Intelligence (IJCAI-93), pp. 1022-1027.

Markatchev, N., & Williamson, C. 2002. WebTraff: A GUI for Web Proxy Cache Workload Modeling and Analysis. Proceedings of the 10th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, p. 356.