Data Mining of User Navigation Patterns


Nov 20, 2013 (3 years and 6 months ago)


techniques which study the user behaviour when navigating within a web site.
Understanding the visitors' navigation preferences is an essential step in improving
the quality of electronic commerce services. In fact, the understanding of the
most likely access patterns of users allows the service provider to customise and
adapt the site's interface for the individual user [...], and to improve the site's
static structure within the underlying hypertext system [...].

When web users interact with a site, data recording their behaviour is stored
in web server logs, which in a medium-sized site can amount to several megabytes
per day. Moreover, since the log data is collected in a raw format, it is an ideal
target for being analysed by automated tools. Currently, several commercial log
analysis tools are available [...]; however, these tools have limited analysis
capabilities, producing only results such as summary statistics and frequency counts
of page visits. In the meantime, the research community has been studying data
mining techniques to take full advantage of information available in the log files.
There have so far been two main approaches to mining for user navigation patterns
from log data. In the first approach, log data is mapped onto relational
tables and an adapted version of standard data mining techniques, such as mining
association rules, is invoked; see, for example, [...]. In the second approach,
techniques are developed which can be invoked directly on the log data; see, for
example, [...].

In this paper we propose a new model for handling the problem of mining
log data which directly captures the semantics of the user navigation sessions.
We model the user navigation records, inferred from log data, as a hypertext
probabilistic grammar whose higher probability generated strings correspond to
the user's preferred trails. Therefore, a compact and self-contained model of
user interaction with the web is provided. There are two contexts in which such
a model is potentially useful. On the one hand, it can help the service provider
to understand the users' needs and, as a result, improve the quality of its service.
The quality of service can be improved by providing adaptive pages suited to the
individual user, by building dynamic pages in advance to reduce waiting time,
or by providing a speculative service which sends, in addition to the requested
document, a number of other documents that are expected to be requested in
the near future. On the other hand, such a model can be useful to the individual
web user by acting as a personal assistant integrated with his/her web browser.
In fact, if the browser keeps a user's log file which characterises his/her
interactions with the web, a hypertext probabilistic grammar can be incrementally
built. Such a grammar would be a representation of the user's knowledge of
the web, which can act as a memory aid, be analysed in order to infer the user's
preferred trails for a given subject, or work as a prediction tool to prefetch
interesting pages in advance.

Section [...] presents the proposed hypertext grammar model, while Section [...]
presents the results of the performed experiments. Section [...] discusses related
work and Section [...] presents a preliminary discussion of recent improvements
to the model. Finally, in Section [...] we give our concluding remarks and discuss
further work.

Hypertext Probabilistic Grammars

A log file can be seen as a per-user ordered set of web page requests from which
it is possible to infer the user navigation sessions. In this work we simply define a
user navigation session as a sequence of page requests such that no two consecutive
requests are separated by more than X minutes, where X is a parameter.
In [...] the authors proposed for X the value of [...] minutes, which corresponds to
[...] standard deviations of the average time between user interface events. Since
then many authors have adopted the value of [...] minutes. We note, however,
that more advanced data preparation techniques, such as those described in [...],
could be used in a data preprocessing stage to fully take advantage of all the
information available in the log files.
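
The session definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the preprocessing used in the experiments: the `(user, timestamp, page)` record layout, the function name `split_sessions`, and the 10-minute gap in the usage lines are hypothetical choices made here.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_sessions(requests, max_gap_minutes):
    """Group (user, timestamp, page) records into per-user navigation sessions.

    A new session starts whenever two consecutive requests of the same user
    are separated by more than max_gap_minutes (the parameter X in the text).
    """
    max_gap = timedelta(minutes=max_gap_minutes)
    per_user = defaultdict(list)
    for user, when, page in requests:
        per_user[user].append((when, page))

    sessions = []
    for hits in per_user.values():
        hits.sort(key=lambda hit: hit[0])  # order each user's requests by time
        current, previous = [], None
        for when, page in hits:
            if previous is not None and when - previous > max_gap:
                sessions.append(current)  # gap too long: close the session
                current = []
            current.append(page)
            previous = when
        sessions.append(current)
    return sessions


# Hypothetical log records; the 10-minute gap is an arbitrary illustrative value.
t0 = datetime(2000, 1, 1, 12, 0)
log = [
    ("u1", t0, "A1"),
    ("u1", t0 + timedelta(minutes=2), "A2"),
    ("u1", t0 + timedelta(minutes=50), "A1"),
    ("u2", t0 + timedelta(minutes=1), "A3"),
]
sessions = split_sessions(log, max_gap_minutes=10)
```

Here the 48-minute pause of user `u1` splits that user's requests into two sessions, while `u2` contributes a single one-page session.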
The user navigation sessions inferred from the log data are modelled as a
hypertext probabilistic language generated by a hypertext probabilistic grammar
(or simply HPG) [...], which is a proper subclass of probabilistic regular
grammars [...]. A HPG is a probabilistic regular grammar which has a one-to-one
mapping between the set of nonterminal symbols and the set of terminal symbols.
Each nonterminal symbol corresponds to a web page and a production rule
corresponds to a link between pages. Moreover, there are two additional states,
S and F, which represent the start and finish states of the navigation sessions.
From the set of user sessions we obtain the number of times a page was
requested, the number of times it was the first state in a session, and the number
of times it was the last state in a session. The number of times a sequence of two
pages appears in the sessions gives the number of times the corresponding link
was traversed. The probability of a production from a state that corresponds to
a web page is proportional to the number of times the corresponding link was
chosen relative to the number of times the user visited that page.
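
As a rough illustration of the counting just described, the sketch below derives production probabilities for page states from a list of sessions. It assumes that the end of a session is recorded as a production to the finish state F, so that each page's outgoing probabilities sum to one; the function name and the sample sessions are hypothetical.

```python
from collections import Counter, defaultdict

FINISH = "F"  # artificial finish state


def transition_probabilities(sessions):
    """Estimate production probabilities for page states from sessions."""
    visits = Counter()            # times each page was requested
    links = defaultdict(Counter)  # links[a][b]: times b directly followed a
    for session in sessions:
        for i, page in enumerate(session):
            visits[page] += 1
            # Assumption: the last page of a session links to the finish state.
            nxt = session[i + 1] if i + 1 < len(session) else FINISH
            links[page][nxt] += 1
    # Link count relative to the number of visits to the source page.
    return {
        page: {nxt: n / visits[page] for nxt, n in out.items()}
        for page, out in links.items()
    }


# Hypothetical sessions, for illustration only.
sessions = [["A1", "A2", "A3"], ["A1", "A3"], ["A2", "A3"]]
probs = transition_probabilities(sessions)
```

In this toy input, page `A1` is visited twice and followed once each by `A2` and `A3`, so both productions get probability 0.5.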
The probability of a production from the start state is proportional to the
number of times the corresponding state was visited, implying that the destination
node of a production with higher probability corresponds to a state that
was visited more often. We define $\alpha$ as a parameter that attaches the desired
weight to a state being the first in a user navigation session. If $\alpha = 0$, only states
which were the first in a session have probability greater than zero of being in
a production from the start state; on the other hand, if $\alpha = 1$, all state visits are
given proportionate weight. Note that when $\alpha > 0$ every grammar state has an
initial probability greater than zero. The probabilities of the productions from
the start state correspond to the vector of initial probabilities, $\pi$, and the
probabilities of the other productions correspond to the transition matrix of a Markov
chain [...].
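
One way to realise the weighting parameter is as a convex combination of two relative frequencies: the share of all page requests that went to a state, and the share of sessions that started at it. The sketch below assumes exactly that form, which matches the two limiting cases described in the text but is otherwise an assumption; the names `initial_probabilities` and `alpha` are introduced here for illustration.

```python
from collections import Counter


def initial_probabilities(sessions, alpha):
    """Vector of initial probabilities under an assumed convex combination.

    alpha = 0 counts only first-in-session occurrences;
    alpha = 1 weights every state by its share of all page requests.
    """
    sessions = [s for s in sessions if s]  # ignore empty sessions
    visits = Counter(page for s in sessions for page in s)
    firsts = Counter(s[0] for s in sessions)
    total_visits = sum(visits.values())
    return {
        page: alpha * visits[page] / total_visits
        + (1 - alpha) * firsts[page] / len(sessions)
        for page in visits
    }


# Hypothetical sessions, for illustration only.
sessions = [["A1", "A2", "A3"], ["A1", "A3"], ["A2", "A3"]]
pi_first_only = initial_probabilities(sessions, alpha=0.0)
pi_all_visits = initial_probabilities(sessions, alpha=1.0)
pi_mixed = initial_probabilities(sessions, alpha=0.5)
```

With these sessions, `A3` never starts a session, so its initial probability is zero when `alpha` is 0 and positive for any larger value.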

In the example shown in Figure [...] we have [...] user sessions with a total of [...]
page requests, wherein state A[...] was visited [...] times, [...] of which are the
first state in a user session; therefore, since $\alpha$ = [...], we have $\pi$(A[...]) = [...].
Figure [...] shows the grammar inferred from the given set of trails for N = [...] and
$\alpha$ = [...]. We freely utilise in our figures the duality between grammars and [...]

[...] the entropy is high then there should be a large number of strings with similar
and low probability.
Assuming a transition with probability one from state F to S, a HPG
corresponds to an irreducible and aperiodic Markov chain with a stationary
distribution vector and transition matrix $A$. Thus we can estimate the entropy
with the following expression:

$$H = H(\pi) - \sum_{i,j} \pi_i A_{ij} \log A_{ij}$$

where $H(\pi)$ is included to take into account the randomness of the choice of the
initial page (see [...] for detail on the entropy of a Markov chain). Note that we use
the vector of initial probabilities $\pi$ as an estimator of the stationary vector, since it
is proportional to the number of times each state was visited. The value of $H$
can be normalised to be in the range between 0 and 1 by considering its ratio
with the corresponding random grammar. The random grammar consists of a
grammar with the same structure but in which all states have their outlink
probabilities assigned according to a uniform distribution.
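
The entropy estimate and its normalisation can be sketched as follows. Several choices here are assumptions rather than statements from the text: the estimate is taken as the entropy of the initial probability vector plus the weighted sum of the row entropies of the transition matrix, logarithms are base 2 (the ratio does not depend on the base), and the random grammar keeps the same state weights while making every production, including those from the start state, uniform. All function names are hypothetical.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy (base 2) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def grammar_entropy(initial, transitions):
    """Entropy of the initial vector plus the weighted row entropies."""
    h = shannon_entropy(initial.values())
    for state, weight in initial.items():
        h += weight * shannon_entropy(transitions[state].values())
    return h


def random_grammar_entropy(initial, transitions):
    """Same structure and state weights, but uniform production probabilities."""
    n_start = sum(1 for p in initial.values() if p > 0)
    h = math.log2(n_start) if n_start else 0.0
    for state, weight in initial.items():
        out_degree = sum(1 for p in transitions[state].values() if p > 0)
        if out_degree:
            h += weight * math.log2(out_degree)
    return h


def normalised_entropy(initial, transitions):
    h_random = random_grammar_entropy(initial, transitions)
    return grammar_entropy(initial, transitions) / h_random if h_random > 0 else 0.0


# Hypothetical two-page grammar, for illustration only.
initial = {"A1": 0.75, "A2": 0.25}
transitions = {"A1": {"A2": 0.9, "F": 0.1}, "A2": {"F": 1.0}}
ratio = normalised_entropy(initial, transitions)
```

Under these assumptions the ratio cannot exceed one, because each entropy term is bounded by the logarithm of the number of alternatives it ranges over.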
Such a measure, which is an estimator of the statistical distribution of the
grammar string probabilities, can be useful in helping the user in the specification
of the support and confidence thresholds.

Experimental Evaluation

To assess the performance and the effectiveness of the proposed model,
experiments were conducted with both random and real data. Tests with random data
provide the means of evaluating many different topologies and configurations of
a HPG, and tests with real data allow us to verify whether or not the model is
potentially useful in practice.
The method used to create the random data consisted of four consecutive
steps: (i) given the required number of states and the average branching factor,
randomly create a directed graph; (ii) for each grammar state, assign weights to
its outgoing links according to the chosen probability distribution; (iii) verify
whether the resulting grammar has the required properties, that is, whether every
state has a path to F, and if not add a link to F; (iv) normalise the grammar's
weights, that is, calculate the production probabilities.
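
The four steps can be sketched as below. This is a simplified illustration, not the generator used in the experiments: every state receives exactly `branching` outgoing links instead of matching an average branching factor, self-links are excluded, and link weights are drawn from a uniform distribution on [0.1, 1.0]. These choices and all names are assumptions made for the example.

```python
import random
from collections import deque

FINISH = "F"


def states_reaching_finish(links):
    """Return the page states from which the finish state is reachable."""
    reverse = {}
    for source, out in links.items():
        for target in out:
            reverse.setdefault(target, set()).add(source)
    reached, queue = set(), deque([FINISH])
    while queue:  # breadth-first search on the reversed graph
        node = queue.popleft()
        for source in reverse.get(node, ()):
            if source not in reached:
                reached.add(source)
                queue.append(source)
    return reached


def random_hpg(n_states, branching, seed=0):
    rng = random.Random(seed)
    states = [f"A{i}" for i in range(1, n_states + 1)]
    links = {}
    for state in states:
        # (i) random outgoing links, (ii) weighted by the chosen distribution
        candidates = [s for s in states if s != state] + [FINISH]
        k = max(1, min(branching, len(candidates)))
        links[state] = {t: rng.uniform(0.1, 1.0) for t in rng.sample(candidates, k)}
    # (iii) make sure every state has a path to F
    reached = states_reaching_finish(links)
    for state in states:
        if state not in reached:
            links[state][FINISH] = rng.uniform(0.1, 1.0)
    # (iv) normalise the weights into production probabilities
    return {
        state: {t: w / sum(out.values()) for t, w in out.items()}
        for state, out in links.items()
    }


grammar = random_hpg(n_states=20, branching=3, seed=1)
```

Adding a direct link to F in step (iii) is the simplest repair; it guarantees the required property without a second reachability pass.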
For the experiments with real data we used log files obtained from the authors
of [...]. The log files contain two months of usage from the site
http://www.hyperreal.org/music/machines/. It should be noted that the data was collected
while caching was disabled and that we used the data without cleaning it. We
divided each month into four subsets, each corresponding to a week, and for each
subset we built the corresponding HPG for several values of the history depth.
One of the weeks was discarded because it presented characteristics significantly
different from the others, namely in the number of states inferred. This last fact
indicates that the data should be cleaned; however, since data cleaning was not
the aim of the work, together with the belief that the probabilistic nature of the
model could partially overcome the dirtiness of the data, we decided to use the
data without cleaning.

The use of data mining techniques to analyse log data was first proposed by
[...]. In [...] the log data is converted into a set of maximal forward
references, a form which is amenable to being processed by existing association rules
techniques. Two algorithms are given to mine the rules, which in this context
consist of large itemsets with the additional restriction that the references must
be consecutive in a transaction. In [...] a method is proposed to classify web site
visitors according to their access patterns. Each user session is stored in a vector
that contains the number of visits to each page, and an algorithm is given to find
clusters of similar vectors. The method does not take into account the order in
which the page visits took place.
In [...] the authors challenged the AI community to use the log data to
create adaptive web sites, and in [...] they present a technique which automatically
creates index pages from the log data, i.e. pages containing collections of
links which the user navigation behaviour suggests are related. In our previous
work [...], we proposed to model the log data as a directed graph with the arcs'
weights interpreted as probabilities that reflect the user interaction with the site,
and we generalised the association rule concept. The authors of [...] propose to
use log data to predict the next URL to be requested, so that the server can generate
in advance web pages with dynamic content. A tree which contains the user
paths is generated from the log data, and an algorithm is proposed to predict the
next request given the tree and the current user session. In [...] the authors
propose a log data mining system composed of an aggregation module and a data
mining module for the discovery of patterns with predefined characteristics. The
aggregation module infers a tree structure from the data, in which the mining is
performed by a human expert using a mining query language.
[...] propose the integration of data warehousing and data mining
techniques to analyse web records, and [...] study cleaning and preparation
techniques which convert log data into user navigation sessions in a form amenable
to processing by the existing data mining techniques.

Heuristics for Mining High Quality Patterns

We are currently working on the specification of new algorithms to improve the
quality of the results relative to the deterministic DFS used in the experiments
reported herein. In fact, as was shown in Section [...], the exhaustive computation
of all grammar strings with probability above the cut-point has the drawback
of potentially returning a very large number of rules for small values of the
cut-point. Note that if the user wants to find longer rules the threshold needs to be
set low. This fact led us to the study of heuristics which allow us to compute
a subset of the rule set while being able to control both its size and quality. In
the following sections we briefly describe the ideas behind the heuristics we are
developing; a complete description can be found in [...].
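
To make the size problem concrete, the following sketch enumerates trails by depth-first search and keeps those whose probability stays at or above a cut-point. It is a minimal illustration under stated assumptions, not the algorithm used in the experiments: the finish state is not appended to trails, only trails that cannot be extended further are reported, and the grammar, the function name `mine_trails` and the cut-point value are hypothetical.

```python
FINISH = "F"


def mine_trails(initial, transitions, cut_point):
    """Depth-first enumeration of maximal trails with probability >= cut_point."""
    if cut_point <= 0:
        raise ValueError("cut_point must be positive so the search terminates")
    rules = []

    def explore(trail, probability):
        extended = False
        for nxt, p in transitions.get(trail[-1], {}).items():
            # Follow a link only while the trail stays above the cut-point.
            if nxt != FINISH and probability * p >= cut_point:
                extended = True
                explore(trail + [nxt], probability * p)
        if not extended:
            rules.append((trail, probability))  # trail cannot grow any further

    for state, p in initial.items():
        if p >= cut_point:
            explore([state], p)
    return rules


# Hypothetical three-page grammar, for illustration only.
initial = {"A1": 0.6, "A2": 0.4}
transitions = {
    "A1": {"A2": 0.5, "A3": 0.5},
    "A2": {"A3": 1.0},
    "A3": {"F": 1.0},
}
rules = mine_trails(initial, transitions, cut_point=0.25)
trails = {tuple(trail) for trail, _ in rules}
```

Lowering `cut_point` lets more and longer trails survive, which is the growth in the rule set that the heuristics aim to control.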