The semantics of frequent subgraphs:
Mining and navigation pattern analysis
Bettina Berendt
Institute of Information Systems, Humboldt University Berlin
Spandauer Str. 1
D-10178 Berlin, Germany
http://www.wiwi.huberlin.de/berendt
ABSTRACT
The search for frequent subgraphs is a useful extension of
common approaches in Web mining. For example, it allows
the study of revisitation patterns in Web usage and the
discovery of richer navigation structures such as "landmarks"
or "hubs" that serve to organize a user's conceptual map of a
site or a part of the Web. Any use of graph structures in Web
usage mining, however, should also take into account that it
is essential to integrate background knowledge into the
analysis, and that behaviour must be studied at different levels
of abstraction. To capture these needs, we propose to use
taxonomies in mining and to extend the standard notions of
interestingness (frequency/support) by the notion of
context-induced interestingness. The AP-IP mining problem then
consists of finding all frequent abstract patterns and the
individual patterns that constitute them and are therefore
interesting in this context (even though they may be
infrequent). The paper presents the AP-IP algorithm that uses
a taxonomy to search for the abstract and individual
patterns. We also show that the search for label-abstracted but
isomorphic subgraphs does not always give an accurate
image of navigation strategies, and we develop a procedure for
mining at the concept level to solve this problem. A case
study of a real-life Web site shows the advantages of the
proposed solutions.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications -
data mining; H.5.4 [Information Interfaces and
Presentation]: Hypertext/Hypermedia - navigation, user issues
Keywords
Web mining, graph mining, integrating Web content and
semantics into Web mining
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
WebKDD'05, August 21, Chicago, Illinois, USA.
Copyright 2005 ACM, ISBN: 1-59593-214-3, $5.00.
1. INTRODUCTION
Background knowledge is an invaluable help in the mining
of Web data. Examples include the use of ontologies for
text mining and the exploitation of anchor texts for
crawling, indexing, and ranking. In Web usage mining,
background knowledge is needed because the raw data consist
of administration-oriented URLs, whereas site owners and
analysts are interested in events in the application domain.
In the past years, various algorithms for including
background knowledge in mining have been proposed. In
particular, the use of taxonomies in association rule, sequence,
and graph mining has been investigated. In Web usage
mining, taxonomies describe the application events by concepts
that abstract from specific URLs and that form one or more
concept hierarchies.
In the raw data, each individual set, sequence, or graph of
items is generally very infrequent. When relying on
statistical measures of interestingness like support and confidence,
the analyst either finds no patterns or a huge and
unmanageable number. Taxonomies solve this problem and generally
produce a much more manageable number of patterns with
high values of the statistical measures. In addition, they
allow the analyst to add semantic measures of interestingness
into the mining process.
A number of powerful and efficient algorithms and tools
exist that can mine patterns determined by a wide range
of measures. However, they ignore an important source of
interest: taxonomic context. What is behind a frequent
abstract pattern: what types of individual items/behaviour
constitute it, and how are they distributed (e.g., equally
distributed or obeying Zipf's law)? To make patterns with
interestingness induced by the context of their abstraction
visible, a mining tool must support a "drill-down" into the
frequent patterns it has found, and it should provide
simultaneous detail-and-context views on patterns.
This notion of detail and context has a number of
applications that include a deepened understanding of data
and a highly sensitive monitoring of temporal changes in a
volatile Web. By allowing the analyst to focus on
semantically interesting areas (abstract patterns), it can support
the identification of patterns like an abstract-level stability
with a simultaneous individual-level drift, even if this drift
remains below the chosen statistical thresholds, or if this
drift consists of a change in the distribution of the
individual patterns that constitute the abstract pattern. Examples
are variations induced by seasonal influences or changes
in a site's catalog. On the other hand, there may also be
variations on the abstract level, such as changing search
behaviour due to the increasing Internet sophistication of a
site's users.
Such detailed analysis of patterns becomes particularly
interesting when patterns reveal a lot of structure, as in
sequence or in graph mining. Viewing Web navigation as a
graph is interesting because it gives insight into the "mental
map" of a site created by its users: What pages serve
as orientation points to organize the site, which areas are
traversed in linear fashion? Do users cycle between content,
get lost in dead-end streets, or do they return to previously
visited content in order to start new and better-informed
searches from there? Last but not least, important
differences between user groups (e.g., novices and experts) can
be found by looking at their navigation graphs. User
movements can also be seen as votes for content relevance, thus
helping mining and ranking based on the hyperlink graph
structure (such as recent PageRank extensions).
The algorithm described in the present paper provides a
solution to these problems. It combines the need to mine
at abstract levels to find statistically frequent patterns and
the interest in what is behind those patterns, by giving the
possibility of "drilling down" into the abstract patterns in
order to get information about their most typical
individual instances. To do so, it introduces a new definition of
abstract patterns, individual patterns, and the relation
between them. The algorithm is implemented in the AP-IP
tool that visualizes the results as context and detail. In
addition, we argue that a conceptual view of navigation can
lead to patterns that have a different graph structure and are
more informative, but that cannot be found by traditional
mining in item space. We therefore extend the scheme to
mining in concept space.
The paper is organized as follows: Section 2 gives a short
overview of related work. Section 3 describes the algorithm.
Section 4 describes selected results from a case study of
navigation data from an information site, and it motivates the
mining in concept space that is described in more detail in
Section 5 and applied to the case study in Section 6.
Section 7 concludes the paper and gives an outlook on future
research.
2. RELATED WORK
The problem of frequent subgraph mining has been
studied intensively in the past years, with many applications in
bioinformatics/chemistry, computer vision and image and
object retrieval, and Web mining. Algorithms are based on
the a priori principle stating that a graph pattern can only
be frequent in a set of transactions if all its subgraphs are
also frequent. Frequent subgraphs are generated during a
breadth-first search following the Apriori algorithm [1] or in
a depth-first fashion following Eclat [30]. The main
problem of frequent subgraph mining is that two essential steps
involve expensive isomorphism tests: First, during
candidate generation the same graph may be generated multiple
times and has to be compared to already generated
candidates; second, the candidates have to be embedded in the
transactions to compute support.
Several algorithms have been developed to discover
arbitrary connected subgraphs. One research direction has
been to explore ways of computing canonical labels for
candidate graphs to speed up duplicate-candidate detection [17,
26]. Various schemes have been proposed to reduce the
number of generated duplicates [26, 10]. To deal with the
embedding problem (NP-complete in the general form), [7] have
proposed to reduce the number of embeddings, while [5]
store them to allow a fast parallel generation of new
candidates.
Other algorithms constrain the form of the sought
subgraphs: [12, 13] search for induced subgraphs (a pattern
can only be embedded in a transaction if its nodes are
connected by the same topology in the transaction as in the
pattern). [22] exploit the observation that most patterns are
rather simple, searching first for sequences, then for trees,
and then for cyclic graphs.
Going beyond exact matching, [9] preprocess their data
to find rings which they treat as special features and
introduce wildcards for node labels, and [19] find "fuzzy chains",
sequences with wildcards.
The qualitative and quantitative analysis of single users'
navigation graphs has a long tradition (cf. the transfer of
Web graph metrics to navigation graphs [18]), but the
analysis of (general) graph patterns via mining many users'
navigation graphs has received less attention. A number of
studies have employed tree mining (e.g., [29]) and extensions of
sequence mining [2]. The latter allows one to detect patterns
of repeated visits to pages/concepts and cycling behaviour.
The former allows one to detect "hubs" of navigation in the
sense of [25]: centers of breadth-first searches.
Taxonomies have been used to find patterns at different
levels of abstraction for differently-structured data. In [23,
24], frequent itemsets/association rules and frequent
sequences are identified at the lowest possible levels of a
taxonomy. Concepts are chosen dynamically, which may lead
to patterns linking concepts at different levels of
abstraction (e.g., "People who bought 'Harry Potter 2' also bought
books by Tolkien"). In [11], this idea was transferred to
frequent induced subgraph mining.
In Web usage mining, the use of concepts abstracting from
URLs is common (see [3] for an overview and the papers in
[21] for recent examples of the uses for personalization). The
general idea is to map a URL to the most significant contents
and/or services that are delivered by this page, expressed as
concepts. One page may be mapped to one or to a set of
concepts. For example, it may be mapped to the set of
all concepts and relations that appear in its query string.
Alternatively, keywords from the page's text and from the
pages linked with it may be mapped to a domain ontology,
with a general-purpose ontology like WordNet serving as an
intermediary between the keywords found in the text and
the concepts of the ontology [8].
After the preprocessing steps in which access data are
mapped into taxonomies, subsequent mining techniques can
use these taxonomies statically or dynamically. In static
approaches, mining operates on concepts at a chosen level of
abstraction; each request is mapped to exactly one concept
or exactly one set of concepts. This approach is usually
combined with interactive control of the software, so that
the analyst can re-adjust the chosen level of abstraction after
viewing the results (e.g., in the miner WUM; see [4] for a case
study). Alternatively, the taxonomy is used dynamically as
in [23, 24, 11].
The subsumption hierarchy of an existing ontology can
be employed for the simultaneous description of user
interests at different levels of abstraction as in [20]. In [6], a
scheme is presented for aggregating towards more general
concepts when an explicit taxonomy is missing. Clustering
of sets of sessions identifies related concepts at different
levels of abstraction. Sophisticated forms of content extraction
are made possible by latent semantic models [14]. In [8],
association rules, taxonomies, and document clustering are
combined to generate concept-based recommendation sets.
Some approaches use multiple taxonomies related to OLAP:
objects (in this case, requests or requested URLs) are
described along a number of dimensions, and concept
hierarchies or lattices are formulated along each dimension to
allow more abstract views [28].
Algorithm AP-IP($D, T, m, \sigma, K$) (AP-IP frequent subgraph mining)
1: $CIP_1 \leftarrow \emptyset$, $CAP_1 \leftarrow \emptyset$
2: for each edge $e$ in $D$ do
3:     $cip_1 \leftarrow \{e\}$
4:     $CIP_1 \leftarrow CIP_1 \cup \{cip_1\}$
5:     apipgen($cip_1$)
6: $FAP_1 \leftarrow \{cap_1 \in CAP_1 \mid |\bigcup_{cip_1 \in cap_1.IPs} cip_1.TID| \geq \sigma |D|\}$
7: $FIP_1 \leftarrow \{cip_1 \in CIP_1 \mid \exists fap_1 \in FAP_1 : cip_1 \in fap_1.IPs\}$
8: $k \leftarrow 2$
9: while $FAP_{k-1} \neq \emptyset$ and $k \leq K$ do
10:     $CIP_k \leftarrow \emptyset$, $CAP_k \leftarrow \emptyset$
11:     for each $fip_{k-1} \in FIP_{k-1}$ do
12:         for each $fip_1 \in FIP_1$ that can be grown from $fip_{k-1}$ as described in [26] do
13:             $cip_k \leftarrow fip_{k-1} \cup fip_1$
14:             if $cip_k$ has support > 0 and it is not automorphic to any element of $CIP_k$ then
15:                 $CIP_k \leftarrow CIP_k \cup \{cip_k\}$
16:                 apipgen($cip_k$)
17:     $FAP_k \leftarrow \{cap_k \in CAP_k \mid |\bigcup_{cip_k \in cap_k.IPs} cip_k.TID| \geq \sigma |D|\}$
18:     $FIP_k \leftarrow \{cip_k \in CIP_k \mid \exists fap_k \in FAP_k : cip_k \in fap_k.IPs\}$
19:     $k \leftarrow k + 1$
Procedure apipgen($cip_k$) (for simplicity of notation, $CIP_k$, $CAP_k$, $T$, and $m$ are assumed global)
1: $cap_k \leftarrow$ abstraction($cip_k, T, m$)
2: for each $g \in CAP_k$ s.t. $g$ and $cap_k$ have the same number of nodes and edges and the same node labels and degrees do
3:     if $cap_k$ is a known automorphism of $g$ then
4:         $g.IPs \leftarrow g.IPs \cup \{cip_k\}$
5:         return
6: for each $g \in CAP_k$ s.t. $g$ and $cap_k$ have the same number of nodes and edges and the same node labels and degrees do
7:     if $cap_k.cl = g.cl$ then
8:         $g.IPs \leftarrow g.IPs \cup \{cip_k\}$
9:         return
10: $cap_k.IPs \leftarrow \{cip_k\}$
11: $CAP_k \leftarrow CAP_k \cup \{cap_k\}$
3. AP-IP FREQUENT SUBGRAPH MINING
Based on the requirements described in the introduction,
we formulate the AP-IP mining problem: Given a dataset
of graph transactions $D$ consisting of items with labels from
a set $I$, a taxonomy $T$ consisting of concepts from a set $C$,
a mapping $m: I \mapsto C$, and a minimum support $\sigma \in [0, 1]$,
find all frequent abstract subgraphs and their corresponding
AP-frequent individual subgraphs.
Definition 1. An abstract subgraph is a connected graph
$G = (V, E)$ with node labels given by $l^a_G: V \mapsto C$. A
frequent abstract subgraph is one that can be embedded
in at least $\sigma |D|$ transactions.
An AP-frequent individual subgraph of $G$ is defined
as follows: (a) It is a graph $G' = (V', E')$ with node labels
given by $l^i_{G'}: V' \mapsto I$. (b) There exists a frequent abstract
subgraph $G$ such that the graph $G'' = (V', E')$ with node
labels given by $m: I \mapsto C$, with
$(m \circ l^i_{G'})(v') = m(l^i_{G'}(v'))$, is
an automorphism of $G$. $G''$ is also called the abstraction
of $G'$ with respect to $T$ and $m$. (c) $G'$ is automorphic to at
least one subgraph of at least one transaction.
$G$ can be embedded in a transaction $d \in D$ if it has an
AP-frequent individual subgraph $G'$ such that $G'$ is automorphic
to at least one subgraph of $d$.
Frequent subgraphs are also called "patterns", or APs
(abstract patterns) and IPs (individual patterns).
An example shows the role of abstract and individual
patterns: If 10% of the transactions contain the chain (Harry
Potter 1, Lord of the Rings), 5% contain (Harry Potter 2,
The Hobbit), and 7% contain (Harry Potter 1, Lord of the
Rings, The Hobbit, Harry Potter 2) (all sets disjoint), and
the minimal support is 20%, then (Rowling book, Tolkien
book) has support 22% and is a frequent abstract pattern. Its
AP-frequent individual patterns are (Harry Potter 1, Lord of
the Rings) (17%) and (Harry Potter 2, The Hobbit) (12%).
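The arithmetic of this example can be checked with a few lines of Python. This is only an illustrative sketch: the 100 integer session IDs and the dictionary names are assumptions made for the illustration, not part of the original data.

```python
# Worked version of the example with 100 hypothetical sessions (integer IDs).
N = 100
only_hp1_lotr = set(range(0, 10))      # 10%: (Harry Potter 1, Lord of the Rings)
only_hp2_hobbit = set(range(10, 15))   # 5%: (Harry Potter 2, The Hobbit)
four_chain = set(range(15, 22))        # 7%: the four-item chain, which contains both pairs

# Transaction-ID sets of the two individual patterns (IPs).
ip_tids = {
    ("Harry Potter 1", "Lord of the Rings"): only_hp1_lotr | four_chain,
    ("Harry Potter 2", "The Hobbit"): only_hp2_hobbit | four_chain,
}

# Mapping m from items to concepts (a one-level taxonomy).
m = {
    "Harry Potter 1": "Rowling book",
    "Harry Potter 2": "Rowling book",
    "Lord of the Rings": "Tolkien book",
    "The Hobbit": "Tolkien book",
}

# An abstract pattern (AP) is supported by the union of its IPs' transactions.
ap_tids = {}
for ip, tids in ip_tids.items():
    ap = tuple(m[item] for item in ip)
    ap_tids.setdefault(ap, set()).update(tids)

ap_support = {ap: len(tids) / N for ap, tids in ap_tids.items()}
ip_support = {ip: len(tids) / N for ip, tids in ip_tids.items()}

print(ap_support)
print(ip_support)
```

Neither individual pattern reaches the 20% threshold on its own; both become interesting only through the abstract pattern they share.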
The algorithm to solve this problem, AP-IP, is shown in
the table above.
The following notation is used: a $k$-(sub)graph $g_k$ is a (sub)graph
with $k$ edges. $CIP_k$ ($CAP_k$) is a set of IP (AP) candidates with
$k$ edges. $FAP_k$ ($FIP_k$) is the set of frequent abstract subgraphs
(AP-frequent individual subgraphs) with $k$ edges. $cip_k$, $cap_k$,
$fip_k$, $fap_k$ are elements of these sets. $K$ is the maximal number of
edges in patterns sought. For simplicity of notation, we assume
that $K \geq 2$. $g.cl$ is the canonical label of $g$. $g.IPs$ are the
AP-frequent individual subgraphs of $g$. $g.TID$ is the transaction-ID
list of $g$, i.e. the ordered list of transactions that contain $g$.
AP-IP uses and extends various ideas to achieve a fast
processing of graphs:
Search space traversal. The search for patterns
proceeds breadth-first (see AP-IP, step 9./19.). This simplifies
the search for possible duplicate APs (see below). Pattern
search terminates when no more patterns exceed the
support threshold, or when the maximal sought pattern size is
exceeded.
IP candidate generation. In each round,
individual-subgraph candidates are generated: all subgraphs that occur at
least once in the data. For $k = 1$, these are all single edges in
the dataset (AP-IP, 3., 4.). For larger $k$, a pattern candidate
is any set of $k$ edges that can be grown from a frequent
pattern of size $k - 1$ by extending it by an extra edge
(AP-IP, 12., 13., 15.), eliminating those candidates that have no
support in the data and those that have been discovered
before (AP-IP, 14.).
To reduce the number of generated candidates, frequent
patterns are extended by one edge at a time ("growing"),
and growing occurs only from the rightmost path [26]. For
a proof of completeness, see [27].
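The one-edge-at-a-time growth can be sketched as follows. This is a deliberately simplified illustration: it extends a pattern by any adjacent frequent edge and does not implement the rightmost-path restriction of [26], so it generates more candidates than AP-IP would. The function name `grow` and the edge representation (sorted pairs of item labels) are assumptions of this sketch.

```python
def grow(pattern, frequent_edges):
    """Return all candidates obtained by adding one adjacent frequent edge.

    pattern: frozenset of edges, each edge a sorted 2-tuple of item labels.
    frequent_edges: iterable of edges in the same representation.
    Simplification: no rightmost-path restriction is applied.
    """
    nodes = {v for edge in pattern for v in edge}
    candidates = set()
    for edge in frequent_edges:
        # The new edge must share at least one node with the pattern.
        if edge not in pattern and nodes.intersection(edge):
            candidates.add(pattern | {edge})
    return candidates


fip_1 = {("a", "b"), ("b", "c"), ("c", "d")}
print(grow(frozenset({("a", "b")}), fip_1))
```

Because candidates are stored as sets of edges, the same edge set reached along two different growth paths collapses into a single candidate.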
Individual-level bijective label functions. The IP
candidates are treated as sets of edges because it is assumed
that the graph of each transaction in $D$ has a bijective label
function $l^i$.
This assumption is motivated by the domain: Graph
analysis only becomes meaningful if each node is treated as
a unique "location in hyperspace" which can be revisited.
Thus, in any transaction/session each potential URL was
either requested or not, so each transaction's nodes (edges)
are a subset of $I$ ($I \times I$).
This assumption implies that AP-IP can be transferred to other
domains in which this unique-names assumption also holds. So
for example, AP-IP could be used for texts dealing with proper
names, but not for most molecular-mining tasks. AP-IP could
easily be extended to also discover patterns in transactions with
a non-bijective label function, but it would require the same
automorphism tests for duplicate detection as those carried out for
APs, and it would change the tests for embeddings in the data.
The effects on runtime remain to be investigated.
IP candidate duplicate removal. Given the bijective
label function, an IP's lexicographically sorted edges, labeled
by their node pair, define a canonical form that can be
computed efficiently (list sorting in AP-IP, 13.) and allows for
efficient duplicate detection (comparison of two ordered lists,
AP-IP, 14.).
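A minimal sketch of this canonical form, assuming undirected edges given as pairs of unique item labels. The helper names and the use of a hash set for the duplicate check (cf. the hashtable lookup mentioned in the discussion of time requirements) are choices of this sketch.

```python
def canonical_form(edges):
    """Sorted tuple of undirected edges, each edge with its endpoints sorted."""
    return tuple(sorted(tuple(sorted(edge)) for edge in edges))


seen = set()


def is_new_candidate(edges):
    """Register an IP candidate; return False if an identical one is known."""
    form = canonical_form(edges)
    if form in seen:
        return False
    seen.add(form)
    return True


print(canonical_form([("c", "b"), ("b", "a")]))
print(is_new_candidate([("a", "b"), ("b", "c")]))
print(is_new_candidate([("c", "b"), ("b", "a")]))
```

Sorting suffices here only because node labels are unique within a transaction; with repeated labels, a full isomorphism test would be needed instead.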
IP candidate support counting and pruning (part I).
Bijective label functions also allow for efficient identification
of embeddings: a graph $G'$ is a subgraph of $d$ iff the
canonical form of $G'$ is a subset of the canonical form of $d$. Each
edge is initialized with a sorted list of all transaction IDs
that contain this edge (not shown in the pseudocode; see
for example [17]). Each IP candidate's TID list is then the
intersection of all of its edges' TID lists. This can be
computed in one step by intersecting two sorted lists: the TID
lists of the first $fip_{k-1}$ and $fip_1$ that are found to form this
candidate (AP-IP, 13.).
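The intersection of two sorted TID lists is a standard linear merge. The following sketch shows one way to write it; the function name and the example lists are illustrative.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of transaction IDs in O(len(a) + len(b))."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


tid_fip_k_minus_1 = [1, 3, 4, 7, 9]   # sessions containing the (k-1)-edge pattern
tid_fip_1 = [2, 3, 7, 8, 9]           # sessions containing the added edge
print(intersect_sorted(tid_fip_k_minus_1, tid_fip_1))
```

An empty result means the candidate has no support in the data and can be dropped immediately.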
Support pruning at the IP level filters out all IPs that
would in the best case not contribute support to their AP
abstractions and in the worst case lead to the generation of
redundant, zero-support APs.
One structural and one semantic type of pattern were
deemed uninteresting for the domain and are filtered out to
increase efficiency: self-loops and edges involving items that
are not mapped to a concept by the taxonomy (in AP-IP,
3.; not shown in the pseudocode).
AP candidate generation and duplicate removal. Next,
the AP candidate corresponding to the identified individual
pattern candidate is formed. First, the individual node
labels are mapped to their abstractions according to the
taxonomy $T$ and the concepts from $T$ that are chosen by $m$
(apipgen, 1.).
Usually, there is more than one IP that maps to an AP.
To simplify duplicate checking, each IP is mapped to and
each AP is stored as a sparse adjacency-list representation.
To limit the number of tests, a newly considered AP
candidate is compared only to known candidates to which it can
be automorphic. This step relies on vertex invariants, i.e.
standard procedures from isomorphism testing (apipgen, 2.,
6.). First, an equality test may show that the new candidate
has already been found in the same order of node labels and
edges, i.e. that it is a known automorphism (apipgen, 3.; see
also [17]). Alternatively, the new candidate may be an as-yet
unknown automorphism. To verify this, its canonical label is
computed (apipgen, 7.). Isomorphism/automorphism
testing by canonical-label comparison is generally, but of course
not always, fast; see the literature in Section 2. The
canonical form is determined by the maximum label sequence and
the maximum upper triangular of the adjacency matrix as
in [13]. In either case, the IP is added to the set of IPs
contributing to the AP candidate (apipgen, 4./5., 8./9.).
The graph $cap_k$ is only registered as a new candidate if
both tests failed (apipgen, 10., 11.).
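The grouping of IPs under a common AP can be illustrated with a brute-force canonical label. This sketch is a stand-in, not the canonical form of [13]: it takes the minimum encoding over all node orderings, which is only feasible for small patterns, and it omits the vertex-invariant prefilter and the known-automorphism shortcut. All function names and the example mapping are assumptions.

```python
from itertools import permutations


def canonical_label(ip_edges, m):
    """Brute-force canonical label of the abstraction of an IP.

    ip_edges: iterable of (item, item) pairs; m: dict mapping item -> concept.
    Two IPs receive the same label iff their abstractions are isomorphic.
    """
    nodes = sorted({v for edge in ip_edges for v in edge})
    best = None
    for order in permutations(nodes):
        position = {v: i for i, v in enumerate(order)}
        labels = tuple(m[v] for v in order)
        edges = tuple(sorted(tuple(sorted((position[u], position[v])))
                             for u, v in ip_edges))
        encoding = (labels, edges)
        if best is None or encoding < best:
            best = encoding
    return best


def group_by_ap(ips, m):
    """Collect IPs whose abstractions coincide, keyed by canonical label."""
    groups = {}
    for ip in ips:
        groups.setdefault(canonical_label(ip, m), []).append(ip)
    return groups


m = {"hp1": "Rowling", "hp2": "Rowling", "lotr": "Tolkien", "hobbit": "Tolkien"}
a = {("hp1", "lotr"), ("lotr", "hp2")}      # Rowling - Tolkien - Rowling
b = {("hp2", "hobbit"), ("hobbit", "hp1")}  # Rowling - Tolkien - Rowling
c = {("lotr", "hp1"), ("hp1", "hobbit")}    # Tolkien - Rowling - Tolkien
print(len(group_by_ap([a, b, c], m)))
```

Patterns a and b end up under the same AP although they share no edge at the item level, while c has the same multiset of concept labels but a different structure.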
This procedure ensures that all abstract pattern
candidates are found: (a) all individual pattern candidates with
at least one embedding in the data are found (see above),
(b) each individual pattern has a unique abstraction, and
(c) all individual patterns that map to the same AP are
collected into that AP's IPs set by the automorphism tests in
apipgen.
Support counting and pruning. The AP can be
embedded in a transaction if and only if at least one of its IPs can
be embedded in that transaction, which leads to the
computation of its support as the cardinality of the union of the
IPs' TID lists (AP-IP, 6./17.).
The candidate APs with sufficient support values become
frequent abstract patterns (AP-IP, 6./17.). According to the
definition of AP-frequent individual patterns, only those cip
that map to an abstract pattern now identified as frequent
are transferred to the set of frequent individual patterns
(AP-IP, 7./18.).
Support counting and pruning are efficient because they
operate entirely on the TID lists; no access to the data and
no subgraph embedding is required.
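Steps 6./17. and 7./18. can be sketched on TID sets alone. The function below is an illustration under assumed data structures (a dictionary from each AP candidate to its IPs, and a dictionary from each IP to its TID set); the names are not taken from the AP-IP implementation.

```python
def frequent_patterns(ap_to_ips, ip_tids, n_transactions, min_support):
    """Return (frequent APs with their support, AP-frequent IPs).

    An AP's support is the size of the union of its IPs' TID sets,
    relative to the number of transactions.
    """
    frequent_aps = {}
    frequent_ips = []
    for ap, ips in ap_to_ips.items():
        tids = set().union(*(ip_tids[ip] for ip in ips))
        if len(tids) >= min_support * n_transactions:
            frequent_aps[ap] = len(tids) / n_transactions
            frequent_ips.extend(ips)   # kept regardless of their own support
    return frequent_aps, frequent_ips


ip_tids = {
    "hp1-lotr": set(range(0, 10)) | set(range(15, 22)),
    "hp2-hobbit": set(range(10, 15)) | set(range(15, 22)),
    "hp1-hp2": {0, 1},
}
ap_to_ips = {
    "Rowling-Tolkien": ["hp1-lotr", "hp2-hobbit"],
    "Rowling-Rowling": ["hp1-hp2"],
}
print(frequent_patterns(ap_to_ips, ip_tids, 100, 0.2))
```

The individual patterns are retained because their abstraction is frequent, even though each of them falls below the threshold by itself.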
This completes one round; the next largest patterns can
now be examined (AP-IP, 8./19.).
Memory management. All (AP-)frequent patterns are
output after they have been computed. The AP-frequent
individual patterns of size 1 must be kept in memory because
they are needed in each iteration (AP-IP, 12.), and the
AP-frequent individual patterns of size $k - 1$ are needed for
growing (AP-IP, 11.). All other patterns are discarded to
save memory.
AP-IP also implements a frequency threshold ($\geq 1$)
instead of the support threshold mentioned above. The
extension is straightforward because the support test operates
on transaction-ID lists anyway; it is therefore not shown.
Figure 1: Top (a), (b): Impact of data set size and number of patterns in isolation; pS = 0.8; pL = 0.05;
minsupp = 0.05. Bottom (c), (d): Combined impact of data set size and number of patterns; same parameters.
In the current version, AP-IP treats only undirected graphs
with node labels. The generalization to directed as well as
node-and-edge labeled graphs is straightforward.
In principle, AP-IP could also work "backwards", by
growing at the AP level and then instantiating to IPs. This would
correspond to taxonomy-free frequent subgraph mining
algorithms. However, this would require an investigation of at
least the same number of IPs (since all of them need to be
generated as instantiations, and possibly further ones that
also match a grown AP), and it would require either
sophisticated bookkeeping of the growing history or an
isomorphism check with the AP from which an IP candidate was
generated. Simulation results indicate that this guaranteed
additional effort is not offset by any savings in candidate
generation and isomorphism testing at the AP level.
Time and space requirements. AP-IP's worst-case time and
space requirements are determined by (a) the size of
the dataset, (b) the number of patterns (abstract and
individual), and the (c) diversity and (d) length of patterns.
Other factors have indirect influences on these magnitudes;
in particular, a lower support threshold or a larger
concept branching factor generally leads to more, more diverse,
and/or longer patterns. However, more data also often lead
to more patterns, and more patterns are often longer
patterns.
Described briefly, (a) influences time and space
requirements linearly because AP-IP operates only on the sorted
TID lists. (b) has a linear influence (CIP canonical form
construction: adding an edge to a sorted list; CIP duplicate
checking: lookup in a hashtable indexed by canonical form;
CIP abstraction by mapping each node). The influence on
space is obvious because patterns are all that is stored (apart
from TID lists). The influence of the number of abstract
patterns is governed by (c) and (d). (c) can imply additional
time during AP duplicate detection when many
"similarly-looking" APs (apipgen, 2.) exist, and when IPs are diverse
so that many different automorphisms occur and require a
canonical-form computation. In principle, the number and
type of candidate patterns are decisive; in line with the
literature and the focus on data and patterns, we will however
treat the (AP-)frequent patterns as an approximation. (d)
has a linear effect on space requirements and on the IPs'
time requirements, but can require more than linear time
because in AP canonical form construction, longer patterns
generally require the study of more permutations.
To investigate actual performance, simulated and real data
were analyzed. For reasons of space, we focus on
performance with respect to simulated data only in this section
and on application-oriented results with respect to real data
in the next section. (Main performance results were similar.)
AP-IP was implemented in Java and run on a Pentium
4 3GHz processor, 1GB main memory PC under Debian
Linux. The simulated data were generated by a first-order
Markov model (a popular model of Web navigation) varying
the number of transactions/sessions, the branching factor b
of the concepts in a one-level taxonomy (and thus the
number of patterns found in a total of 100 "URLs"), the diversity
of patterns (each node had a parametrized transition
probability pS to one other, randomly chosen node and an equal
distribution of transition probabilities to all other nodes),
and the length of patterns (the transition probability pL to
"exit"; the average session length becomes 1/pL, and
pattern length increases with it).
Figure 2: Impact of pattern length: the steep curve in (b) belongs to the long patterns in (a).
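A session generator of the kind used for the simulated data might look as follows. This is a sketch under several assumptions that the description leaves open: the start page is chosen uniformly, self-transitions are excluded, and the exit decision with probability pL is taken before each further request. Function and variable names are hypothetical.

```python
import random


def make_model(n_urls, rng):
    """Pick, for every node, one preferred successor different from itself."""
    return {u: rng.choice([v for v in range(n_urls) if v != u])
            for u in range(n_urls)}


def simulate_session(preferred, n_urls, p_s, p_l, rng):
    """Generate one session from a first-order Markov model.

    With probability p_l the session ends; otherwise the walk moves to the
    preferred successor with probability p_s, or to a uniformly chosen other
    node. The number of requests is geometric with mean 1 / p_l.
    """
    node = rng.randrange(n_urls)
    session = [node]
    while rng.random() >= p_l:
        if rng.random() < p_s:
            node = preferred[node]
        else:
            others = [v for v in range(n_urls)
                      if v != node and v != preferred[node]]
            node = rng.choice(others)
        session.append(node)
    return session


rng = random.Random(0)
preferred = make_model(100, rng)
sessions = [simulate_session(preferred, 100, 0.8, 0.05, rng) for _ in range(2000)]
print(sum(len(s) for s in sessions) / len(sessions))
```

With pL = 0.05 the printed average session length should lie close to 1/pL = 20; a higher pS concentrates the traffic on fewer edges and thus lowers pattern diversity.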
First, the influence of data set size and number of patterns
was studied in isolation: 1000 sessions were duplicated to
generate 2000 sessions with the same number of patterns,
concatenated 3 times to generate 3000 sessions, etc. Figure
1 (a) shows that time is linear in data set size, and that
the slope increases with the number of patterns (induced by
increasing the concept branching factor). Figure 1 (b) plots
the same data (plus one run with an intermediate
branching factor) by the number of patterns, keeping the number
of sessions constant along each line. It shows that in this
setting, time is sublinear in the number of patterns. Figure
1 (c) shows the combined effects. The thick line illustrates
the effect of adding 1000 more, 2000 more, etc. different
sessions, while the thin lines show the effect of only duplicating
the data set (as in (a)).
Figure 1 (d) plots the additional time (= the vertical
difference between thin-line endpoints and the thick line in (c))
against additional patterns, revealing a linear effect.
Similar graphs are obtained for a wide variety of
parameter settings, and also for real data. However, time
requirements increase strongly when patterns get bigger: Fig.
2 (b) shows that the impact of additional patterns rises
more strongly for big than for small patterns (as =
average size/number of edges), which leads to a slightly
above-linear increase in time with respect to data set size. Since
Web navigation data generally exhibit "short" or "medium"
patterns, this problem concerns them only rarely.
The program required substantial memory (up to 500MB
for the large patterns in Fig. 2), which is very undesirable
and partly due to Java's incomplete garbage collection. We
plan a re-implementation in C++ to substantially improve
both memory consumption and the absolute level of
runtimes.
4. CASE STUDY (1)
Data. Navigation data from a large and heavily frequented
eHealth information site were investigated. The site offers
different searching and browsing options to access a database
of diagnoses (roughly corresponding to products in an
eCommerce product catalog). Each diagnosis has a number of
different information modes (textual description, pictures, and
further links), and diagnoses are hyperlinked via a medical
ontology (links go to differential diagnoses). The interface
is identical for each diagnosis, and it is identical in each
language that the site offers (at the time of data collection:
English, Spanish, Portuguese, and German).
The initial data set consisted of close to 4 million requests
collected in the site's server log between November 2001 and
November 2002. In order to investigate the impact of
language and culture on (sequential) search and navigation
behaviour, the log was partitioned into accesses by country,
and the usual data preprocessing steps were applied [15, 16].
For the present study, the log partition containing accesses
from France was chosen as the sample log; the other
country logs of the same size exhibited the same patterns,
differing only in the support values. The sample log consisted
of 20333 requests in 1397 sessions. The concept hierarchy
and two mappings m developed for [16] were used to ensure
comparability.
In the following, selected results from the analysis are
reported that outline the advantages of treating sessions and
patterns as graphs rather than as sets, bags, or sequences of
requests. The focus is on patterns with more than 2 nodes,
since patterns with only 2 nodes are by definition chains and
can also be found using sequence mining.
Basic statistics. The log is an example of concepts
corresponding to a high number of individual URLs (84 concepts
with, on average, 82 individual URLs), and highly diverse
behaviour at the individual level. This is reflected in a high
ratio of the number of IPs to APs (compared to the
simulated data above): At a support value of 0.2, there were
7 frequent APs of size 1–5 with 55603 AP-frequent IPs; at
0.1, the values were: size 1–8, 34 APs, 414685 IPs; at 0.05,
the values for K = 6 were: 104 APs, 297466 IPs. This
search was not extended because no further (semantically)
interesting patterns were found.
Linear information gathering. With a minimal support
of 0.2 or higher, the only patterns with more than two nodes
were chains of diagnoses. These patterns were very frequent:
a chain of 6 diagnoses had support 22.2% (5: 26.7%, 4:
32.7%, 3: 39.3%).
Lowering the support threshold showed that even longer
chains were still comparatively frequent (13.8% support for
chains of 9 diagnoses) and revealed that alphabetical search
was the most frequent entry point to these chains
(alphabetical search followed by 1, 2, or 3 diagnoses had 19%, 15.5%,
and 11.5% support).
Investigation of the most frequent individual patterns in
alphabetical search showed a possible effect of layout: the
top-level page of diagnoses starting with "A" was the most
frequent entry point. While this pattern is in itself not too
interesting, it can help to filter out accesses of "people just
browsing" in order to focus further search for patterns on
more directed information search.
Figure 3 shows a screenshot of the AP-IP visualization. In
a preprocessing step, the patterns that only occurred once
in the data were filtered out, which explains the
comparatively low number of individual patterns associated with an
abstract pattern (in the example: 47). It turned out that
individual patterns with frequency = 1 constituted 99% of
all patterns.
Figure 3: Visualization of abstract patterns, subpatterns, superpatterns, and individual patterns.
A Closer Look at Linear Navigation Patterns. The
finding that people just "surfed through" long chains of
diagnoses is surprising and difficult to reconcile with the
knowledge about search behaviour collected in the direct
observations of users who usually targeted a small number
of diseases for information collection. Also, the inspection
of the most frequent individual patterns involving several
diagnoses showed pairs or chains of two pictures of the same
disease.
An inspection of single sessions revealed that visitors often
return to the same content in the following sense: They request
one picture, or the picture overview, of a diagnosis, then study a
different diagnosis (e.g., a differential diagnosis), and then
return to read, in detail, the text of the first diagnosis. In
other words, although they visit 3 different URLs, forming the
abstract pattern "diagnosis - diagnosis - diagnosis", they in fact
gather information on the concepts "diagnosis 1 - diagnosis 2 -
diagnosis 1".
This non-linear behaviour cannot be identified in the current
framework. The next section will investigate the problem and
propose a solution.
5. MINING IN CONCEPT SPACE
The problem of item-level patterns vs. concept-level patterns can
be illustrated using the following example:
Consider two transactions $[a_1 - a_2 - a_3 - a_4]$ and
$[a_5 - a_6 - a_7 - a_8]$ and let the $a_i$ be instances of the
following concepts: $a_1, a_4 \mapsto b_1$, $a_2 \mapsto b_2$,
$a_3 \mapsto b_3$, $a_5, a_8 \mapsto b_4$, $a_6 \mapsto b_5$,
$a_7 \mapsto b_6$. All $b_j$ are instances of concept $c$. Let
minsupp be 1. Using the framework of [23, 24] that is also the
basis of [11], the only graphs that are frequent are the chains
$c - c$, $c - c - c$, and $c - c - c - c$. This is because only
graphs that are isomorphic to subgraphs of the original
transactions can be patterns. (The problem does not occur in
sequence mining, but an analogous transfer of the algorithm in
[24] to the problem of subgraphs suggests that this isomorphism to
subgraphs of items is intended. In [11], the problem is not
discussed, but the article suggests that only isomorphic graphs
can be concept-hierarchy abstractions of parts of the
transactions.)
However, both first-level abstractions $b_1 - b_2 - b_3 - b_1$ and
$b_4 - b_5 - b_6 - b_4$ form a ring of three distinct subconcepts
of $c$. Alternatively, they form rings of superconcepts of the
$a_i$ involved.
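The example can be checked mechanically. The following Python sketch is an illustration only: the helper names `abstract_graph` and `is_ring` and the dictionary encoding of the taxonomy are assumptions made here, not part of APIP. It relabels each transaction path with its first-level concepts, merges vertices that share a label, and tests whether the result is a ring.

```python
# Illustrative sketch: helper names and the dict-based taxonomy encoding
# are assumptions for this example, not part of the APIP algorithm.

def abstract_graph(path, label):
    """Relabel a path of items and merge vertices that share a label.

    Returns (vertices, edges) of the resulting undirected graph, with
    edges stored as frozensets; self-loops are dropped.
    """
    labels = [label[item] for item in path]
    vertices = set(labels)
    edges = {frozenset((u, v)) for u, v in zip(labels, labels[1:]) if u != v}
    return vertices, edges


def is_ring(vertices, edges):
    """True if a graph assumed to be connected is a single cycle."""
    degree = {v: 0 for v in vertices}
    for edge in edges:
        for v in edge:
            degree[v] += 1
    return len(vertices) == len(edges) and all(d == 2 for d in degree.values())


# Item -> first-level concept, as in the example above.
to_b = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b1",
        "a5": "b4", "a6": "b5", "a7": "b6", "a8": "b4"}
# Identity labelling: each item keeps its own label (item space).
identity = {item: item for item in to_b}

t1 = ["a1", "a2", "a3", "a4"]
t2 = ["a5", "a6", "a7", "a8"]

ring1 = is_ring(*abstract_graph(t1, to_b))       # b1 - b2 - b3 - b1
ring2 = is_ring(*abstract_graph(t2, to_b))       # b4 - b5 - b6 - b4
chain1 = is_ring(*abstract_graph(t1, identity))  # a1 - a2 - a3 - a4
print(ring1, ring2, chain1)
```

Under these assumptions both abstracted transactions come out as rings of three vertices, while the unabstracted path does not, which is the structure that a search restricted to graphs isomorphic to item-level subgraphs cannot report.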
In analogy to the use of the term "item" in [23], we call the
standard form of mining with taxonomies "mining in item space". We
will refer to the alternative form as "mining in concept space".
To be able to use APIP, we translate the problem "finding all
frequent concept-space abstract subgraphs in the set of
transactions $D$" into the equivalent problem "finding all
frequent item-space abstract subgraphs in the set of transactions
$D'$", with $D'$ given by the following definition.
Definition 2. The concept space representation $D'$ of a set of
transactions $D$ is defined as follows: Let
$r: I \mapsto C \cup I$ be a function s.t. for each $i$, either
$r(i)$ is a descendant of $m(i)$ in the taxonomy $T$ as described
in Definition 1 (this holds for at least one $i$), $r(i) = m(i)$,
or $r(i) = i$. Let $D'$ be a set of transactions obtained from $D$
by, for each transaction $X$, (1) replacing the label function
$l^i_X$ by $r \circ l^i_X$ and (2) coarsening the graph $X$ such
that all vertices with the same label are merged and the label
function is again bijective.
Figure 4: Non-linear patterns found when mining in concept space.
During mining, the function $m$ in Definition 1 must be replaced
by a function $m'$ s.t. $m'(r(i)) = m(i)$. Different levels of
abstraction for mining in concept space can be defined by varying
the number of nodes on the paths between $r(i)$ and $m(i)$. In the
following example, we keep both functions fixed, with a path
distance of 1 for concepts of interest.
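The condition $m'(r(i)) = m(i)$ can only be satisfied if all items that share a value $r(i)$ also share $m(i)$. A minimal Python sketch of this check follows. The function name `derive_m_prime`, the dictionary representation of $r$ and $m$, and the URLs and concept names are hypothetical and serve only as an illustration.

```python
def derive_m_prime(r, m):
    """Build m' with m'(r(i)) = m(i) for every item i.

    r and m are dicts over the same items. Raises ValueError if two items
    with the same r(i) have different m(i), because m' would then not be
    a function.
    """
    m_prime = {}
    for item, target in r.items():
        abstract = m[item]
        # setdefault stores the first value seen and returns the stored one.
        if m_prime.setdefault(target, abstract) != abstract:
            raise ValueError(f"m' is ambiguous for {target!r}")
    return m_prime


# Hypothetical URLs and concept names, for illustration only.
m = {"/d1/text": "diagnosis", "/d1/pic": "diagnosis", "/alph/a": "ALPH"}
r = {"/d1/text": "info on d1", "/d1/pic": "info on d1", "/alph/a": "ALPH"}

m_prime = derive_m_prime(r, m)
print(m_prime)
```

In this toy mapping the search page has $r(i) = m(i)$, so $m'$ maps it to itself, while the two diagnosis URLs sit one taxonomy step below their abstract concept.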
When analyzing Web server log data, the transformation to concept
space representation is trivial; it requires only one pass through
the data to apply the mapping $r$, before the subsequent input
routine transforms the sequential log format into graphs as in the
item-space problem.
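The single pass described above might look as follows. This is a sketch under stated assumptions, not the input routine used in the study: the log is assumed to be already sessionized into (session id, URL) pairs in request order, edges are treated as undirected, and the function name `log_to_concept_graphs`, the URLs, and the labels are hypothetical.

```python
from collections import defaultdict


def log_to_concept_graphs(requests, r):
    """One pass over (session_id, url) pairs, given in request order.

    Each URL is mapped through r (URLs missing from r are left unchanged).
    Each session becomes a graph whose vertices are the distinct labels and
    whose undirected edges join labels that were requested consecutively.
    """
    vertices = defaultdict(set)
    edges = defaultdict(set)
    last = {}
    for session, url in requests:
        label = r.get(url, url)
        vertices[session].add(label)  # equal labels merge automatically
        prev = last.get(session)
        if prev is not None and prev != label:
            edges[session].add(frozenset((prev, label)))
        last[session] = label
    return {s: (vertices[s], edges[s]) for s in vertices}


# Hypothetical mapping and log excerpt, for illustration only.
r = {"/d1/pic": "info on d1", "/d1/text": "info on d1",
     "/d2/text": "info on d2"}
requests = [("s1", "/d1/pic"), ("s1", "/d2/text"), ("s1", "/d1/text"),
            ("s2", "/help"), ("s2", "/d2/text")]

graphs = log_to_concept_graphs(requests, r)
```

Session `s1` visits three different URLs but yields only two vertices joined by one edge, which corresponds to the revisit pattern "diagnosis 1 - diagnosis 2 - diagnosis 1" described in the first case study.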
6. CASE STUDY (2)
The log was analyzed again in concept space. The function $r$
transformed all information types on a diagnosis X into
"information on X", and mapped the search functions to their
abstract values ("LOKAL1" = first step of localization search,
etc.). Remaining URLs were left unchanged.
Information gathering in concept space. The analysis showed that
most of the chains of diagnoses were in fact navigation structures
which involved revisits to "key" diagnoses that served as "hubs"
for navigation.
Chains of diagnoses lost support: a chain of 6 diagnoses had
support 7.2% (5: 9.2%, 4: 13%, 3: 18.9%). Instead, patterns like
diagnoses with 3 or more other diagnoses branching off the "hub"
diagnosis shown at the left of Fig. 4 (support 5.3%) became
frequent. Rings also occurred at slightly lower support thresholds
(see Fig. 4, second from left).
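The pattern shapes distinguished here (chains, rings, hubs) can be told apart from vertex degrees alone. The sketch below is only an illustration of that distinction. It assumes a connected, undirected, simple graph given as a list of edges, and the function name `shape` and the category names are assumptions. It is not a step of APIP.

```python
from collections import Counter


def shape(edges):
    """Classify a connected, undirected, simple graph given as 2-tuples."""
    if not edges:
        return "other"
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_vertices, n_edges = len(degree), len(edges)
    if n_edges == n_vertices and all(d == 2 for d in degree.values()):
        return "ring"
    if n_edges == n_vertices - 1:
        # A connected graph with |V| - 1 edges is a tree.
        if max(degree.values()) <= 2:
            return "chain"
        if sum(1 for d in degree.values() if d >= 3) == 1:
            return "hub"  # exactly one vertex with three or more neighbours
        return "tree"
    return "other"


chain = shape([("d1", "d2"), ("d2", "d3")])
ring = shape([("d1", "d2"), ("d2", "d3"), ("d3", "d1")])
hub = shape([("h", "d1"), ("h", "d2"), ("h", "d3")])
print(chain, ring, hub)
```

A classification of this kind could help to summarize which shapes dominate a result set, for example when deciding whether full subgraph mining is needed for a given log.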
Search options: linear vs. hub-and-spoke. Search in concept space
also showed the functioning of search options more clearly.
First, the use of the search engine appeared only in very few
patterns (only in 2-node patterns: 4.2% for search engine and a
diagnosis, 3.5% for search engine and alphabetical search,
probably a subsequent switch to the second, more popular search
option). This was because the search engine was less popular than
the other search options (used about 1/10 as often), but far more
efficient in the sense that searches generally ended after the
first diagnosis found (assuming that finding a diagnosis was the
goal of all search sessions). This is consistent with results from
our other studies of search behaviour [2].
The alphabetical search option generally prompted a
"hub-and-spoke navigation", as shown in the third example from the
left of Fig. 4 (support 6.4%). In contrast, localization search
generally proceeded in a linear or depth-first fashion, as shown
on the right of Fig. 4 (support 5%; with one diagnosis less:
6.9%).
This may be interpreted as follows: Localization search prompts
the user to specify, on a clickable map, the body parts that
contain the sought disease. This is in itself a search that can be
refined (LOKAL1 - LOKAL2 in the figure; a similar pattern of
LOKAL1 - LOKAL2 - LOKAL3, followed by 2 diagnoses, had a support
of 5.1%). This narrowing-down of the medical problem by an aspect
of its surface symptoms (localization on the body) helps the user
to identify one approximately correct diagnosis and to find the
correct one, or further ones, by retaining the focus on symptoms
and finding further diagnoses by following the
differential-diagnosis links. This means that, in particular,
non-expert users can focus on surface features that have meaning
in the domain (and acquire some medical knowledge in the process).
Alphabetical search, on the other hand, leads to lists of diseases
that are not narrowed down by domain constraints, but only by
their name starting with the same letter. Navigation choices may
be wrong due to a mistaken memory of the disease's name. This
requires backtracking to the list-of-diseases page and the choice
of a similarly named diagnosis.
This interpretation is supported by findings from a study of
navigation in the same site in which participants specified
whether they were physicians or patients. Content search was
preferred by patients most often, whereas physicians used
alphabetical search or the search engine more often.
Investigation of the most frequent individual patterns in
localization-based search showed that the combination of the human
head as the location of the illness and a diagnosis which is
visually prominently placed on the result page was most frequent.
This possible effect of layout on navigation is one example of
patterns that can be detected by APIP but would have gone
unnoticed otherwise because the individual pattern itself was
below the statistical threshold.
Note that the temporal interpretation, as shown in the
top-to-bottom ordering in Fig. 4, is justified by site topology
(one cannot go from a diagnosis to coarser and coarser search
options, unless by backtracking). The temporal interpretation of
"backtracking and going to another diagnosis" is likewise
justified by site topology, but the left-to-right ordering of
diagnoses in the figure is of course arbitrary.
Note also that there may be an arbitrary number of steps between
the transition from the ALPH node in Fig. 4 to the first diagnosis
and the transition from the ALPH node to the second (or third)
diagnosis. This intentional abstraction from time serves to better
underline the "hub" nature of the ALPH page as a conceptual centre
of navigation.
7. CONCLUSIONS AND OUTLOOK
We have introduced a new mining problem, the APIP mining problem,
and the APIP algorithm that solves it. APIP uses a taxonomy and
searches for frequent patterns at an abstract level, but also
returns the individual subgraphs constituting them. This is
motivated by the context-induced interestingness of these
individual subgraphs: While they generally occur far more rarely
than a chosen support threshold requires, they are interesting as
instantiations of the frequent abstract patterns. APIP can mine
subgraphs at the item level and at the concept level; the latter
is often more adequate to find patterns of semantic recurrence.
One important open question is how to extend the fixed URL-concept
mapping to the flexibility provided by the taxonomy-including
algorithms of [23, 24, 11] without an explosion in the search
space and unclear semantics for the mining in concept space. The
experience with our data suggests that the standard mechanisms
would not be sufficient to solve this problem (in fact, the
problem of "overgeneralized" patterns [11] never occurred in our
data). Indeed, relying on the common statistical measures of
interestingness eliminates nearly all AP-frequent individual
patterns and therefore limits the expressive power of APIP
significantly. On the other hand, many of these IPs are very rare,
and their high number is a major limitation for the present
algorithm. Adequate measures of pattern interestingness need to be
defined in order to address this problem. We expect that,
depending on domain and analysis question, these methods may
involve limiting the class of patterns sought, more user
interaction for the determination of interesting abstract and
individual patterns, or heuristics such as parallel but different
support thresholds.
Another research direction concerns the structure of the patterns
found. Frequent subgraph mining becomes worthwhile if rings and
other cyclic patterns are frequent in the data. If most frequent
patterns are chains or trees (as was the case in the present
data), sequence and tree mining may be a more efficient choice for
analysis. This is exploited in [22], which searches for patterns
of increasing complexity and can significantly speed up mining.
Corresponding extensions of APIP will be investigated in future
work.
An inspection of single sessions in item and concept space
suggests yet a different approach. Rings of diagnoses in our log
often contained "other" contents such as navigation/search pages
in between. Thus, wildcard options as in [9] would be very useful
for identifying patterns.
Last but not least, interesting questions arise with the
increasing dynamics of the Web: when content changes, it suffices
to extend the URL-concept mapping, but what is to be done when
semantics evolve? One approach would be to use ontology mapping
techniques to make graph patterns comparable and to store more and
more abstracted representations of patterns as they move further
into the past.
8. REFERENCES
[1] Agrawal, R., Imielinski, T., & Swami, A. N. (1993).
Mining association rules between sets of items in large
databases. In Proc. ACM SIGMOD Int. Conf. Mgt. Data
(pp. 207-216).
[2] Berendt, B. (2002). Using site semantics to analyze,
visualize and support navigation. Data Mining and
Knowledge Discovery, 6(1):37-59.
[3] Berendt, B., Hotho, A., & Stumme, G. (2004). Usage
mining for and on the semantic web. In H. Kargupta et
al. (Eds.), Data Mining: Next Generation Challenges
and Future Directions. Menlo Park, CA: AAAI/MIT
Press.
[4] Berendt, B., & Spiliopoulou, M. (2000). Analysis of
navigation behaviour in web sites integrating multiple
information systems. The VLDB Journal, 9(1):56-75.
[5] Borgelt, C., & Berthold, M. R. (2002). Mining
molecular fragments: Finding relevant substructures of
molecules. In Proc. ICDM (pp. 51-58).
[6] Dai, H., & Mobasher, B. (2001). Using ontologies to
discover domain-level web usage profiles. In Proceedings
of the Second Semantic Web Mining Workshop at
PKDD 2001.
[7] Deshpande, M., Kuramochi, M., & Karypis, G. (2002).
Automated approaches for classifying structures. In
Proc. BIOKDD'02 (pp. 11-18).
[8] Eirinaki, M., Vazirgiannis, M., & Varlamis, I. (2003).
Sewep: Using site semantics and a taxonomy to enhance
the web personalization process. In KDD2003 (pp.
99-108).
[9] Hofer, H., Borgelt, C., & Berthold, M. R. (2003). Large
scale mining of molecular fragments with wildcards. In
Advances in Intelligent Data Analysis V (pp. 380-389).
LNCS.
[10] Huan, J., Wang, W., & Prins, J. (2003). Efficient
mining of frequent subgraphs in the presence of
isomorphisms. In Proc. ICDM (pp. 549-552).
[11] Inokuchi, A. (2004). Mining generalized substructures
from a set of labeled graphs. In Proc. ICDM'04 (pp.
414-418).
[12] Inokuchi, I., Washio, T., & Motoda, H. (2000). An
apriori-based algorithm for mining frequent
substructures from graph data. In Proc. PKDD 2000
(pp. 13-23).
[13] Inokuchi, I., Washio, T., Nishimura, K., & Motoda, H.
(2002). A fast algorithm for mining frequent connected
subgraphs. Research Report, IBM Research, Tokyo.
[14] Jin, X., Zhou, Y., & Mobasher, B. (2004). A unified
approach to personalization based on probabilistic latent
semantic models of web usage and content. In [21].
[15] Kralisch, A., & Berendt, B. (2004). Cultural
determinants of search behaviour on websites. In Proc.
IWIPS 2004 (pp. 61-74).
[16] Kralisch, A., Eisend, M., & Berendt, B. (2005).
Impact of culture on website navigation behaviour. In
Proc. HCI International 2005.
[17] Kuramochi, M., & Karypis, G. (2001). Frequent
subgraph discovery. In Proc. ICDM (pp. 313-320).
[18] McEneaney, J. E. (2001). Graphic and numerical
methods to assess navigation in hypertext. Int. J. of
Human-Computer Studies, 55, 761-786.
[19] Meinl, Th., Borgelt, Ch., & Berthold, M. R. (2004).
Mining fragments with fuzzy chains in molecular
databases. In Proc. Workshop W7 on Mining Graphs,
Trees and Sequences at ECML/PKDD 2004 (pp. 49-60).
[20] Meo, R., Lanzi, P. L., Matera, M., & Esposito, R.
(2004). Integrating web conceptual modeling and web
usage mining. In Proc. of the WebKDD Workshop on
Web Mining and Web Usage Analysis (pp. 105-115).
[21] Mobasher, B., Anand, S. S., Berendt, B., & Hotho, A.
(Eds.) (2004). Semantic Web Personalization. Papers
from the AAAI Workshop. Technical Report WS-04-09.
Menlo Park, CA: AAAI Press.
[22] Nijssen, S., & Kok, J. N. (2004). A quickstart in
frequent structure mining can make a difference. In
Proc. SIGKDD'04 (pp. 647-652).
[23] Srikant, R., & Agrawal, R. (1995). Mining generalized
association rules. In Proceedings of the 21st International
Conference on Very Large Databases (pp. 407-419).
[24] Srikant, R., & Agrawal, R. (1996). Mining sequential
patterns: Generalizations and performance
improvements. In Proc. EDBT (pp. 3-17).
[25] Tauscher, L., & Greenberg, S. (1997). Revisitation
patterns in World Wide Web navigation. In Proc.
CHI'97.
[26] Yan, X., & Han, J. (2002a). gSpan: Graph-based
substructure pattern mining. In Proc. ICDM (pp.
51-58).
[27] Yan, X., & Han, J. (2002b). gSpan: Graph-based
substructure pattern mining. Technical Report
UIUCDCS-R-2002-2296, Dept. of Computer Science,
Univ. of Illinois at Urbana-Champaign.
[28] Zaïane, O. R., Xin, M., & Han, J. (1998). Discovering
web access patterns and trends by applying OLAP and data
mining technology on web logs. In Proc. ADL'98 (pp.
19-29).
[29] Zaki, M. J. (2002). Efficiently mining trees in a forest.
In Proceedings of SIGKDD'02.
[30] Zaki, M. J., Parthasarathy, S., Ogihara, M., & Li, W.
(1997). New algorithms for fast discovery of association
rules. In Proc. 3rd Intl. Conf. KDD (pp. 283-296).
AAAI Press.