The semantics of frequent subgraphs: Mining and navigation pattern analysis

Bettina Berendt
Institute of Information Systems, Humboldt University Berlin
Spandauer Str. 1
D-10178 Berlin, Germany
http://www.wiwi.hu-berlin.de/berendt
ABSTRACT
The search for frequent subgraphs is a useful extension of common
approaches in Web mining. For example, it allows the study of
revisitation patterns in Web usage and the discovery of richer
navigation structures such as "landmarks" or "hubs" that serve to
organize a user's conceptual map of a site or a part of the Web. Any use
of graph structures in Web usage mining, however, should also take into
account that it is essential to integrate background knowledge into the
analysis, and that behaviour must be studied at different levels of
abstraction. To capture these needs, we propose to use taxonomies in
mining and to extend the standard notions of interestingness
(frequency/support) by the notion of context-induced interestingness.
The AP-IP mining problem then consists of finding all frequent abstract
patterns and the individual patterns that constitute them and are
therefore interesting in this context (even though they may be
infrequent). The paper presents the AP-IP algorithm, which uses a
taxonomy to search for the abstract and individual patterns. We also
show that the search for label-abstracted but isomorphic subgraphs does
not always give an accurate image of navigation strategies, and we
develop a procedure for mining at the concept level to solve this
problem. A case study of a real-life Web site shows the advantages of
the proposed solutions.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - data mining;
H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia -
navigation, user issues
Keywords
Web mining, graph mining, integrating Web content and semantics into
Web mining
Permission to make digital or hard copies of all or part of this work
for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage
and that copies bear this notice and the full citation on the first
page. To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee.
WebKDD'05, August 21, Chicago, Illinois, USA.
Copyright 2005 ACM, ISBN: 1-59593-214-3, $5.00.
1. INTRODUCTION
Background knowledge is an invaluable help in the mining of Web data.
Examples include the use of ontologies for text mining and the
exploitation of anchor texts for crawling, indexing, and ranking. In Web
usage mining, background knowledge is needed because the raw data
consist of administration-oriented URLs, whereas site owners and
analysts are interested in events in the application domain.
In the past years, various algorithms for including background knowledge
in mining have been proposed. In particular, the use of taxonomies in
association rule, sequence, and graph mining has been investigated. In
Web usage mining, taxonomies describe the application events by concepts
that abstract from specific URLs and that form one or more concept
hierarchies.
In the raw data, each individual set, sequence, or graph of items is
generally very infrequent. When relying on statistical measures of
interestingness like support and confidence, the analyst either finds no
patterns or a huge and unmanageable number. Taxonomies solve this
problem and generally produce a much more manageable number of patterns
with high values of the statistical measures. In addition, they allow
the analyst to add semantic measures of interestingness into the mining
process.
A number of powerful and efficient algorithms and tools exist that can
mine patterns determined by a wide range of measures. However, they
ignore an important source of interest: taxonomic context. What is
behind a frequent abstract pattern: what types of individual
items/behaviour constitute it, and how are they distributed (e.g.,
equally distributed or obeying Zipf's law)? To make patterns with
interestingness induced by the context of their abstraction visible, a
mining tool must support a "drill-down" into the frequent patterns it
has found, and it should provide simultaneous detail-and-context views
on patterns.
This notion of detail and context has a number of applications that
include a deepened understanding of data and a highly sensitive
monitoring of temporal changes in a volatile Web. By allowing the
analyst to focus on semantically interesting areas (abstract patterns),
it can support the identification of patterns like an abstract-level
stability with a simultaneous individual-level drift, even if this drift
remains below the chosen statistical thresholds, or if this drift
consists of a change in the distribution of the individual patterns that
constitute the abstract pattern. Examples are variations induced by
seasonal influences or changes in a site's catalog. On the other hand,
there may also be variations on the abstract level, such as changing
search behaviour due to the increasing Internet sophistication of a
site's users.
Such detailed analysis of patterns becomes particularly interesting when
patterns reveal a lot of structure, as in sequence or in graph mining.
Viewing Web navigation as a graph is interesting because it gives
insight into the "mental map" of a site created by its users: What pages
serve as orientation points to organize the site, which areas are
traversed in linear fashion? Do users cycle between content, get lost in
dead-end streets, or do they return to previously visited content in
order to start new and better-informed searches from there? Last but not
least, important differences between user groups (e.g., novices and
experts) can be found by looking at their navigation graphs. User
movements can also be seen as votes for content relevance, thus helping
mining and ranking based on the hyperlink graph structure (such as
recent PageRank extensions).
The algorithm described in the present paper provides a solution to
these problems. It combines the need to mine at abstract levels to find
statistically frequent patterns and the interest in what is behind those
patterns, by giving the possibility of "drilling down" into the abstract
patterns in order to get information about their most typical individual
instances. To do so, it introduces a new definition of abstract
patterns, individual patterns, and the relation between them. The
algorithm is implemented in the AP-IP tool that visualizes the results
as context and detail. In addition, we argue that a conceptual view of
navigation can lead to patterns that have a different graph structure
and are more informative, but that cannot be found by traditional mining
in item space. We therefore extend the scheme to mining in concept
space.
The paper is organized as follows: Section 2 gives a short overview of
related work. Section 3 describes the algorithm. Section 4 describes
selected results from a case study of navigation data from an
information site, and it motivates the mining in concept space that is
described in more detail in Section 5 and applied to the case study in
Section 6. Section 7 concludes the paper and gives an outlook on future
research.
2. RELATED WORK
The problem of frequent subgraph mining has been studied intensively in
the past years, with many applications in bioinformatics/chemistry,
computer vision and image and object retrieval, and Web mining.
Algorithms are based on the a priori principle stating that a graph
pattern can only be frequent in a set of transactions if all its
subgraphs are also frequent. Frequent subgraphs are generated during a
breadth-first search following the Apriori algorithm [1] or in a
depth-first fashion following Eclat [30]. The main problem of frequent
subgraph mining is that two essential steps involve expensive
isomorphism tests: First, during candidate generation the same graph may
be generated multiple times and has to be compared to already generated
candidates; second, the candidates have to be embedded in the
transactions to compute support.
Several algorithms have been developed to discover arbitrary connected
subgraphs. One research direction has been to explore ways of computing
canonical labels for candidate graphs to speed up duplicate-candidate
detection [17, 26]. Various schemes have been proposed to reduce the
number of generated duplicates [26, 10]. To deal with the embedding
problem (NP-complete in the general form), [7] have proposed to reduce
the number of embeddings, while [5] store them to allow a fast parallel
generation of new candidates.
Other algorithms constrain the form of the sought subgraphs: [12, 13]
search for induced subgraphs (a pattern can only be embedded in a
transaction if its nodes are connected by the same topology in the
transaction as in the pattern). [22] exploit the observation that most
patterns are rather simple, searching first for sequences, then for
trees, and then for cyclic graphs.
Going beyond exact matching, [9] preprocess their data to find rings
which they treat as special features and introduce wildcards for node
labels, and [19] find "fuzzy chains", sequences with wildcards.
The qualitative and quantitative analysis of single users' navigation
graphs has a long tradition (cf. the transfer of Web graph metrics to
navigation graphs [18]), but the analysis of (general) graph patterns
via mining many users' navigation graphs has received less attention. A
number of studies have employed tree mining (e.g., [29]) and extensions
of sequence mining [2]. The latter allows one to detect patterns of
repeated visits to pages/concepts and cycling behaviour. The former
allows one to detect "hubs" of navigation in the sense of [25]: centers
of breadth-first searches.
Taxonomies have been used to find patterns at different levels of
abstraction for differently-structured data. In [23, 24], frequent
itemsets/association rules and frequent sequences are identified at the
lowest possible levels of a taxonomy. Concepts are chosen dynamically,
which may lead to patterns linking concepts at different levels of
abstraction (e.g., "People who bought 'Harry Potter 2' also bought books
by Tolkien"). In [11], this idea was transferred to frequent induced
subgraph mining.
In Web usage mining, the use of concepts abstracting from URLs is common
(see [3] for an overview and the papers in [21] for recent examples of
the uses for personalization). The general idea is to map a URL to the
most significant contents and/or services that are delivered by this
page, expressed as concepts. One page may be mapped to one or to a set
of concepts. For example, it may be mapped to the set of all concepts
and relations that appear in its query string. Alternatively, keywords
from the page's text and from the pages linked with it may be mapped to
a domain ontology, with a general-purpose ontology like WordNet serving
as an intermediary between the keywords found in the text and the
concepts of the ontology [8].
After the preprocessing steps in which access data are mapped into
taxonomies, subsequent mining techniques can use these taxonomies
statically or dynamically. In static approaches, mining operates on
concepts at a chosen level of abstraction; each request is mapped to
exactly one concept or exactly one set of concepts. This approach is
usually combined with interactive control of the software, so that the
analyst can re-adjust the chosen level of abstraction after viewing the
results (e.g., in the miner WUM; see [4] for a case study).
Alternatively, the taxonomy is used dynamically as in [23, 24, 11].
The subsumption hierarchy of an existing ontology can be employed for
the simultaneous description of user interests at different levels of
abstraction as in [20]. In [6], a scheme is presented for aggregating
towards more general concepts when an explicit taxonomy is missing.
Clustering of sets of sessions identifies related concepts at different
levels of abstraction. Sophisticated forms of content extraction are
made possible by latent semantic models [14]. In [8], association rules,
taxonomies, and document clustering are combined to generate
concept-based recommendation sets.
Some approaches use multiple taxonomies related to OLAP: objects (in
this case, requests or requested URLs) are described along a number of
dimensions, and concept hierarchies or lattices are formulated along
each dimension to allow more abstract views [28].

Algorithm AP-IP(D, T, m, σ, K) (AP-IP frequent subgraph mining)
 1: CIP_1 ← ∅, CAP_1 ← ∅
 2: for each edge e in D do
 3:     cip_1 ← {e}
 4:     CIP_1 ← CIP_1 ∪ {cip_1}
 5:     apip-gen(cip_1)
 6: FAP_1 ← {cap_1 ∈ CAP_1 : |⋃_{cip_1 ∈ cap_1.IPs} cip_1.TID| ≥ σ}
 7: FIP_1 ← {cip_1 ∈ CIP_1 : ∃ fap_1 ∈ FAP_1 : cip_1 ∈ fap_1.IPs}
 8: k ← 2
 9: while FAP_{k-1} ≠ ∅ and k ≤ K do
10:     CIP_k ← ∅, CAP_k ← ∅
11:     for each fip_{k-1} ∈ FIP_{k-1} do
12:         for each fip_1 ∈ FIP_1 that can be grown from fip_{k-1} as described in [26] do
13:             cip_k ← fip_{k-1} ∪ fip_1
14:             if cip_k has support > 0 & it is not automorphic to any element of CIP_k then
15:                 CIP_k ← CIP_k ∪ {cip_k}
16:                 apip-gen(cip_k)
17:     FAP_k ← {cap_k ∈ CAP_k : |⋃_{cip_k ∈ cap_k.IPs} cip_k.TID| ≥ σ}
18:     FIP_k ← {cip_k ∈ CIP_k : ∃ fap_k ∈ FAP_k : cip_k ∈ fap_k.IPs}
19:     k ← k + 1

Procedure apip-gen(cip_k) (for simplicity of notation, CIP_k, CAP_k, T, and m are assumed global)
 1: cap_k ← abstraction(cip_k, T, m)
 2: for each g ∈ CAP_k s.t. g and cap_k have the same number of nodes and edges and the same node labels and degrees do
 3:     if cap_k is a known automorphism of g then
 4:         g.IPs ← g.IPs ∪ {cip_k}
 5:         return
 6: for each g ∈ CAP_k s.t. g and cap_k have the same number of nodes and edges and the same node labels and degrees do
 7:     if cap_k.cl = g.cl then
 8:         g.IPs ← g.IPs ∪ {cip_k}
 9:         return
10: cap_k.IPs ← {cip_k}
11: CAP_k ← CAP_k ∪ {cap_k}

3. AP-IP FREQUENT SUBGRAPH MINING
Based on the requirements described in the introduction, we formulate
the AP-IP mining problem: Given a dataset of graph transactions $D$
consisting of items with labels from a set $I$, a taxonomy $T$
consisting of concepts from a set $C$, a mapping $m: I \mapsto C$, and a
minimum support $\sigma \in [0, 1]$, find all frequent abstract
subgraphs and their corresponding AP-frequent individual subgraphs.
Definition 1. An abstract subgraph is a connected graph $G = (V, E)$
with node labels given by $l^a_G: V \mapsto C$. A frequent abstract
subgraph is one that can be embedded in at least $\sigma |D|$
transactions.
An AP-frequent individual subgraph of $G$ is defined as follows: (a) It
is a graph $G' = (V', E')$ with node labels given by
$l^i_{G'}: V' \mapsto I$. (b) There exists a frequent abstract subgraph
$G$ such that the graph $G'' = (V', E')$ with node labels given by
$m: I \mapsto C$ with $m \circ l^i_{G'}(v') = m(l^i_{G'}(v'))$, is an
automorphism of $G$. $G''$ is also called the abstraction of $G'$ with
respect to $T$ and $m$. (c) $G'$ is automorphic to at least one subgraph
of at least one transaction.
$G$ can be embedded in a transaction $d \in D$ if it has an AP-frequent
individual subgraph $G'$ such that $G'$ is automorphic to at least one
subgraph of $d$.
Frequent subgraphs are also called "patterns", or APs (abstract
patterns) and IPs (individual patterns).
An example shows the role of abstract and individual patterns: If 10% of
the transactions contain the chain (Harry Potter 1, Lord of the Rings),
5% contain (Harry Potter 2, The Hobbit), and 7% contain (Harry Potter 1,
Lord of the Rings, The Hobbit, Harry Potter 2) (all sets disjoint), and
the minimal support is 20%, then (Rowling book, Tolkien book) has
support 22% and is a frequent abstract pattern. Its AP-frequent
individual patterns are (Harry Potter 1, Lord of the Rings) (17%) and
(Harry Potter 2, The Hobbit) (12%).
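The percentages in this example can be re-derived with a short sketch. It assumes 100 hypothetical transactions whose IDs are chosen so that the three groups are disjoint; the variable names are illustrative and not part of AP-IP.

```python
# Hypothetical data: 100 transactions, IDs chosen so the three groups are disjoint.
n_transactions = 100
only_hp1_lotr = set(range(0, 10))      # 10%: (Harry Potter 1, Lord of the Rings)
only_hp2_hobbit = set(range(10, 15))   # 5%: (Harry Potter 2, The Hobbit)
four_book_chain = set(range(15, 22))   # 7%: the chain through all four books

# The four-book chain contains both two-book chains as subgraphs.
tid_hp1_lotr = only_hp1_lotr | four_book_chain
tid_hp2_hobbit = only_hp2_hobbit | four_book_chain

# The abstract pattern is embedded wherever at least one of its IPs is embedded.
tid_abstract = tid_hp1_lotr | tid_hp2_hobbit

support_hp1_lotr = len(tid_hp1_lotr) / n_transactions      # 17 of 100
support_hp2_hobbit = len(tid_hp2_hobbit) / n_transactions  # 12 of 100
support_abstract = len(tid_abstract) / n_transactions      # 22 of 100
```

Neither individual pattern reaches the 20% threshold by itself; both are reported only because their common abstraction does.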
The algorithm to solve this problem, AP-IP, is shown in the table above.
The following notation is used: a k-(sub)graph $g_k$ is a (sub)graph
with $k$ edges. $CIP_k$ ($CAP_k$) is a set of IP (AP) candidates with
$k$ edges. $FAP_k$ ($FIP_k$) is the set of frequent abstract subgraphs
(AP-frequent individual subgraphs) with $k$ edges. $cip_k$, $cap_k$,
$fip_k$, $fap_k$ are elements of these sets. $K$ is the maximal number
of edges in patterns sought. For simplicity of notation, we assume that
$K \geq 2$. $g.cl$ is the canonical label of $g$. $g.IPs$ are the
AP-frequent individual subgraphs of $g$. $g.TID$ is the transaction-ID
list of $g$, i.e. the ordered list of transactions that contain $g$.
AP-IP uses and extends various ideas to achieve a fast processing of
graphs:
Search space traversal. The search for patterns proceeds breadth-first
(see AP-IP, step 9./19.). This simplifies the search for possible
duplicate APs (see below). Pattern search terminates when no more
patterns exceed the support threshold, or when the maximal sought
pattern size is exceeded.
IP candidate generation. In each round, individual-subgraph candidates
are generated: all subgraphs that occur at least once in the data. For
k = 1, these are all single edges in the dataset (AP-IP, 3., 4.). For
larger k, a pattern candidate is any set of k edges that can be grown
from a frequent pattern of size k - 1 by extending it by an extra edge
(AP-IP, 12., 13., 15.), eliminating those candidates that have no
support in the data and those that have been discovered before (AP-IP,
14.).
To reduce the number of generated candidates, frequent patterns are
extended by one edge at a time ("growing"), and growing occurs only from
the rightmost path [26]. For a proof of completeness, see [27].
Individual-level bijective label functions. The IP candidates are
treated as sets of edges because it is assumed that the graph of each
transaction in $D$ has a bijective label function $l^i$.
This assumption is motivated by the domain: Graph analysis only becomes
meaningful if each node is treated as a unique "location in hyperspace"
which can be revisited. Thus, in any transaction/session each potential
URL was either requested or not, so each transaction's nodes (edges) are
a subset of $I$ ($I \times I$).
This assumption implies that AP-IP can be transferred to other domains
in which this unique-names assumption also holds. So for example, AP-IP
could be used for texts dealing with proper names, but not for most
molecular-mining tasks. AP-IP could easily be extended to also discover
patterns in transactions with a non-bijective label function, but it
would require the same automorphism tests for duplicate detection as
those carried out for APs, and it would change the tests for embeddings
in the data. The effects on runtime remain to be investigated.
IP candidate duplicate removal. Given the bijective label function, an
IP's lexicographically sorted edges, each labeled by its node pair,
define a canonical form that can be computed efficiently (list sorting
in AP-IP, 13.) and allows for efficient duplicate detection (comparison
of two ordered lists, AP-IP, 14.).
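The sorted-edge canonical form can be sketched as follows, assuming that edges are given as pairs of unique, comparable node labels such as URLs. The function names and the hash-set lookup are illustrative choices, not the paper's implementation.

```python
def canonical_form(edges):
    """Return a hashable canonical form of an undirected edge set with unique node labels."""
    # Order the two endpoints of every edge, drop repeated edges, then sort the edges.
    return tuple(sorted({tuple(sorted(edge)) for edge in edges}))


seen_forms = set()


def is_new_candidate(edges):
    """Register a candidate and report whether its canonical form is new."""
    form = canonical_form(edges)
    if form in seen_forms:
        return False
    seen_forms.add(form)
    return True


first = is_new_candidate([("a", "b"), ("b", "c")])
second = is_new_candidate([("c", "b"), ("b", "a")])  # same graph, edges listed differently
```

Because node labels are unique within a transaction, two candidates denote the same graph exactly when their canonical forms are equal, so no isomorphism test is needed at the individual level.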
IP candidate support counting and pruning (part I). Bijective label
functions also allow for efficient identification of embeddings: a graph
$G'$ is a subgraph of $d$ iff the canonical form of $G'$ is a subset of
the canonical form of $d$. Each edge is initialized with a sorted list
of all transaction IDs that contain this edge (not shown in the
pseudocode), see for example [17]. Each IP candidate's TID list is then
the intersection of all of its edges' TID lists. This can be computed in
one step by intersecting two sorted lists: the TID lists of the first
$fip_{k-1}$ and $fip_1$ that are found to form this candidate (AP-IP,
13.).
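A minimal sketch of this one-step intersection, assuming that TID lists are ascending Python lists of integers without repeats; the function name and the sample lists are hypothetical.

```python
def intersect_sorted(a, b):
    """Intersect two ascending transaction-ID lists in one linear pass."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


# TID list of a (k-1)-edge pattern and of the single edge it is grown by.
candidate_tids = intersect_sorted([1, 3, 4, 7, 9], [2, 3, 7, 8])
```

An empty result means that the candidate has no support in the data and can be dropped immediately.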
Support pruning at the IP level filters out all IPs that would in the
best case not contribute support to their AP abstractions and in the
worst case lead to the generation of redundant, zero-support APs.
One structural and one semantic type of pattern were deemed
uninteresting for the domain and are filtered out to increase
efficiency: self-loops and edges involving items that are not mapped to
a concept by the taxonomy (in AP-IP, 3.; not shown in the pseudocode).
AP candidate generation and duplicate removal. Next, the AP candidate
corresponding to the identified individual pattern candidate is formed.
First, the individual node labels are mapped to their abstractions
according to the taxonomy $T$ and the concepts from $T$ that are chosen
by $m$ (apip-gen, 1.).
Usually, there is more than one IP that maps to an AP. To simplify
duplicate checking, each IP is mapped to and each AP is stored as a
sparse adjacency-list representation. To limit the number of tests, a
newly considered AP candidate is compared only to known candidates to
which it can be automorphic. This step relies on vertex invariants,
i.e. standard procedures from isomorphism testing (apip-gen, 2., 6.).
First, an equality test may show that the new candidate has already been
found in the same order of node labels and edges, i.e. that it is a
known automorphism (apip-gen, 3.; see also [17]). Alternatively, the new
candidate may be an as-yet unknown automorphism. To verify this, its
canonical label is computed (apip-gen, 7.). Isomorphism/automorphism
testing by canonical-label comparison is generally, but of course not
always, fast; see the literature in Section 2. The canonical form is
determined by the maximum label sequence and the maximum upper
triangular of the adjacency matrix as in [13]. In either case, the IP is
added to the set of IPs contributing to the AP candidate (apip-gen,
4./5., 8./9.). The graph $cap_k$ is only registered as a new candidate
if both tests failed (apip-gen, 10., 11.).
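The grouping of IPs under a common abstraction can be illustrated with a brute-force sketch. It is not the canonical form of [13]: for simplicity it takes the lexicographically smallest (label sequence, edge list) pair over all node orderings, which is only feasible for very small patterns, and it skips the vertex-invariant prefilter. All function names and the concept mapping are hypothetical.

```python
from itertools import permutations


def abstraction(ip_edges, concept_of):
    """Relabel an IP by concepts; nodes stay distinct even when concepts repeat."""
    nodes = sorted({node for edge in ip_edges for node in edge})
    labels = {node: concept_of[node] for node in nodes}
    return nodes, labels, [tuple(edge) for edge in ip_edges]


def canonical_label(nodes, labels, edges):
    """Smallest (label sequence, edge list) over all node orderings (brute force)."""
    best = None
    for order in permutations(nodes):
        position = {node: index for index, node in enumerate(order)}
        label_seq = tuple(labels[node] for node in order)
        edge_seq = tuple(sorted(tuple(sorted((position[u], position[v]))) for u, v in edges))
        key = (label_seq, edge_seq)
        if best is None or key < best:
            best = key
    return best


def group_by_abstraction(ips, concept_of):
    """Collect the IPs that share an abstract pattern, keyed by canonical label."""
    groups = {}
    for ip in ips:
        label = canonical_label(*abstraction(ip, concept_of))
        groups.setdefault(label, []).append(ip)
    return groups


concept_of = {"hp1": "Rowling", "hp2": "Rowling", "lotr": "Tolkien", "hobbit": "Tolkien"}
ips = [[("hp1", "lotr")], [("hobbit", "hp2")], [("lotr", "hobbit")]]
groups = group_by_abstraction(ips, concept_of)
```

The first two IPs fall under the same (Rowling, Tolkien) edge although their node labels were listed in different orders, while the third forms its own (Tolkien, Tolkien) pattern.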
This procedure ensures that all abstract pattern candidates are found:
(a) all individual pattern candidates with at least one embedding in the
data are found (see above), (b) each individual pattern has a unique
abstraction, and (c) all individual patterns that map to the same AP are
collected into that AP's IPs set by the automorphism tests in apip-gen.
Support counting and pruning. The AP can be embedded in a transaction if
and only if at least one of its IPs can be embedded in that transaction,
which leads to the computation of its support as the cardinality of the
union of the IPs' TID lists (AP-IP, 6./17.).
The candidate APs with sufficient support values become frequent
abstract patterns (AP-IP, 6./17.). According to the definition of
AP-frequent individual patterns, only those cip that map to an abstract
pattern now identified as frequent are transferred to the set of
frequent individual patterns (AP-IP, 7./18.).
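These two steps can be sketched on TID sets alone. The sketch assumes that support is measured as a fraction of |D|, as in Definition 1; the function name, the pattern identifiers, and the sample data are hypothetical.

```python
def frequent_patterns(ap_to_ips, ip_tids, n_transactions, min_support):
    """Return the frequent APs (with support) and the AP-frequent IPs, using TID sets only."""
    frequent_aps = {}
    frequent_ips = set()
    for ap, ips in ap_to_ips.items():
        # An AP is embedded wherever at least one of its IPs is embedded.
        tids = set().union(*(ip_tids[ip] for ip in ips))
        support = len(tids) / n_transactions
        if support >= min_support:
            frequent_aps[ap] = support
            # Every IP of a frequent AP is kept, however rare it is by itself.
            frequent_ips.update(ips)
    return frequent_aps, frequent_ips


ap_to_ips = {"c1-c2": ["u1-u5", "u2-u6"], "c2-c2": ["u5-u6"]}
ip_tids = {"u1-u5": {0, 1, 2}, "u2-u6": {2, 3}, "u5-u6": {4}}
frequent_aps, frequent_ips = frequent_patterns(ap_to_ips, ip_tids, 10, 0.4)
```

Here "c1-c2" reaches 4 of 10 transactions through the union of its IPs' TID sets, so both of its IPs are reported, while "c2-c2" and its single IP are pruned.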
Support counting and pruning are efficient because they operate entirely
on the TID lists; no access to the data and no subgraph embedding are
required.
This completes one round; the next largest patterns can now be examined
(AP-IP, 8./19.).
Memory management. All (AP-)frequent patterns are output after they have
been computed. The AP-frequent individual patterns of size 1 must be
kept in memory because they are needed in each iteration (AP-IP, 12.),
and the AP-frequent individual patterns of size k - 1 are needed for
growing (AP-IP, 11.). All other patterns are discarded to save memory.
AP-IP also implements a frequency threshold (≥ 1) instead of the support
threshold mentioned above. The extension is straightforward because the
support test operates on transaction-ID lists anyway; it is therefore
not shown.
Figure 1: Top (a), (b): Impact of data set size and number of patterns in isolation; pS = 0.8; pL = 0.05; minsupp = 0.05; Bottom (c), (d): Combined impact of data set size and number of patterns; same parameters.
In the current version, AP-IP treats only undirected graphs with node
labels. The generalization to directed as well as node-and-edge labeled
graphs is straightforward.
In principle, AP-IP could also work "backwards", by growing at the AP
level and then instantiating to IPs. This would correspond to
taxonomy-free frequent subgraph mining algorithms. However, this would
require an investigation of at least the same number of IPs (since all
of them need to be generated as instantiations, and possibly further
ones that also match a grown AP), and it would require either
sophisticated bookkeeping of the growing history or an isomorphism check
with the AP from which an IP candidate was generated. Simulation results
indicate that this guaranteed additional effort is not offset by any
savings in candidate generation and isomorphism testing at the AP level.
Time and space requirements. AP-IP's worst-case time and space
requirements are determined by (a) the size of the dataset, (b) the
number of patterns (abstract and individual), and the (c) diversity and
(d) length of patterns. Other factors have indirect influences on these
magnitudes; in particular, a lower support threshold or a larger concept
branching factor generally leads to more, more diverse, and/or longer
patterns. However, more data also often lead to more patterns, and more
patterns are often longer patterns.
Described briefly, (a) influences time and space requirements linearly
because AP-IP operates only on the sorted TID lists. (b) has a linear
influence on time (CIP canonical form construction: adding an edge to a
sorted list; CIP duplicate checking: lookup in a hashtable indexed by
canonical form; CIP abstraction by mapping each node). The influence on
space is obvious because patterns are all that is stored (apart from TID
lists). The influence of the number of abstract patterns is governed by
(c) and (d). (c) can imply additional time during AP duplicate
detection, namely when many "similarly-looking" APs (apip-gen, 2.)
exist, and when IPs are diverse so that many different automorphisms
occur and require a canonical-form computation. In principle, the number
and type of candidate patterns are decisive; in line with the literature
and the focus on data and patterns, we will however treat the
(AP-)frequent patterns as an approximation. (d) has a linear effect on
space requirements and on the IPs' time requirements, but can require
more than linear time because in AP canonical form construction, longer
patterns generally require the study of more permutations.
To investigate actual performance, simulated and real data were
analyzed. For reasons of space, we focus on performance with respect to
simulated data only in this section and on application-oriented results
with respect to real data in the next section. (Main performance results
were similar.)
AP-IP was implemented in Java and run on a PC with a Pentium 4 3GHz
processor and 1GB main memory under Debian Linux. The simulated data
were generated by a first-order Markov model (a popular model of Web
navigation) varying the number of transactions/sessions, the branching
factor b of the concepts in a one-level taxonomy (and thus the number of
patterns found in a total of 100 "URLs"), the diversity of patterns
(each node had a parametrized transition probability pS to one other,
randomly chosen node and an equal distribution of transition
probabilities to all other nodes), and the length of patterns (the
transition probability pL to "exit": the average session length becomes
1/pL, and pattern length increases with it).
Figure 2: Impact of pattern length: the steep curve in (b) belongs to the long patterns in (a).
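A generator of this kind can be sketched as follows. The sketch is an assumption-laden reading of the description above, not the generator used for the reported experiments: the start node is drawn uniformly, the remaining probability mass 1 - pS - pL is spread evenly over all nodes other than the current and the preferred one, and all names are hypothetical.

```python
import random


def make_model(n_urls, rng):
    """Pick, for every node, one preferred successor that differs from the node itself."""
    return {u: rng.choice([v for v in range(n_urls) if v != u]) for u in range(n_urls)}


def simulate_session(preferred, p_s, p_l, rng):
    """Generate one session (list of node IDs) from the first-order Markov model."""
    n_urls = len(preferred)
    current = rng.randrange(n_urls)  # assumption: uniform start node
    session = [current]
    while True:
        r = rng.random()
        if r < p_l:                  # exit with probability p_l
            return session
        if r < p_l + p_s:            # preferred successor with probability p_s
            current = preferred[current]
        else:                        # remaining mass spread evenly over the other nodes
            others = [v for v in range(n_urls) if v not in (current, preferred[current])]
            current = rng.choice(others)
        session.append(current)


rng = random.Random(0)
preferred = make_model(100, rng)
sessions = [simulate_session(preferred, 0.8, 0.05, rng) for _ in range(2000)]
mean_length = sum(len(s) for s in sessions) / len(sessions)
```

With pL = 0.05 the number of requests per session is geometrically distributed, so the mean length should be close to 1/pL = 20.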
First, the influence of data set size and number of patterns were
studied in isolation: 1000 sessions were duplicated to generate 2000
sessions with the same number of patterns, concatenated 3 times to
generate 3000 sessions, etc. Figure 1 (a) shows that time is linear in
data set size, and that the slope increases with the number of patterns
(induced by increasing the concept branching factor). Figure 1 (b) plots
the same data (plus one run with an intermediate branching factor) by
the number of patterns, keeping the number of sessions constant along
each line. It shows that in this setting, time is sublinear in the
number of patterns. Figure 1 (c) shows the combined effects. The thick
line illustrates the effect of adding 1000 more, 2000 more, etc.
different sessions, while the thin lines show the effect of only
duplicating the data set (as in (a)).
Figure 1 (d) plots the additional time (= the vertical difference
between thin-line endpoints and the thick line in (c)) against
additional patterns, revealing a linear effect.
Similar graphs are obtained for a wide variety of parameter settings,
and also for real data. However, time requirements increase strongly
when patterns get bigger: Fig. 2 (b) shows that the impact of additional
patterns rises more strongly for big than for small patterns (as =
average size/number of edges), which leads to a slightly above-linear
increase in time with respect to data set size. Since Web navigation
data generally exhibit "short" or "medium" patterns, this problem
concerns them only rarely.
The program required substantial memory (up to 500MB for the large
patterns in Fig. 2), which is very undesirable and partly due to Java's
incomplete garbage collection. We plan a reimplementation in C++ to
substantially improve both memory consumption and the absolute level of
runtimes.
4. CASE STUDY (1)
Data. Navigation data from a large and heavily frequented eHealth
information site were investigated. The site offers different searching
and browsing options to access a database of diagnoses (roughly
corresponding to products in an eCommerce product catalog). Each
diagnosis has a number of different information modes (textual
description, pictures, and further links), and diagnoses are hyperlinked
via a medical ontology (links go to differential diagnoses). The
interface is identical for each diagnosis, and it is identical in each
language that the site offers (at the time of data collection: English,
Spanish, Portuguese, and German).
The initial data set consisted of close to 4 million requests collected
in the site's server log between November 2001 and November 2002. In
order to investigate the impact of language and culture on (sequential)
search and navigation behaviour, the log was partitioned into accesses
by country, and the usual data preprocessing steps were applied [15,
16]. For the present study, the log partition containing accesses from
France was chosen as the sample log; the other country logs of the same
size exhibited the same patterns, differing only in the support values.
The sample log consisted of 20333 requests in 1397 sessions. The concept
hierarchy and two mappings m developed for [16] were used to ensure
comparability.
In the following, selected results from the analysis are reported that
outline the advantages of treating sessions and patterns as graphs
rather than as sets, bags, or sequences of requests. The focus is on
patterns with more than 2 nodes, since patterns with 2 nodes are by
definition chains and can also be found using sequence mining.
Basic statistics. The log is an example of concepts corresponding to a
high number of individual URLs (84 concepts with, on average, 82
individual URLs), and highly diverse behaviour at the individual level.
This is reflected in a high ratio of the number of IPs to APs (compared
to the simulated data above): At a support value of 0.2, there were 7
frequent APs of size 1-5 with 55603 AP-frequent IPs; at 0.1, the values
were: size 1-8, 34 APs, 414685 IPs; at 0.05, the values for K = 6 were:
104 APs, 297466 IPs. This search was not extended because no further
(semantically) interesting patterns were found.
Linear information gathering. With a minimal support of 0.2 or higher,
the only patterns with more than two nodes were chains of diagnoses.
These patterns were very frequent: a chain of 6 diagnoses had support
22.2% (5: 26.7%, 4: 32.7%, 3: 39.3%).
Lowering the support threshold showed that even longer chains were still
comparatively frequent (13.8% support for chains of 9 diagnoses) and
revealed that alphabetical search was the most frequent entry point to
these chains (alphabetical search followed by 1, 2, or 3 diagnoses had
19%, 15.5%, and 11.5% support).
Investigation of the most frequent individual patterns in alphabetical
search showed a possible effect of layout: the top-level page of
diagnoses starting with "A" was the most frequent entry point. While
this pattern is in itself not too interesting, it can help to filter out
accesses of "people just browsing" in order to focus further search for
patterns on more directed information search.
Figure 3: Visualization of abstract patterns, subpatterns, superpatterns, and individual patterns.
Figure 3 shows a screenshot of the AP-IP visualization. In
a preprocessing step, the patterns that only occurred once
in the data were filtered out, which explains the comparatively
low number of individual patterns associated with an
abstract pattern (in the example: 47). It turned out that
individual patterns with frequency = 1 constituted 99% of
all patterns.
A Closer Look at Linear Navigation Patterns. The
finding that people just "surfed through" long chains of
diagnoses is surprising and difficult to reconcile with the
knowledge about search behaviour collected in the direct
observations of users, who usually targeted a small number
of diseases for information collection. Also, the inspection
of the most frequent individual patterns involving several
diagnoses showed pairs or chains of two pictures of the same
disease.
An inspection of single sessions revealed that visitors often
return to the same content in the following sense: They
request one picture, or the picture overview, of a diagnosis,
then study a different diagnosis (e.g., a differential diagnosis),
and then return to read, in detail, the text of the first
diagnosis. In other words, although they visit 3 different
URLs, forming the abstract pattern "diagnosis – diagnosis
– diagnosis", they in fact gather information on the concepts
"diagnosis 1 – diagnosis 2 – diagnosis 1".
This non-linear behaviour cannot be identified in the current
framework. The next section will investigate the problem
and propose a solution.
5. MINING IN CONCEPT SPACE
The problem of item-level patterns vs. concept-level patterns
can be illustrated using the following example:
Consider two transactions $[a_1 - a_2 - a_3 - a_4]$ and
$[a_5 - a_6 - a_7 - a_8]$, and let the $a_i$ be instances of the
following concepts: $a_1, a_4 \mapsto b_1$, $a_2 \mapsto b_2$,
$a_3 \mapsto b_3$, $a_5, a_8 \mapsto b_4$, $a_6 \mapsto b_5$,
$a_7 \mapsto b_6$. All $b_j$ are instances of concept $c$. Let minsupp be 1.
Using the framework of [23,24] that is also the basis of
[11], the only graphs that are frequent are the chains $c - c$,
$c - c - c$, and $c - c - c - c$. This is because only graphs
that are isomorphic to subgraphs of the original transactions
can be patterns. (The problem does not occur in sequence
mining, but an analogous transfer of the algorithm in [24]
to the problem of subgraphs suggests that this isomorphism
to subgraphs of items is intended. In [11], the problem is
not discussed, but the article suggests that only isomorphic
graphs can be concept-hierarchy abstractions of parts of the
transactions.)
However, both first-level abstractions $b_1 - b_2 - b_3 - b_1$ and
$b_4 - b_5 - b_6 - b_4$ form a ring of three distinct sub-concepts
of $c$. Alternatively, they form rings of super-concepts of the
$a_i$ involved.
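The contrast between the two views can be checked mechanically. The following is a minimal sketch, not part of AP-IP itself: it treats each transaction as an undirected path (an assumption, since edge direction is not fixed by the example), and the helper names `merged_path_graph`, `degrees`, and `TO_B` are introduced here purely for illustration.

```python
def merged_path_graph(labels):
    """Build an undirected graph from a path of vertex labels.

    Vertices that carry the same label are merged into one vertex;
    self-loops created by merging are dropped.
    """
    vertices = set(labels)
    edges = {frozenset(pair) for pair in zip(labels, labels[1:])
             if pair[0] != pair[1]}
    return vertices, edges


def degrees(vertices, edges):
    """Number of incident edges per vertex."""
    return {v: sum(v in e for e in edges) for v in vertices}


# Item -> sub-concept mapping taken from the example (a_i -> b_j).
TO_B = {"a1": "b1", "a4": "b1", "a2": "b2", "a3": "b3",
        "a5": "b4", "a8": "b4", "a6": "b5", "a7": "b6"}

t1 = ["a1", "a2", "a3", "a4"]
t2 = ["a5", "a6", "a7", "a8"]

# Item space: the items stay distinct vertices, so t1 remains a chain.
item_vertices, item_edges = merged_path_graph(t1)

# Concept space: relabel with the sub-concepts, then merge equal labels.
ring1 = merged_path_graph([TO_B[a] for a in t1])
ring2 = merged_path_graph([TO_B[a] for a in t2])

print(len(item_vertices), len(item_edges))  # 4 vertices, 3 edges: a chain
print(sorted(ring1[0]), len(ring1[1]))      # 3 vertices, 3 edges: a ring
```

In the relabelled graphs every vertex has degree 2, which is what makes them rings; in item space the two end vertices have degree 1, so any label abstraction that preserves the graph structure can only yield chains.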
In analogy to the use of the term "item" in [23], we call the
standard form of mining with taxonomies "mining in item
space". We will refer to the alternative form as "mining in
concept space".
To be able to use AP-IP, we translate the problem "finding
all frequent concept-space abstract subgraphs in the set
of transactions $D$" into the equivalent problem "finding all
frequent item-space abstract subgraphs in the set of transactions
$D'$", with $D'$ given by
Definition 2. The concept space representation $D'$ of
a set of transactions $D$ is defined as follows: Let $r: I \mapsto C \cup I$
be a function s.t. for each $i$, either $r(i)$ is a descendant of
$m(i)$ in the taxonomy $T$ as described in Definition 1 (this
holds for at least one $i$), $r(i) = m(i)$, or $r(i) = i$. Let $D'$ be a
set of transactions obtained from $D$ by, for each transaction
$X$, (1) replacing the label function $l^i_X$ by $r \circ l^i_X$ and (2)
coarsening the graph $X$ such that all vertices with the same
label are merged and the label function is again bijective.
Figure 4: Nonlinear patterns found when mining in concept space.
During mining, the function $m$ in Definition 1 must be
replaced by a function $m'$ s.t. $m'(r(i)) = m(i)$. Different
levels of abstraction for mining in concept space can be defined
by varying the number of nodes on the paths between
$r(i)$ and $m(i)$. In the following example, we keep both functions
fixed, with a path distance of 1 for concepts of interest.
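The requirement on the replacement mapping can be made concrete with a small sketch. It assumes that both mappings are available as dictionaries keyed by item; the function name `derive_m_prime` and the toy data are illustrative and not taken from the AP-IP implementation. The sketch also makes explicit when no such mapping exists: if r merges two items that m sends to different concepts.

```python
def derive_m_prime(r, m):
    """Derive m' on the image of r such that m'(r(i)) == m(i) for all items i.

    r and m are dicts keyed by item. Raises ValueError when two items share
    a label under r but map to different concepts under m, because then no
    function m' with the required property exists.
    """
    m_prime = {}
    for item, label in r.items():
        concept = m[item]
        known = m_prime.setdefault(label, concept)
        if known != concept:
            raise ValueError(
                f"r merges items with different concepts: "
                f"{label!r} -> {known!r} and {concept!r}")
    return m_prime


# Toy data reusing the first transaction of the example in Section 5:
# r maps items to sub-concepts one level below the concept c given by m.
r = {"a1": "b1", "a4": "b1", "a2": "b2", "a3": "b3"}
m = {"a1": "c", "a2": "c", "a3": "c", "a4": "c"}

m_prime = derive_m_prime(r, m)
assert all(m_prime[r[i]] == m[i] for i in r)
```

Here r(i) is a child of m(i) for every item, which corresponds to the path distance of 1 mentioned above.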
When analyzing Web server log data, the transformation
to concept space representation is trivial; it requires only
one pass through the data to apply the mapping $r$, before
the subsequent input routine transforms the sequential log
format into graphs as in the item-space problem.
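A minimal sketch of such a single pass is given below. It is an illustration under several assumptions: the log is already sessionized into (session id, URL) records in request order, the URL scheme `/diag/<name>/...` and the label prefix `info:` are hypothetical, and edges are taken to be directed transitions between consecutive requests. For brevity the sketch fuses the relabelling with the graph construction, whereas the text describes them as two successive steps.

```python
from collections import defaultdict


def r(url):
    """Hypothetical mapping r: any page under /diag/<name>/ becomes the
    concept 'info:<name>'; every other URL is left unchanged."""
    parts = url.strip("/").split("/")
    if len(parts) >= 2 and parts[0] == "diag":
        return "info:" + parts[1]
    return url


def concept_space_graphs(records, r):
    """One pass over (session_id, url) records in log order.

    Returns {session_id: (vertices, edges)}, where vertices are the labels
    r(url) and edges are directed pairs of consecutive, distinct labels.
    Using sets merges all requests that share a label into one vertex.
    """
    last = {}
    vertices = defaultdict(set)
    edges = defaultdict(set)
    for session_id, url in records:
        label = r(url)
        vertices[session_id].add(label)
        previous = last.get(session_id)
        if previous is not None and previous != label:
            edges[session_id].add((previous, label))
        last[session_id] = label
    return {s: (vertices[s], edges[s]) for s in vertices}


# Hypothetical session: picture of one diagnosis, text of a second
# diagnosis, then the text of the first diagnosis.
log = [
    ("s1", "/diag/eczema/pic1"),
    ("s1", "/diag/psoriasis/text"),
    ("s1", "/diag/eczema/text"),
]
graphs = concept_space_graphs(log, r)
```

The three distinct URLs of the session collapse into two concept-space vertices connected in both directions, i.e. the revisit "diagnosis 1 – diagnosis 2 – diagnosis 1" that remains invisible in item space.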
6. CASE STUDY (2)
The log was analyzed again in concept space. The function
$r$ transformed all information types on a diagnosis X
into "information on X", and mapped the search functions
to their abstract values ("LOKAL1" = first step of localization
search, etc.). Remaining URLs were left unchanged.
Information gathering in concept space. The analysis
showed that most of the chains of diagnoses were in fact navigation
structures which involved revisits to "key" diagnoses
that served as "hubs" for navigation.
Chains of diagnoses lost support: a chain of 6 diagnoses
had support 7.2% (5: 9.2%, 4: 13%, 3: 18.9%). Instead, patterns
like diagnoses with 3 or more other diagnoses branching
off the "hub" diagnosis shown at the left of Fig. 4 (support
5.3%) became frequent. Rings also occurred at slightly
lower support thresholds (see Fig. 4, second from left).
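The three pattern shapes discussed here (chains, hubs with at least 3 spokes, rings) can be told apart by vertex degrees alone. The following sketch is an illustration rather than part of AP-IP; it assumes that a pattern is given as a list of undirected edges and that it is connected, as frequent subgraph patterns are. The function name `classify_pattern` is hypothetical.

```python
from collections import Counter


def classify_pattern(edges):
    """Classify a connected, undirected pattern as chain, ring, hub, or other.

    edges is an iterable of 2-tuples of vertex labels. Connectivity is
    assumed, not checked. Duplicate edges and self-loops are ignored.
    """
    unique = {frozenset(e) for e in edges if e[0] != e[1]}
    if not unique:
        return "other"
    degree = Counter(v for e in unique for v in e)
    n, k = len(degree), len(unique)
    top = max(degree.values())
    if k == n - 1 and top <= 2:
        return "chain"              # a path: no vertex has more than 2 neighbours
    if set(degree.values()) == {2}:
        return "ring"               # every vertex has exactly 2 neighbours
    if k == n - 1 and top == n - 1 and top >= 3:
        return "hub"                # one centre adjacent to all other vertices
    return "other"


print(classify_pattern([("d1", "d2"), ("d2", "d3")]))               # chain
print(classify_pattern([("d1", "d2"), ("d2", "d3"), ("d3", "d1")]))  # ring
print(classify_pattern([("h", "d1"), ("h", "d2"), ("h", "d3")]))    # hub
```

A hub-and-spoke session such as h, d1, h, d2, h, d3 produces exactly the star in the last call once all visits to h are merged into one vertex.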
Search options: linear vs. hub-and-spoke. Search in
concept space also showed the functioning of search options
more clearly.
First, the use of the search engine appeared only in very
few patterns (only in 2-node patterns: 4.2% for search
engine and a diagnosis, 3.5% for search engine and alphabetical
search; the latter is probably a subsequent switch to the second,
more popular search option). This was because the search
engine was less popular than the other search options (used
about 1/10 as often), but far more efficient in the sense that
searches generally ended after the first diagnosis found (assuming
that finding a diagnosis was the goal of all search
sessions). This is consistent with results from our other
studies of search behaviour [2].
The alphabetical search option generally prompted a "hub-and-spoke
navigation", as shown in the third example from
the left of Fig. 4 (support 6.4%). In contrast, localization
search generally proceeded in a linear or depth-first fashion,
as shown on the right of Fig. 4 (support 5%; with one
diagnosis less: 6.9%).
This may be interpreted as follows: Localization search
prompts the user to specify, on a clickable map, the body
parts that contain the sought disease. This is in itself a
search that can be refined (LOKAL1 – LOKAL2 in the figure;
a similar pattern of LOKAL1 – LOKAL2 – LOKAL3,
followed by 2 diagnoses, had a support of 5.1%). This
narrowing-down of the medical problem by an aspect of its
surface symptoms (localization on the body) helps the user
to identify one approximately correct diagnosis and to find
the correct one, or further ones, by retaining the focus on
symptoms and finding further diagnoses by following the
differential-diagnosis links. This means that, in particular,
non-expert users can focus on surface features that have
meaning in the domain (and acquire some medical knowledge
in the process).
Alphabetical search, on the other hand, leads to lists of
diseases that are not narrowed down by domain constraints,
but only by their name starting with the same letter. Navigation
choices may be wrong due to a mistaken memory of
the disease's name. This requires backtracking to the list-of-diseases
page and the choice of a similarly-named diagnosis.
This interpretation is supported by findings from a study
of navigation in the same site in which participants specified
whether they were physicians or patients. Content search
was preferred by patients most often, whereas physicians
used alphabetical search or the search engine more often.
Investigation of the most frequent individual patterns in
localization-based search showed that the combination of the
human head as the location of the illness and a diagnosis
which is visually prominently placed on the result page was
most frequent. This possible effect of layout on navigation is
one example of patterns that can be detected by AP-IP but
would have gone unnoticed otherwise because the individual
pattern itself was below the statistical threshold.
Note that the temporal interpretation, as shown in the
top-to-bottom ordering in Fig. 4, is justified by site topology
(one cannot go from a diagnosis to coarser and coarser
search options, unless by backtracking). The temporal interpretation
of "backtracking and going to another diagnosis"
is likewise justified by site topology, but the left-to-right ordering
of diagnoses in the figure is of course arbitrary.
Note also that there may be an arbitrary number of steps
between the transition from the ALPH node in Fig. 4 to
the first diagnosis and the transition from the ALPH node to
the second (or third) diagnosis. This intentional abstraction
from time serves to better underline the "hub" nature of the
ALPH page as a conceptual centre of navigation.
7. CONCLUSIONS AND OUTLOOK
We have introduced a new mining problem, the AP-IP
mining problem, and the AP-IP algorithm that solves it.
AP-IP uses a taxonomy and searches for frequent patterns
at an abstract level, but also returns the individual subgraphs
constituting them. This is motivated by the context-induced
interestingness of these individual subgraphs: While
they generally occur far less often than a chosen support
threshold requires, they are interesting as instantiations of the
frequent abstract patterns. AP-IP can mine subgraphs at the
item level and at the concept level; the latter is often more
adequate to find patterns of semantic recurrence.
One important open question is how to extend the fixed
URL-concept mapping to the flexibility provided by the
taxonomy-including algorithms of [23,24,11] without an
explosion in the search space and unclear semantics for the
mining in concept space. The experience with our data suggests
that the standard mechanisms would not be sufficient
to solve this problem (in fact, the problem of "overgeneralized"
patterns [11] never occurred in our data). Moreover,
relying on the common statistical measures of interestingness
eliminates nearly all AP-frequent individual patterns
and therefore limits the expressive power of AP-IP significantly.
On the other hand, many of these IPs are
very rare, and their high number is a major limitation for
the present algorithm. Adequate measures of pattern interestingness
need to be defined in order to address this
problem. We expect that depending on domain and analysis
question, these methods may involve limiting the class of
patterns sought, more user interaction for the determination
of interesting abstract and individual patterns, or heuristics
such as parallel but different support thresholds.
Another research direction concerns the structure of the
patterns found. Frequent subgraph mining becomes worthwhile
if rings and other cyclic patterns are frequent in the
data. If most frequent patterns are chains or trees (as was
the case in the present data), sequence and tree mining may
be a more efficient choice for analysis. This is exploited
in [22], which searches for patterns of increasing complexity
and can significantly speed up mining. Corresponding
extensions of AP-IP will be investigated in future work.
An inspection of single sessions in item and concept space
suggests yet a different approach. Rings of diagnoses in
our log often contained "other" contents such as navigation/search
pages in between. Thus, wildcard options as in
[9] would be very useful for identifying patterns.
Last but not least, interesting questions arise with the
increasing dynamics of the Web: when content changes, it
suffices to extend the URL-concept mapping, but what is
to be done when semantics evolve? One approach would be
to use ontology mapping techniques to make graph patterns
comparable and to store more and more abstracted representations
of patterns as they move further into the past.
8. REFERENCES
[1] Agrawal, R., Imielinski, T., & Swami, A.N. (1993).
Mining association rules between sets of items in large
databases. In Proc. ACM SIGMOD Int. Conf. Mgt. Data
(pp. 207–216).
[2] Berendt, B. (2002). Using site semantics to analyze,
visualize and support navigation. Data Mining and
Knowledge Discovery, 6(1):37–59.
[3] Berendt, B., Hotho, A., & Stumme, G. (2004). Usage
mining for and on the semantic web. In H. Kargupta et
al. (Eds.), Data Mining: Next Generation Challenges
and Future Directions. Menlo Park, CA: AAAI/MIT
Press.
[4] Berendt, B., & Spiliopoulou, M. (2000). Analysis of
navigation behaviour in web sites integrating multiple
information systems. The VLDB Journal, 9(1):56–75.
[5] Borgelt, C., & Berthold, M.R. (2002). Mining
molecular fragments: Finding relevant substructures of
molecules. In Proc. ICDM (pp. 51–58).
[6] Dai, H., & Mobasher, B. (2001). Using ontologies to
discover domain-level web usage profiles. In Proceedings
of the Second Semantic Web Mining Workshop at
PKDD 2001.
[7] Deshpande, M., Kuramochi, M., & Karypis, G. (2002).
Automated approaches for classifying structures. In
Proc. BIOKDD'02 (pp. 11–18).
[8] Eirinaki, M., Vazirgiannis, M., & Varlamis, I. (2003).
Sewep: Using site semantics and a taxonomy to enhance
the web personalization process. In KDD2003 (pp.
99–108).
[9] Hofer, H., Borgelt, C., & Berthold, M.R. (2003). Large
scale mining of molecular fragments with wildcards. In
Advances in Intelligent Data Analysis V (pp. 380–389).
LNCS.
[10] Huan, J., Wang, W., & Prins, J. (2003). Efficient
mining of frequent subgraphs in the presence of
isomorphisms. In Proc. ICDM (pp. 549–552).
[11] Inokuchi, A. (2004). Mining generalized substructures
from a set of labeled graphs. In Proc. ICDM'04 (pp.
414–418).
[12] Inokuchi, I., Washio, T., & Motoda, H. (2000). An
apriori-based algorithm for mining frequent
substructures from graph data. In Proc. PKDD 2000
(pp. 13–23).
[13] Inokuchi, I., Washio, T., Nishimura, K., & Motoda, H.
(2002). A fast algorithm for mining frequent connected
subgraphs. Research Report, IBM Research, Tokyo.
[14] Jin, X., Zhou, Y., & Mobasher, B. (2004). A unified
approach to personalization based on probabilistic latent
semantic models of web usage and content. In [21].
[15] Kralisch, A., & Berendt, B. (2004). Cultural
determinants of search behaviour on websites. In Proc.
IWIPS 2004 (pp. 61–74).
[16] Kralisch, A., Eisend, M., & Berendt, B. (2005).
Impact of culture on website navigation behaviour. In
Proc. HCI-International 2005.
[17] Kuramochi, M., & Karypis, G. (2001). Frequent
subgraph discovery. In Proc. ICDM (pp. 313–320).
[18] McEneaney, J.E. (2001). Graphic and numerical
methods to assess navigation in hypertext. Int. J. of
Human-Computer Studies, 55, 761–786.
[19] Meinl, Th., Borgelt, Ch., & Berthold, M.R. (2004).
Mining fragments with fuzzy chains in molecular
databases. In Proc. Workshop W7 on Mining Graphs,
Trees and Sequences at ECML/PKDD 2004 (pp. 49–60).
[20] Meo, R., Lanzi, P.L., Matera, M., & Esposito, R.
(2004). Integrating web conceptual modeling and web
usage mining. In Proc. of the WebKDD Workshop on
Web Mining and Web Usage Analysis (pp. 105–115).
[21] Mobasher, B., Anand, S.S., Berendt, B., & Hotho, A.
(Eds.) (2004). Semantic Web Personalization. Papers
from the AAAI Workshop. Technical Report WS-04-09.
Menlo Park, CA: AAAI Press.
[22] Nijssen, S., & Kok, J.N. (2004). A quickstart in
frequent structure mining can make a difference. In
Proc. SIGKDD'04 (pp. 647–652).
[23] Srikant, R., & Agrawal, R. (1995). Mining generalized
association rules. In Proceedings of the 21st International
Conference on Very Large Databases (pp. 407–419).
[24] Srikant, R., & Agrawal, R. (1996). Mining sequential
patterns: Generalizations and performance
improvements. In Proc. EDBT (pp. 3–17).
[25] Tauscher, L., & Greenberg, S. (1997). Revisitation
patterns in World Wide Web navigation. In Proc.
CHI'97.
[26] Yan, X., & Han, J. (2002a). gSpan: Graph-based
substructure pattern mining. In Proc. ICDM (pp.
51–58).
[27] Yan, X., & Han, J. (2002b). gSpan: Graph-based
substructure pattern mining. Technical Report
UIUCDCS-R-2002-2296, Dept. of Computer Science,
Univ. of Illinois at Urbana-Champaign.
[28] Zaïane, O.R., Xin, M., & Han, J. (1998). Discovering web
access patterns and trends by applying OLAP and data
mining technology on web logs. In Proc. ADL'98 (pp.
19–29).
[29] Zaki, M.J. (2002). Efficiently mining trees in a forest.
In Proceedings of SIGKDD'02.
[30] Zaki, M.J., Parthasarathy, S., Ogihara, M., & Li, W.
(1997). New algorithms for fast discovery of association
rules. In Proc. 3rd Intl. Conf. KDD (pp. 283–296).
AAAI Press.