The semantics of frequent subgraphs:

Mining and navigation pattern analysis

Bettina Berendt

Institute of Information Systems, Humboldt University Berlin
Spandauer Str. 1
D-10178 Berlin, Germany

http://www.wiwi.huberlin.de/berendt

ABSTRACT

The search for frequent subgraphs is a useful extension of common
approaches in Web mining. For example, it allows the study of
revisitation patterns in Web usage and the discovery of richer
navigation structures such as "landmarks" or "hubs" that serve to
organize a user's conceptual map of a site or a part of the Web. Any use
of graph structures in Web usage mining, however, should also take into
account that it is essential to integrate background knowledge into the
analysis, and that behaviour must be studied at different levels of
abstraction. To capture these needs, we propose to use taxonomies in
mining and to extend the standard interestingness notions of
frequency/support by the notion of context-induced interestingness. The
AP-IP mining problem then consists of finding all frequent abstract
patterns and the individual patterns that constitute them and are
therefore interesting in this context (even though they may be
infrequent). The paper presents the AP-IP algorithm, which uses a
taxonomy to search for the abstract and individual patterns. We also
show that the search for label-abstracted but isomorphic subgraphs does
not always give an accurate image of navigation strategies, and we
develop a procedure for mining at the concept level to solve this
problem. A case study of a real-life Web site shows the advantages of
the proposed solutions.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications - data mining;
H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia -
navigation, user issues

Keywords

Web mining, graph mining, integrating Web content and semantics into
Web mining

Permission to make digital or hard copies of all or part of this work
for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage
and that copies bear this notice and the full citation on the first
page. To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee.
WebKDD'05, August 21, Chicago, Illinois, USA.
Copyright 2005 ACM, ISBN: 1595932143, $5.00.

1. INTRODUCTION

Background knowledge is an invaluable help in the mining of Web data.
Examples include the use of ontologies for text mining and the
exploitation of anchor texts for crawling, indexing, and ranking. In Web
usage mining, background knowledge is needed because the raw data
consist of administration-oriented URLs, whereas site owners and
analysts are interested in events in the application domain.

In the past years, various algorithms for including background knowledge
in mining have been proposed. In particular, the use of taxonomies in
association rule, sequence, and graph mining has been investigated. In
Web usage mining, taxonomies describe the application events by concepts
that abstract from specific URLs and that form one or more concept
hierarchies.

In the raw data, each individual set, sequence, or graph of items is
generally very infrequent. When relying on statistical measures of
interestingness like support and confidence, the analyst either finds no
patterns or a huge and unmanageable number. Taxonomies solve this
problem and generally produce a much more manageable number of patterns
with high values of the statistical measures. In addition, they allow
the analyst to add semantic measures of interestingness into the mining
process.

A number of powerful and efficient algorithms and tools exist that can
mine patterns determined by a wide range of measures. However, they
ignore an important source of interest: taxonomic context. What is
behind a frequent abstract pattern: what types of individual
items/behaviour constitute it, and how are they distributed (e.g.,
equally distributed or obeying Zipf's law)? To make patterns with
interestingness induced by the context of their abstraction visible, a
mining tool must support a "drill-down" into the frequent patterns it
has found, and it should provide simultaneous detail-and-context views
on patterns.

This notion of detail and context has a number of applications that
include a deepened understanding of data and a highly sensitive
monitoring of temporal changes in a volatile Web. By allowing the
analyst to focus on semantically interesting areas (abstract patterns),
it can support the identification of patterns like an abstract-level
stability with a simultaneous individual-level drift, even if this drift
remains below the chosen statistical thresholds, or if this drift
consists of a change in the distribution of the individual patterns that
constitute the abstract pattern. Examples are variations induced by
seasonal influences or changes in a site's catalog. On the other hand,
there may also be variations on the abstract level, such as changing
search behaviour due to the increasing Internet sophistication of a
site's users.

Such detailed analysis of patterns becomes particularly interesting when
patterns reveal a lot of structure, as in sequence or in graph mining.
Viewing Web navigation as a graph is interesting because it gives
insight into the "mental map" of a site created by its users: What pages
serve as orientation points to organize the site, which areas are
traversed in linear fashion? Do users cycle between content, get lost in
dead-end streets, or do they return to previously visited content in
order to start new and better-informed searches from there? Last but not
least, important differences between user groups (e.g., novices and
experts) can be found by looking at their navigation graphs. User
movements can also be seen as votes for content relevance, thus helping
mining and ranking based on the hyperlink graph structure (such as
recent PageRank extensions).

The algorithm described in the present paper provides a solution to
these problems. It combines the need to mine at abstract levels to find
statistically frequent patterns and the interest in what is behind those
patterns, by giving the possibility of "drilling down" into the abstract
patterns in order to get information about their most typical individual
instances. To do so, it introduces a new definition of abstract
patterns, individual patterns, and the relation between them. The
algorithm is implemented in the AP-IP tool that visualizes the results
as context and detail. In addition, we argue that a conceptual view of
navigation can lead to patterns that have a different graph structure
and are more informative, but that cannot be found by traditional mining
in item space. We therefore extend the scheme to mining in concept
space.

The paper is organized as follows: Section 2 gives a short overview of
related work. Section 3 describes the algorithm. Section 4 describes
selected results from a case study of navigation data from an
information site, and it motivates the mining in concept space that is
described in more detail in Section 5 and applied to the case study in
Section 6. Section 7 concludes the paper and gives an outlook on future
research.

2. RELATED WORK

The problem of frequent subgraph mining has been studied intensively in
the past years, with many applications in bioinformatics/chemistry,
computer vision and image and object retrieval, and Web mining.
Algorithms are based on the a priori principle stating that a graph
pattern can only be frequent in a set of transactions if all its
subgraphs are also frequent. Frequent subgraphs are generated during a
breadth-first search following the Apriori algorithm [1] or in a
depth-first fashion following Eclat [30]. The main problem of frequent
subgraph mining is that two essential steps involve expensive
isomorphism tests: First, during candidate generation the same graph may
be generated multiple times and has to be compared to already generated
candidates; second, the candidates have to be embedded in the
transactions to compute support.

Several algorithms have been developed to discover arbitrary connected
subgraphs. One research direction has been to explore ways of computing
canonical labels for candidate graphs to speed up duplicate-candidate
detection [17, 26]. Various schemes have been proposed to reduce the
number of generated duplicates [26, 10]. To deal with the embedding
problem (NP-complete in the general form), [7] have proposed to reduce
the number of embeddings, while [5] store them to allow a fast parallel
generation of new candidates.

Other algorithms constrain the form of the sought subgraphs: [12, 13]
search for induced subgraphs (a pattern can only be embedded in a
transaction if its nodes are connected by the same topology in the
transaction as in the pattern). [22] exploit the observation that most
patterns are rather simple, searching first for sequences, then for
trees, and then for cyclic graphs.

Going beyond exact matching, [9] preprocess their data to find rings
which they treat as special features and introduce wildcards for node
labels, and [19] find "fuzzy chains", sequences with wildcards.

The qualitative and quantitative analysis of single users' navigation
graphs has a long tradition (cf. the transfer of Web graph metrics to
navigation graphs [18]), but the analysis of (general) graph patterns
via mining many users' navigation graphs has received less attention. A
number of studies have employed tree mining (e.g., [29]) and extensions
of sequence mining [2]. The latter allows one to detect patterns of
repeated visits to pages/concepts and cycling behaviour. The former
allows one to detect "hubs" of navigation in the sense of [25]: centers
of breadth-first searches.

Taxonomies have been used to find patterns at different levels of
abstraction for differently-structured data. In [23, 24], frequent
itemsets/association rules and frequent sequences are identified at the
lowest possible levels of a taxonomy. Concepts are chosen dynamically,
which may lead to patterns linking concepts at different levels of
abstraction (e.g., "People who bought 'Harry Potter 2' also bought books
by Tolkien"). In [11], this idea was transferred to frequent induced
subgraph mining.

In Web usage mining, the use of concepts abstracting from URLs is common
(see [3] for an overview and the papers in [21] for recent examples of
the uses for personalization). The general idea is to map a URL to the
most significant contents and/or services that are delivered by this
page, expressed as concepts. One page may be mapped to one or to a set
of concepts. For example, it may be mapped to the set of all concepts
and relations that appear in its query string. Alternatively, keywords
from the page's text and from the pages linked with it may be mapped to
a domain ontology, with a general-purpose ontology like WordNet serving
as an intermediary between the keywords found in the text and the
concepts of the ontology [8].

After the preprocessing steps in which access data are mapped into
taxonomies, subsequent mining techniques can use these taxonomies
statically or dynamically. In static approaches, mining operates on
concepts at a chosen level of abstraction; each request is mapped to
exactly one concept or exactly one set of concepts. This approach is
usually combined with interactive control of the software, so that the
analyst can re-adjust the chosen level of abstraction after viewing the
results (e.g., in the miner WUM; see [4] for a case study).
Alternatively, the taxonomy is used dynamically as in [23, 24, 11].

The subsumption hierarchy of an existing ontology can be employed for
the simultaneous description of user interests at different levels of
abstraction as in [20]. In [6], a scheme is presented for aggregating
towards more general concepts when an explicit taxonomy is missing.
Clustering of sets of sessions identifies related concepts at different
levels of abstraction. Sophisticated forms of content extraction are
made possible by latent semantic models [14]. In [8], association rules,
taxonomies, and document clustering are combined to generate
concept-based recommendation sets. Some approaches use multiple
taxonomies related to OLAP: objects (in this case, requests or requested
URLs) are described along a number of dimensions, and concept
hierarchies or lattices are formulated along each dimension to allow
more abstract views [28].

Algorithm AP-IP($D, T, m, \sigma, K$) (AP-IP frequent subgraph mining)
 1: $CIP_1 \leftarrow \emptyset$, $CAP_1 \leftarrow \emptyset$
 2: for each edge $e$ in $D$ do
 3:     $cip_1 \leftarrow \{e\}$
 4:     $CIP_1 \leftarrow CIP_1 \cup \{cip_1\}$
 5:     apip-gen($cip_1$)
 6: $FAP_1 \leftarrow \{cap_1 \in CAP_1 \mid |\bigcup_{cip_1 \in cap_1.IPs} cip_1.TID| \geq \sigma |D|\}$
 7: $FIP_1 \leftarrow \{cip_1 \in CIP_1 \mid \exists fap_1 \in FAP_1 : cip_1 \in fap_1.IPs\}$
 8: $k \leftarrow 2$
 9: while $FAP_{k-1} \neq \emptyset$ and $k \leq K$ do
10:     $CIP_k \leftarrow \emptyset$, $CAP_k \leftarrow \emptyset$
11:     for each $fip_{k-1} \in FIP_{k-1}$ do
12:         for each $fip_1 \in FIP_1$ that can be grown from $fip_{k-1}$ as described in [26] do
13:             $cip_k \leftarrow fip_{k-1} \cup fip_1$
14:             if $cip_k$ has support > 0 and it is not automorphic to any element of $CIP_k$ then
15:                 $CIP_k \leftarrow CIP_k \cup \{cip_k\}$
16:                 apip-gen($cip_k$)
17:     $FAP_k \leftarrow \{cap_k \in CAP_k \mid |\bigcup_{cip_k \in cap_k.IPs} cip_k.TID| \geq \sigma |D|\}$
18:     $FIP_k \leftarrow \{cip_k \in CIP_k \mid \exists fap_k \in FAP_k : cip_k \in fap_k.IPs\}$
19:     $k \leftarrow k + 1$

Procedure apip-gen($cip_k$) (for simplicity of notation, $CIP_k$, $CAP_k$, $T$, and $m$ are assumed global)
 1: $cap_k \leftarrow$ abstraction($cip_k, T, m$)
 2: for each $g \in CAP_k$ s.t. $g$ and $cap_k$ have the same number of nodes and edges and the same node labels and degrees do
 3:     if $cap_k$ is a known automorphism of $g$ then
 4:         $g.IPs \leftarrow g.IPs \cup \{cip_k\}$
 5:         return
 6: for each $g \in CAP_k$ s.t. $g$ and $cap_k$ have the same number of nodes and edges and the same node labels and degrees do
 7:     if $cap_k.cl = g.cl$ then
 8:         $g.IPs \leftarrow g.IPs \cup \{cip_k\}$
 9:         return
10: $cap_k.IPs \leftarrow \{cip_k\}$
11: $CAP_k \leftarrow CAP_k \cup \{cap_k\}$

3. AP-IP FREQUENT SUBGRAPH MINING

Based on the requirements described in the introduction, we formulate
the AP-IP mining problem: Given a dataset of graph transactions $D$
consisting of items with labels from a set $I$, a taxonomy $T$
consisting of concepts from a set $C$, a mapping $m: I \mapsto C$, and a
minimum support $\sigma \in [0, 1]$, find all frequent abstract
subgraphs and their corresponding AP-frequent individual subgraphs.

Definition 1. An abstract subgraph is a connected graph $G = (V, E)$
with node labels given by $l^a_G: V \mapsto C$. A frequent abstract
subgraph is one that can be embedded in at least $\sigma |D|$
transactions.

An AP-frequent individual subgraph of $G$ is defined as follows: (a) It
is a graph $G' = (V', E')$ with node labels given by
$l^i_{G'}: V' \mapsto I$. (b) There exists a frequent abstract subgraph
$G$ such that the graph $G'' = (V', E')$ with node labels given by
$m: I \mapsto C$ with $(m \circ l^i_{G'})(v') = m(l^i_{G'}(v'))$, is an
automorphism of $G$. $G''$ is also called the abstraction of $G'$ with
respect to $T$ and $m$. (c) $G'$ is automorphic to at least one subgraph
of at least one transaction.

$G$ can be embedded in a transaction $d \in D$ if it has an AP-frequent
individual subgraph $G'$ such that $G'$ is automorphic to at least one
subgraph of $d$.

Frequent subgraphs are also called "patterns", or APs (abstract
patterns) and IPs (individual patterns).

An example shows the role of abstract and individual patterns: If 10% of
the transactions contain the chain (Harry Potter 1, Lord of the Rings),
5% contain (Harry Potter 2, The Hobbit), and 7% contain (Harry Potter 1,
Lord of the Rings, The Hobbit, Harry Potter 2) (all sets disjoint), and
the minimal support is 20%, then (Rowling book, Tolkien book) has
support 22% and is a frequent abstract pattern. Its AP-frequent
individual patterns are (Harry Potter 1, Lord of the Rings) (17%) and
(Harry Potter 2, The Hobbit) (12%).
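
The arithmetic of this example can be retraced with a small sketch. The
dataset of 100 transactions and the transaction IDs are assumptions made
only so that one transaction corresponds to one percentage point; the
variable names are illustrative and do not come from the AP-IP tool.

```python
# Hypothetical toy dataset: 100 transactions with IDs 0..99 (an assumption).
N_TRANSACTIONS = 100
MIN_SUPPORT = 0.20

# Three disjoint groups of transactions, as stated in the example.
only_hp1_lotr = set(range(0, 10))      # 10%: (Harry Potter 1, Lord of the Rings)
only_hp2_hobbit = set(range(10, 15))   # 5%: (Harry Potter 2, The Hobbit)
four_book_chain = set(range(15, 22))   # 7%: chain containing both two-book chains

# TID sets of the two individual patterns (IPs).
ip_tids = {
    ("Harry Potter 1", "Lord of the Rings"): only_hp1_lotr | four_book_chain,
    ("Harry Potter 2", "The Hobbit"): only_hp2_hobbit | four_book_chain,
}

# The abstract pattern is supported by every transaction that contains
# at least one of its IPs, i.e. by the union of the IPs' TID sets.
ap_tids = set().union(*ip_tids.values())
ap_count = len(ap_tids)
ip_counts = {ip: len(tids) for ip, tids in ip_tids.items()}
ap_is_frequent = ap_count >= MIN_SUPPORT * N_TRANSACTIONS

print(ap_count, ip_counts, ap_is_frequent)
```

Neither individual pattern reaches the 20% threshold on its own, yet both
are AP-frequent because their common abstraction does.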

The algorithm to solve this problem, AP-IP, is shown in the table above.

The following notation is used: a k-(sub)graph $g_k$ is a (sub)graph
with $k$ edges. $CIP_k$ ($CAP_k$) is a set of IP (AP) candidates with
$k$ edges. $FAP_k$ ($FIP_k$) is the set of frequent abstract subgraphs
(AP-frequent individual subgraphs) with $k$ edges. $cip_k$, $cap_k$,
$fip_k$, $fap_k$ are elements of these sets. $K$ is the maximal number
of edges in patterns sought. For simplicity of notation, we assume that
$K \geq 2$. $g.cl$ is the canonical label of $g$. $g.IPs$ are the
AP-frequent individual subgraphs of $g$. $g.TID$ is the transaction-ID
list of $g$, i.e. the ordered list of transactions that contain $g$.

AP-IP uses and extends various ideas to achieve a fast processing of
graphs:

Search space traversal. The search for patterns proceeds breadth-first
(see AP-IP, step 9./19.). This simplifies the search for possible
duplicate APs (see below). Pattern search terminates when no more
patterns exceed the support threshold, or when the maximal sought
pattern size is exceeded.

IP candidate generation. In each round, individual-subgraph candidates
are generated: all subgraphs that occur at least once in the data. For
$k = 1$, these are all single edges in the dataset (AP-IP, 3., 4.). For
larger $k$, a pattern candidate is any set of $k$ edges that can be
grown from a frequent pattern of size $k - 1$ by extending it by an
extra edge (AP-IP, 12., 13., 15.), eliminating those candidates that
have no support in the data and those that have been discovered before
(AP-IP, 14.).

To reduce the number of generated candidates, frequent patterns are
extended by one edge at a time ("growing"), and growing occurs only from
the rightmost path [26]. For a proof of completeness, see [27].
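
The one-edge growing step can be illustrated as follows. This is a
deliberately simplified sketch: it extends a pattern by any frequent
single edge that shares a node with it and omits the rightmost-path
restriction of [26], so it generates more duplicates than AP-IP would.
The function name and the edge representation (sorted pairs of node
labels) are assumptions for illustration.

```python
def grow_candidates(pattern, frequent_edges):
    """Grow a k-edge pattern into (k+1)-edge candidates.

    pattern: tuple of edges, each edge a sorted pair of node labels.
    frequent_edges: iterable of frequent 1-edge patterns (same edge format).
    Returns a set of candidates, each a sorted tuple of edges.
    Simplification: no rightmost-path restriction is applied.
    """
    pattern_edges = set(pattern)
    pattern_nodes = {node for edge in pattern for node in edge}
    candidates = set()
    for edge in frequent_edges:
        if edge in pattern_edges:
            continue  # the edge is already part of the pattern
        if pattern_nodes.intersection(edge):
            # the new edge touches the pattern, so the result stays connected
            candidates.add(tuple(sorted(pattern + (edge,))))
    return candidates


one_edge_patterns = [("a", "b"), ("b", "c"), ("c", "d")]
print(grow_candidates((("a", "b"),), one_edge_patterns))
```

Because candidates are stored in sorted form inside a set, the same
pattern reached from two different parents appears only once.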

Individual-level bijective label functions. The IP candidates are
treated as sets of edges because it is assumed that the graph of each
transaction in $D$ has a bijective label function $l^i$.

This assumption is motivated by the domain: Graph analysis only becomes
meaningful if each node is treated as a unique "location in hyperspace"
which can be revisited. Thus, in any transaction/session each potential
URL was either requested or not, so each transaction's nodes (edges) are
a subset of $I$ ($I \times I$).

This assumption implies that AP-IP can be transferred to other domains
in which this unique-names assumption also holds. So for example, AP-IP
could be used for texts dealing with proper names, but not for most
molecular-mining tasks. AP-IP could easily be extended to also discover
patterns in transactions with a non-bijective label function, but it
would require the same automorphism tests for duplicate detection as
those carried out for APs, and it would change the tests for embeddings
in the data. The effects on runtime remain to be investigated.

IP candidate duplicate removal. Given the bijective label function, an
IP's lexicographically sorted edges, labeled by their node pair, define
a canonical form that can be computed efficiently (list sorting in
AP-IP, 13.) and allows for efficient duplicate detection (comparison of
two ordered lists, AP-IP, 14.).
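
A minimal sketch of such a canonical form, assuming undirected edges
whose nodes carry unique string labels: each edge is stored as a sorted
pair, and a pattern as a sorted tuple of edges. The helper names and the
use of a Python set as the duplicate index are illustrative assumptions.

```python
import bisect


def normalize_edge(u, v):
    """Store an undirected edge as a sorted pair of node labels."""
    return (u, v) if u <= v else (v, u)


def add_edge(canonical, u, v):
    """Return the canonical form of the pattern extended by edge (u, v).

    canonical: sorted tuple of normalized edges.
    The new edge is inserted at its sorted position, so no re-sort is needed.
    """
    edge = normalize_edge(u, v)
    position = bisect.bisect_left(canonical, edge)
    if position < len(canonical) and canonical[position] == edge:
        return canonical  # edge already present
    return canonical[:position] + (edge,) + canonical[position:]


# Two growth orders of the same pattern yield the same canonical form,
# so a set lookup is enough to detect the duplicate.
first = add_edge(add_edge((), "b", "a"), "c", "b")
second = add_edge(add_edge((), "b", "c"), "a", "b")
seen = {first}
print(first, second in seen)
```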

IP candidate support counting and pruning (part I). Bijective label
functions also allow for efficient identification of embeddings: a graph
$G'$ is a subgraph of $d$ iff the canonical form of $G'$ is a subset of
the canonical form of $d$. Each edge is initialized with a sorted list
of all transaction IDs that contain this edge (not shown in the
pseudocode; see for example [17]). Each IP candidate's TID list is then
the intersection of all of its edges' TID lists. This can be computed in
one step by intersecting two sorted lists: the TID lists of the first
$fip_{k-1}$ and $fip_1$ that are found to form this candidate (AP-IP,
13.).
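
Both operations can be sketched briefly. The function names are
hypothetical, and patterns and transactions are assumed to be given as
collections of normalized edges (sorted pairs of unique node labels).

```python
def intersect_tid_lists(a, b):
    """Intersect two ascending TID lists in a single merge pass."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def is_embedded(pattern_edges, transaction_edges):
    """With unique node labels, embedding reduces to an edge-subset test."""
    return set(pattern_edges) <= set(transaction_edges)


# TID list of a grown candidate: parent's TID list intersected with the
# TID list of the added single edge.
parent_tids = [1, 3, 5, 8]
edge_tids = [2, 3, 8, 9]
print(intersect_tid_lists(parent_tids, edge_tids))
```

A candidate whose resulting TID list is empty has no support and is
dropped without ever accessing the transactions themselves.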

Support pruning at the IP level filters out all IPs that would in the
best case not contribute support to their AP abstractions and in the
worst case lead to the generation of redundant, zero-support APs.

One structural and one semantic type of pattern were deemed
uninteresting for the domain and are filtered out to increase
efficiency: self-loops and edges involving items that are not mapped to
a concept by the taxonomy (in AP-IP, 3.; not shown in the pseudocode).

AP candidate generation and duplicate removal. Next, the AP candidate
corresponding to the identified individual pattern candidate is formed.
First, the individual node labels are mapped to their abstractions
according to the taxonomy $T$ and the concepts from $T$ that are chosen
by $m$ (apip-gen, 1.).
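
The abstraction step can be sketched as a relabeling that keeps the
graph structure. In this sketch the mapping m is a plain dictionary from
items to concepts and the AP is returned as a list of node labels plus a
list of edges over node indices; both representations are assumptions
for illustration, not the data structures of the AP-IP tool.

```python
def abstraction(ip_edges, m):
    """Map an individual pattern to its abstract pattern.

    ip_edges: list of edges over item labels (unique per node).
    m: dict mapping each item to its concept.
    Returns (labels, edges): labels[i] is the concept of node i, and edges
    are sorted pairs of node indices. Several nodes may share one concept,
    so nodes are identified by index rather than by label.
    """
    node_index = {}
    labels = []
    edges = []
    for u, v in ip_edges:
        for item in (u, v):
            if item not in node_index:
                node_index[item] = len(labels)
                labels.append(m[item])
        a, b = node_index[u], node_index[v]
        edges.append((min(a, b), max(a, b)))
    return labels, sorted(edges)


# Hypothetical one-level taxonomy for the book example.
m = {
    "Harry Potter 1": "Rowling book",
    "Lord of the Rings": "Tolkien book",
    "The Hobbit": "Tolkien book",
}
ip = [("Harry Potter 1", "Lord of the Rings"), ("Lord of the Rings", "The Hobbit")]
print(abstraction(ip, m))
```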

Usually, there is more than one IP that maps to an AP. To simplify
duplicate checking, each IP is mapped to and each AP is stored as a
sparse adjacency-list representation. To limit the number of tests, a
newly considered AP candidate is compared only to known candidates to
which it can be automorphic. This step relies on vertex invariants, i.e.
standard procedures from isomorphism testing (apip-gen, 2., 6.). First,
an equality test may show that the new candidate has already been found
in the same order of node labels and edges, i.e. that it is a known
automorphism (apip-gen, 3.; see also [17]). Alternatively, the new
candidate may be an as-yet unknown automorphism. To verify this, its
canonical label is computed (apip-gen, 7.). Isomorphism/automorphism
testing by canonical-label comparison is generally, but of course not
always, fast; see the literature in Section 2. The canonical form is
determined by the maximum label sequence and the maximum upper
triangular of the adjacency matrix as in [13]. In either case, the IP is
added to the set of IPs contributing to the AP candidate (apip-gen,
4./5., 8./9.). The graph $cap_k$ is only registered as a new candidate
if both tests failed (apip-gen, 10., 11.).
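
The two-stage test (cheap vertex invariants first, canonical label only
if needed) can be illustrated with a brute-force sketch. It enumerates
all node permutations and keeps the largest code made of the label
sequence and the upper triangle of the adjacency matrix. This follows
the description above in spirit only; it is not the optimized procedure
of [13], and its factorial cost makes it usable for small patterns only.

```python
from itertools import permutations


def invariants(labels, edges):
    """Cheap filter: node count, edge count, multiset of (label, degree)."""
    degree = [0] * len(labels)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return len(labels), len(edges), tuple(sorted(zip(labels, degree)))


def canonical_label(labels, edges):
    """Brute-force canonical label of a small node-labeled undirected graph."""
    n = len(labels)
    edge_set = {frozenset(edge) for edge in edges}
    best = None
    for perm in permutations(range(n)):
        # perm[i] is the original node placed at position i
        label_sequence = tuple(labels[node] for node in perm)
        upper_triangle = tuple(
            1 if frozenset((perm[i], perm[j])) in edge_set else 0
            for i in range(n)
            for j in range(i + 1, n)
        )
        code = (label_sequence, upper_triangle)
        if best is None or code > best:
            best = code
    return best


# Same graph with two node orders ("S" in the middle of a chain) ...
g1 = (["D", "S", "D"], [(0, 1), (1, 2)])
g2 = (["S", "D", "D"], [(0, 1), (0, 2)])
# ... and a different graph ("S" at the end of the chain).
g3 = (["D", "D", "S"], [(0, 1), (1, 2)])
print(canonical_label(*g1) == canonical_label(*g2))
print(invariants(*g1) == invariants(*g3))
```

In this example the invariants already separate the third graph from the
first, so no canonical label would have to be computed for that pair.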

This procedure ensures that all abstract pattern candidates are found:
(a) all individual pattern candidates with at least one embedding in the
data are found (see above), (b) each individual pattern has a unique
abstraction, and (c) all individual patterns that map to the same AP are
collected into that AP's IPs set by the automorphism tests in apip-gen.

Support counting and pruning. The AP can be embedded in a transaction if
and only if at least one of its IPs can be embedded in that transaction,
which leads to the computation of its support as the cardinality of the
union of the IPs' TID lists (AP-IP, 6./17.).

The candidate APs with sufficient support values become frequent
abstract patterns (AP-IP, 6./17.). According to the definition of
AP-frequent individual patterns, only those cip that map to an abstract
pattern now identified as frequent are transferred to the set of
frequent individual patterns (AP-IP, 7./18.).
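
Steps 6./7. (and 17./18.) can be sketched as one pass over the AP
candidates. The dictionary-based bookkeeping, the function name, and the
pattern identifiers below are assumptions chosen for brevity.

```python
def select_frequent(cap_to_ips, ip_tids, n_transactions, min_support):
    """Select frequent APs and their AP-frequent IPs.

    cap_to_ips: dict mapping each AP candidate to the list of its IPs.
    ip_tids: dict mapping each IP to the set of transaction IDs containing it.
    Returns (fap, fip): fap maps each frequent AP to its absolute support,
    fip is the set of IPs belonging to some frequent AP.
    """
    fap = {}
    fip = set()
    for cap, ips in cap_to_ips.items():
        covered = set().union(*(ip_tids[ip] for ip in ips))
        if len(covered) >= min_support * n_transactions:
            fap[cap] = len(covered)
            fip.update(ips)  # kept even if an IP alone is infrequent
    return fap, fip


# Hypothetical candidates: one AP with two IPs, one AP with a single rare IP.
cap_to_ips = {"AP-1": ["ip-a", "ip-b"], "AP-2": ["ip-c"]}
ip_tids = {
    "ip-a": set(range(0, 17)),
    "ip-b": set(range(10, 22)),
    "ip-c": {40, 41, 42},
}
fap, fip = select_frequent(cap_to_ips, ip_tids, n_transactions=100, min_support=0.2)
print(fap, sorted(fip))
```

Only TID sets are touched here, which mirrors the observation that no
access to the transactions is needed at this stage.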

Support counting and pruning are efficient because they operate entirely
on the TID lists; no access to the data and no subgraph embedding are
required.

This completes one round; patterns of the next larger size can now be
examined (AP-IP, 8./19.).

Memory management. All (AP-)frequent patterns are output after they have
been computed. The AP-frequent individual patterns of size 1 must be
kept in memory because they are needed in each iteration (AP-IP, 12.),
and the AP-frequent individual patterns of size $k - 1$ are needed for
growing (AP-IP, 11.). All other patterns are discarded to save memory.

AP-IP also implements a frequency threshold ($\geq 1$) instead of the
support threshold mentioned above. The extension is straightforward
because the support test operates on transaction-ID lists anyway; it is
therefore not shown.

Figure 1: Top (a), (b): Impact of data set size and number of patterns
in isolation; pS = 0.8; pL = 0.05; minsupp = 0.05. Bottom (c), (d):
Combined impact of data set size and number of patterns; same
parameters.

In the current version, AP-IP treats only undirected graphs with node
labels. The generalization to directed as well as node-and-edge labeled
graphs is straightforward.

In principle, AP-IP could also work "backwards", by growing at the AP
level and then instantiating to IPs. This would correspond to
taxonomy-free frequent subgraph mining algorithms. However, this would
require an investigation of at least the same number of IPs (since all
of them need to be generated as instantiations, and possibly further
ones that also match a grown AP), and it would require either
sophisticated bookkeeping of the growing history or an isomorphism check
with the AP from which an IP candidate was generated. Simulation results
indicate that this guaranteed additional effort is not offset by any
savings in candidate generation and isomorphism testing at the AP level.

Time and space requirements. AP-IP's worst-case time and space
requirements are determined by (a) the size of the dataset, (b) the
number of patterns (abstract and individual), and (c) the diversity and
(d) the length of patterns. Other factors have indirect influences on
these magnitudes; in particular, a lower support threshold or a larger
concept branching factor generally leads to more, more diverse, and/or
longer patterns. However, more data also often lead to more patterns,
and more patterns are often longer patterns.

Described briefly, (a) influences time and space requirements linearly
because AP-IP operates only on the sorted TID lists. (b) has a linear
influence (CIP canonical form construction: adding an edge to a sorted
list; CIP duplicate checking: lookup in a hashtable indexed by canonical
form; CIP abstraction by mapping each node). The influence on space is
obvious because patterns are all that is stored (apart from TID lists).
The influence of the number of abstract patterns is governed by (c) and
(d). (c) can imply additional time during AP duplicate detection: when
many "similar-looking" APs (apip-gen, 2.) exist, and when IPs are
diverse so that many different automorphisms occur and require a
canonical-form computation. In principle, the number and type of
candidate patterns are decisive; in line with the literature and the
focus on data and patterns, we will however treat the (AP-)frequent
patterns as an approximation. (d) has a linear effect on space
requirements and on the IPs' time requirements, but can require more
than linear time because in AP canonical form construction, longer
patterns generally require the study of more permutations.

To investigate actual performance, simulated and real data were
analyzed. For reasons of space, we focus on performance with respect to
simulated data only in this section and on application-oriented results
with respect to real data in the next section. (Main performance results
were similar.)

Figure 2: Impact of pattern length: the steep curve in (b) belongs to
the long patterns in (a).

AP-IP was implemented in Java and run on a Pentium 4 3GHz processor, 1GB
main memory PC under Debian Linux. The simulated data were generated by
a first-order Markov model (a popular model of Web navigation) varying
the number of transactions/sessions, the branching factor b of the
concepts in a one-level taxonomy (and thus the number of patterns found
in a total of 100 "URLs"), the diversity of patterns (each node had a
parametrized transition probability pS to one other, randomly chosen
node and an equal distribution of transition probabilities to all other
nodes), and the length of patterns (the transition probability pL to
"exit": the average session length becomes 1/pL, and pattern length
increases with it).
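
A generator of this kind might look as follows. The sketch is not the
simulator used for the experiments: how the exit probability is combined
with pS, the uniform choice of the start page, and the exclusion of
self-transitions are all assumptions, as are the function and parameter
names.

```python
import random


def simulate_sessions(n_sessions, n_urls=100, p_s=0.8, p_l=0.05, seed=0):
    """Generate sessions from a first-order Markov model (illustrative only).

    After each request the session ends with probability p_l (assumption:
    the exit decision is taken first). Otherwise the walk moves to the
    node's preferred successor with probability p_s, or to a uniformly
    chosen other node.
    """
    rng = random.Random(seed)
    # One randomly chosen preferred successor per node, never the node itself.
    preferred = [
        rng.choice([v for v in range(n_urls) if v != u]) for u in range(n_urls)
    ]
    sessions = []
    for _ in range(n_sessions):
        current = rng.randrange(n_urls)
        session = [current]
        while rng.random() >= p_l:  # continue with probability 1 - p_l
            if rng.random() < p_s:
                current = preferred[current]
            else:
                others = [
                    v for v in range(n_urls)
                    if v != current and v != preferred[current]
                ]
                current = rng.choice(others)
            session.append(current)
        sessions.append(session)
    return sessions


sessions = simulate_sessions(2000, seed=1)
average_length = sum(len(s) for s in sessions) / len(sessions)
print(round(average_length, 1))  # close to 1 / p_l = 20 on average
```

Under these assumptions the number of requests per session is
geometrically distributed with mean 1/pL, which matches the stated
average session length.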

First, the influences of data set size and number of patterns were
studied in isolation: 1000 sessions were duplicated to generate 2000
sessions with the same number of patterns, concatenated 3 times to
generate 3000 sessions, etc. Figure 1 (a) shows that time is linear in
data set size, and that the slope increases with the number of patterns
(induced by increasing the concept branching factor). Figure 1 (b) plots
the same data (plus one run with an intermediate branching factor) by
the number of patterns, keeping the number of sessions constant along
each line. It shows that in this setting, time is sublinear in the
number of patterns. Figure 1 (c) shows the combined effects. The thick
line illustrates the effect of adding 1000 more, 2000 more, etc.
different sessions, while the thin lines show the effect of only
duplicating the data set (as in (a)).

Figure 1 (d) plots the additional time (= the vertical difference
between thin-line endpoints and the thick line in (c)) against
additional patterns, revealing a linear effect.

Similar graphs are obtained for a wide variety of parameter settings,
and also for real data. However, time requirements increase strongly
when patterns get bigger: Fig. 2 (b) shows that the impact of additional
patterns rises more strongly for big than for small patterns (as =
average size/number of edges), which leads to a slightly above-linear
increase in time with respect to data set size. Since Web navigation
data generally exhibit "short" or "medium" patterns, this problem
concerns them only rarely.

The program required substantial memory (up to 500MB for the large
patterns in Fig. 2), which is very undesirable and partly due to Java's
incomplete garbage collection. We plan a reimplementation in C++ to
substantially improve both memory consumption and the absolute level of
runtimes.

4. CASE STUDY (1)

Data. Navigation data from a large and heavily frequented eHealth
information site were investigated. The site offers different searching
and browsing options to access a database of diagnoses (roughly
corresponding to products in an eCommerce product catalog). Each
diagnosis has a number of different information modes (textual
description, pictures, and further links), and diagnoses are hyperlinked
via a medical ontology (links go to differential diagnoses). The
interface is identical for each diagnosis, and it is identical in each
language that the site offers (at the time of data collection: English,
Spanish, Portuguese, and German).

The initial data set consisted of close to 4 million requests collected
in the site's server log between November 2001 and November 2002. In
order to investigate the impact of language and culture on (sequential)
search and navigation behaviour, the log was partitioned into accesses
by country, and the usual data preprocessing steps were applied [15,
16]. For the present study, the log partition containing accesses from
France was chosen as the sample log; the other country logs of the same
size exhibited the same patterns, differing only in the support values.
The sample log consisted of 20333 requests in 1397 sessions. The concept
hierarchy and two mappings m developed for [16] were used to ensure
comparability.

In the following, selected results from the analysis are reported that
outline the advantages of treating sessions and patterns as graphs
rather than as sets, bags, or sequences of requests. The focus is on
patterns with more than 2 nodes, since patterns with 2 nodes are by
definition chains and can also be found using sequence mining.

Basic statistics. The log is an example of concepts corresponding to a
high number of individual URLs (84 concepts with, on average, 82
individual URLs), and highly diverse behaviour at the individual level.
This is reflected in a high ratio of the number of IPs to APs (compared
to the simulated data above): At a support value of 0.2, there were 7
frequent APs of size 1-5 with 55603 AP-frequent IPs; at 0.1, the values
were: size 1-8, 34 APs, 414685 IPs; at 0.05, the values for K = 6 were:
104 APs, 297466 IPs. This search was not extended because no further
(semantically) interesting patterns were found.

Linear information gathering. With a minimal support of 0.2 or higher,
the only patterns with more than two nodes were chains of diagnoses.
These patterns were very frequent: a chain of 6 diagnoses had support
22.2% (5: 26.7%, 4: 32.7%, 3: 39.3%).

Lowering the support threshold showed that even longer chains were still
comparatively frequent (13.8% support for chains of 9 diagnoses) and
revealed that alphabetical search was the most frequent entry point to
these chains (alphabetical search followed by 1, 2, or 3 diagnoses had
19%, 15.5%, and 11.5% support).

Investigation of the most frequent individual patterns in alphabetical search showed a possible effect of layout: the top-level page of diagnoses starting with "A" was the most frequent entry point. While this pattern is in itself not too interesting, it can help to filter out accesses of "people just browsing" in order to focus further search for patterns on more directed information search.

Figure 3: Visualization of abstract patterns, subpatterns, superpatterns, and individual patterns.

Figure 3 shows a screenshot of the AP-IP visualization. In a preprocessing step, the patterns that only occurred once in the data were filtered out, which explains the comparatively low number of individual patterns associated with an abstract pattern (in the example: 47). It turned out that individual patterns with frequency = 1 constituted 99% of all patterns.

A Closer Look at Linear Navigation Patterns. The finding that people just "surfed through" long chains of diagnoses is surprising and difficult to reconcile with the knowledge about search behaviour collected in the direct observations of users, who usually targeted a small number of diseases for information collection. Also, the inspection of the most frequent individual patterns involving several diagnoses showed pairs or chains of two pictures of the same disease.

An inspection of single sessions revealed that visitors often return to the same content in the following sense: They request one picture, or the picture overview, of a diagnosis, then study a different diagnosis (e.g., a differential diagnosis), and then return to read, in detail, the text of the first diagnosis. In other words, although they visit 3 different URLs, forming the abstract pattern "diagnosis – diagnosis – diagnosis", they in fact gather information on the concepts "diagnosis 1 – diagnosis 2 – diagnosis 1".

This non-linear behaviour cannot be identified in the current framework. The next section will investigate the problem and propose a solution.

5. MINING IN CONCEPT SPACE

The problem of item-level patterns vs. concept-level patterns can be illustrated using the following example:

Consider two transactions $[a_1\ a_2\ a_3\ a_4]$ and $[a_5\ a_6\ a_7\ a_8]$ and let the $a_i$ be instances of the following concepts: $a_1, a_4 \mapsto b_1$, $a_2 \mapsto b_2$, $a_3 \mapsto b_3$, $a_5, a_8 \mapsto b_4$, $a_6 \mapsto b_5$, $a_7 \mapsto b_6$. All $b_j$ are instances of concept $c$. Let minsupp be 1. Using the framework of [23, 24] that is also the basis of [11], the only graphs that are frequent are the chains $c - c$, $c - c - c$, and $c - c - c - c$. This is because only graphs that are isomorphic to subgraphs of the original transactions can be patterns. (The problem does not occur in sequence mining, but an analogous transfer of the algorithm in [24] to the problem of subgraphs suggests that this isomorphism to subgraphs of items is intended. In [11], the problem is not discussed, but the article suggests that only isomorphic graphs can be concept-hierarchy abstractions of parts of the transactions.)

However, both first-level abstractions $b_1 - b_2 - b_3 - b_1$ and $b_4 - b_5 - b_6 - b_4$ form a ring of three distinct sub-concepts of $c$. Alternatively, they form rings of super-concepts of the $a_i$ involved.

In analogy to the use of the term "item" in [23], we call the standard form of mining with taxonomies "mining in item space". We will refer to the alternative form as "mining in concept space".

Figure 4: Nonlinear patterns found when mining in concept space.

To be able to use AP-IP, we translate the problem "finding all frequent concept-space abstract subgraphs in the set of transactions $D$" into the equivalent problem "finding all frequent item-space abstract subgraphs in the set of transactions $D'$", with $D'$ given by Definition 2.

Definition 2. The concept space representation $D'$ of a set of transactions $D$ is defined as follows: Let $r: I \mapsto C \cup I$ be a function s.t. for each $i$, either $r(i)$ is a descendant of $m(i)$ in the taxonomy $T$ as described in Definition 1 (this holds for at least one $i$), $r(i) = m(i)$, or $r(i) = i$. Let $D'$ be a set of transactions obtained from $D$ by, for each transaction $X$, (1) replacing the label function $l^i_X$ by $r \circ l^i_X$ and (2) coarsening the graph $X$ such that all vertices with the same label are merged and the label function is again bijective.
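
The two steps of Definition 2 can be sketched on the example transactions of this section. This is a minimal illustration, not the AP-IP implementation: it assumes that a transaction is given as a sequence of items, that consecutive items are joined by undirected edges, and that a plain dictionary stands in for $r$; the function names are hypothetical.

```python
def to_concept_space(transaction, r):
    """Relabel items with r and merge vertices that share a label.

    transaction: list of item identifiers in request order.
    r: dict from items to concept-space labels; items missing from the
       dict keep their own label (the case r(i) = i).
    Returns (vertices, edges), with undirected edges as frozensets.
    """
    labels = [r.get(item, item) for item in transaction]
    vertices = set(labels)  # equal labels collapse into one vertex
    edges = {
        frozenset((u, v))
        for u, v in zip(labels, labels[1:])
        if u != v  # no self-loops after merging
    }
    return vertices, edges


def is_ring(vertices, edges):
    """True if the graph is one cycle: connected, every vertex of degree 2."""
    if len(vertices) < 3 or len(edges) != len(vertices):
        return False
    neighbours = {v: set() for v in vertices}
    for edge in edges:
        u, v = tuple(edge)
        neighbours[u].add(v)
        neighbours[v].add(u)
    if any(len(n) != 2 for n in neighbours.values()):
        return False
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:  # depth-first traversal to check connectivity
        for w in neighbours[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices


# Mapping from the example: a1, a4 -> b1; a2 -> b2; a3 -> b3; and so on.
r = {"a1": "b1", "a4": "b1", "a2": "b2", "a3": "b3",
     "a5": "b4", "a8": "b4", "a6": "b5", "a7": "b6"}
t1 = ["a1", "a2", "a3", "a4"]
t2 = ["a5", "a6", "a7", "a8"]

ring_t1 = is_ring(*to_concept_space(t1, r))    # concept space
ring_t2 = is_ring(*to_concept_space(t2, r))
chain_t1 = is_ring(*to_concept_space(t1, {}))  # item space: r(i) = i
```

With the identity mapping, the first transaction stays a chain of four vertices; with the mapping above, both transactions become rings of three vertices, which is the structure that item-space mining misses.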

During mining, the function $m$ in Definition 1 must be replaced by a function $m'$ s.t. $m'(r(i)) = m(i)$. Different levels of abstraction for mining in concept space can be defined by varying the number of nodes on the paths between $r(i)$ and $m(i)$. In the following example, we keep both functions fixed, with a path distance of 1 for concepts of interest.
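
The condition $m'(r(i)) = m(i)$ can be read as a recipe for deriving $m'$ from $r$ and $m$. The sketch below assumes both mappings are available as dictionaries and uses a hypothetical helper name; it also checks that $m'$ is well defined, i.e. that no two items share a label $r(i)$ while having different abstract concepts $m(i)$.

```python
def derive_m_prime(items, r, m):
    """Build m' such that m'(r(i)) = m(i) for every item i.

    items: iterable of item identifiers.
    r: dict from items to concept-space labels (missing items map to themselves).
    m: dict from items to their abstract concepts.
    Raises ValueError if one label would need two different images.
    """
    m_prime = {}
    for i in items:
        label = r.get(i, i)
        target = m[i]
        # setdefault stores the first target seen and returns the stored one.
        if m_prime.setdefault(label, target) != target:
            raise ValueError(
                f"{label!r} would map to both {m_prime[label]!r} and {target!r}"
            )
    return m_prime


# Example from Section 5: every a_i is abstracted to concept c.
items = ["a1", "a2", "a3", "a4"]
r = {"a1": "b1", "a4": "b1", "a2": "b2", "a3": "b3"}
m = {i: "c" for i in items}
m_prime = derive_m_prime(items, r, m)
```

Here each $b_j$ is a direct child of $c$, which corresponds to a path distance of 1 between $r(i)$ and $m(i)$.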

When analyzing Web server log data, the transformation to concept space representation is trivial; it requires only one pass through the data to apply the mapping $r$, before the subsequent input routine transforms the sequential log format into graphs as in the item-space problem.

6. CASE STUDY (2)

The log was analyzed again in concept space. The function $r$ transformed all information types on a diagnosis X into "information on X", and mapped the search functions to their abstract values ("LOKAL1" = first step of localization search, etc.). Remaining URLs were left unchanged.
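
A mapping of this kind might look as follows. The URL scheme is entirely hypothetical, since the site's real URLs are not given here; only the three cases (diagnosis pages, localization search steps, everything else unchanged) are taken from the description above.

```python
import re

# Hypothetical URL patterns, invented for illustration only.
DIAGNOSIS_URL = re.compile(
    r"^/diagnosis/(?P<name>[^/]+)/(?:text|overview|picture\d*)$"
)
LOKAL_URL = re.compile(r"^/search/lokal/(?P<step>\d+)$")


def r(url):
    """Map a requested URL to its concept-space label."""
    match = DIAGNOSIS_URL.match(url)
    if match:
        # All information types on a diagnosis X become "information on X".
        return f"information on {match.group('name')}"
    match = LOKAL_URL.match(url)
    if match:
        # Steps of the localization search become LOKAL1, LOKAL2, ...
        return f"LOKAL{match.group('step')}"
    return url  # remaining URLs are left unchanged


session = [
    "/diagnosis/eczema/picture1",
    "/diagnosis/psoriasis/text",
    "/diagnosis/eczema/text",
]
labels = [r(url) for url in session]
```

The three distinct URLs of the toy session yield only two distinct labels, so the session reads as "diagnosis 1 – diagnosis 2 – diagnosis 1" rather than as a chain of three different pages.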

Information gathering in concept space. The analysis showed that most of the chains of diagnoses were in fact navigation structures which involved revisits to "key" diagnoses that served as "hubs" for navigation.

Chains of diagnoses lost support: a chain of 6 diagnoses had support 7.2% (5: 9.2%, 4: 13%, 3: 18.9%). Instead, patterns like diagnoses with 3 or more other diagnoses branching off the "hub" diagnosis shown at the left of Fig. 4 (support 5.3%) became frequent. Rings also occurred at slightly lower support thresholds (see Fig. 4, second from left).
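
How the support of such a hub pattern is counted can be sketched on toy data. The sessions and label names below are made up and are not taken from the log; the sketch further assumes undirected edges between consecutively requested concepts and checks only for one pattern, a diagnosis with at least three other diagnoses attached to it.

```python
from collections import defaultdict


def session_graph(labels):
    """Undirected adjacency sets built from a sequence of concept-space labels."""
    adjacency = defaultdict(set)
    for u, v in zip(labels, labels[1:]):
        if u != v:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


def has_hub(labels, is_diagnosis, spokes=3):
    """True if some diagnosis vertex has at least `spokes` diagnosis neighbours."""
    adjacency = session_graph(labels)
    return any(
        is_diagnosis(v) and sum(1 for w in nbrs if is_diagnosis(w)) >= spokes
        for v, nbrs in adjacency.items()
    )


def support(sessions, predicate):
    """Fraction of sessions for which the predicate holds."""
    return sum(1 for s in sessions if predicate(s)) / len(sessions)


def is_diagnosis(label):
    return label.startswith("D")  # toy convention: D1, D2, ... are diagnoses


toy_sessions = [
    ["D1", "D2", "D1", "D3", "D1", "D4"],        # D1 is a hub with 3 spokes
    ["D1", "D2", "D3", "D4"],                    # plain chain
    ["ALPH", "D1", "ALPH", "D2", "ALPH", "D3"],  # hub is a search page
    ["D5", "D6", "D5", "D7", "D5", "D8", "D9"],  # D5 is a hub with 3 spokes
]
hub_support = support(toy_sessions, lambda s: has_hub(s, is_diagnosis))
```

Because revisits to the same concept are merged into one vertex, the first and last toy sessions count as hubs, whereas in item space they would be chains of distinct URLs.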

Search options: linear vs. hub-and-spoke. Search in concept space also showed the functioning of search options more clearly.

First, the use of the search engine appeared only in very few patterns (only in 2-node patterns: 4.2% for search-engine and a diagnosis, 3.5% for search-engine and alphabetical search, probably a subsequent switch to the second, more popular search option). This was because the search engine was less popular than the other search options (used about 1/10 as often), but far more efficient in the sense that searches generally ended after the first diagnosis found (assuming that finding a diagnosis was the goal of all search sessions). This is consistent with results from our other studies of search behaviour [2].

The alphabetical search option generally prompted a "hub-and-spoke navigation", as shown in the third example from the left of Fig. 4 (support 6.4%). In contrast, localization search generally proceeded in a linear or depth-first fashion, as shown on the right of Fig. 4 (support 5%; with one diagnosis less: 6.9%).

This may be interpreted as follows: Localization search prompts the user to specify, on a clickable map, the body parts that contain the sought disease. This is in itself a search that can be refined (LOKAL1 – LOKAL2 in the figure; a similar pattern of LOKAL1 – LOKAL2 – LOKAL3, followed by 2 diagnoses, had a support of 5.1%). This narrowing-down of the medical problem by an aspect of its surface symptoms (localization on the body) helps the user to identify one approximately correct diagnosis and to find the correct one, or further ones, by retaining the focus on symptoms and finding further diagnoses by following the differential-diagnosis links. This means that, in particular, non-expert users can focus on surface features that have meaning in the domain (and acquire some medical knowledge in the process).

Alphabetical search, on the other hand, leads to lists of diseases that are not narrowed down by domain constraints, but only by their name starting with the same letter. Navigation choices may be wrong due to a mistaken memory of the disease's name. This requires backtracking to the list-of-diseases page and the choice of a similarly-named diagnosis.

This interpretation is supported by findings from a study of navigation in the same site in which participants specified whether they were physicians or patients. Content search was preferred by patients most often, whereas physicians used alphabetical search or the search engine more often.

Investigation of the most frequent individual patterns in localization-based search showed that the combination of the human head as the location of the illness and a diagnosis which is visually prominently placed on the result page was most frequent. This possible effect of layout on navigation is one example of patterns that can be detected by AP-IP but would have gone unnoticed otherwise because the individual pattern itself was below the statistical threshold.

Note that the temporal interpretation, as shown in the top-to-bottom ordering in Fig. 4, is justified by site topology (one cannot go from a diagnosis to coarser and coarser search options, unless by backtracking). The temporal interpretation of "backtracking and going to another diagnosis" is likewise justified by site topology, but the left-to-right ordering of diagnoses in the figure is of course arbitrary.

Note also that there may be an arbitrary number of steps between the transition from the ALPH node in Fig. 4 to the first diagnosis and the transition from the ALPH node to the second (or third) diagnosis. This intentional abstraction from time serves to better underline the "hub" nature of the ALPH page as a conceptual centre of navigation.

7. CONCLUSIONS AND OUTLOOK

We have introduced a new mining problem, the AP-IP mining problem, and the AP-IP algorithm that solves it. AP-IP uses a taxonomy and searches for frequent patterns at an abstract level, but also returns the individual subgraphs constituting them. This is motivated by the context-induced interestingness of these individual subgraphs: While their frequency generally lies far below a chosen support threshold, they are interesting as instantiations of the frequent abstract patterns. AP-IP can mine subgraphs at the item level and at the concept level; the latter is often more adequate to find patterns of semantic recurrence.

One important open question is how to extend the fixed URL-concept mapping to the flexibility provided by the taxonomy-including algorithms of [23, 24, 11] without an explosion in the search space and unclear semantics for the mining in concept space. The experience with our data suggests that the standard mechanisms would not be sufficient to solve this problem (in fact, the problem of "overgeneralized" patterns [11] never occurred in our data). Relying on the common statistical measures of interestingness eliminates nearly all AP-frequent individual patterns and therefore limits the expressive power of AP-IP significantly. On the other hand, many of these IPs are very rare, and their high number is a major limitation for the present algorithm. Adequate measures of pattern interestingness need to be defined in order to address this problem. We expect that depending on domain and analysis question, these methods may involve limiting the class of patterns sought, more user interaction for the determination of interesting abstract and individual patterns, or heuristics such as parallel but different support thresholds.

Another research direction concerns the structure of the patterns found. Frequent subgraph mining becomes worthwhile if rings and other cyclic patterns are frequent in the data. If most frequent patterns are chains or trees (as was the case in the present data), sequence and tree mining may be a more efficient choice for analysis. This is exploited in [22], which searches for patterns of increasing complexity and can significantly speed up mining. Corresponding extensions of AP-IP will be investigated in future work.

An inspection of single sessions in item and concept space suggests yet a different approach. Rings of diagnoses in our log often contained "other" contents such as navigation/search pages in between. Thus, wildcard options as in [9] would be very useful for identifying patterns.

Last but not least, interesting questions arise with the increasing dynamics of the Web: when content changes, it suffices to extend the URL-concept mapping, but what is to be done when semantics evolve? One approach would be to use ontology mapping techniques to make graph patterns comparable and to store more and more abstracted representations of patterns as they move further into the past.

8. REFERENCES

[1] Agrawal, R., Imielinski, T., & Swami, A. N. (1993). Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD Int. Conf. Mgt. Data (pp. 207–216).

[2] Berendt, B. (2002). Using site semantics to analyze, visualize and support navigation. Data Mining and Knowledge Discovery, 6(1):37–59.

[3] Berendt, B., Hotho, A., & Stumme, G. (2004). Usage mining for and on the semantic web. In H. Kargupta et al. (Eds.), Data Mining: Next Generation Challenges and Future Directions. Menlo Park, CA: AAAI/MIT Press.

[4] Berendt, B., & Spiliopoulou, M. (2000). Analysis of navigation behaviour in web sites integrating multiple information systems. The VLDB Journal, 9(1):56–75.

[5] Borgelt, C., & Berthold, M. R. (2002). Mining molecular fragments: Finding relevant substructures of molecules. In Proc. ICDM (pp. 51–58).

[6] Dai, H., & Mobasher, B. (2001). Using ontologies to discover domain-level web usage profiles. In Proceedings of the Second Semantic Web Mining Workshop at PKDD 2001.

[7] Deshpande, M., Kuramochi, M., & Karypis, G. (2002). Automated approaches for classifying structures. In Proc. BIOKDD'02 (pp. 11–18).

[8] Eirinaki, M., Vazirgiannis, M., & Varlamis, I. (2003). Sewep: Using site semantics and a taxonomy to enhance the web personalization process. In KDD2003 (pp. 99–108).

[9] Hofer, H., Borgelt, C., & Berthold, M. R. (2003). Large scale mining of molecular fragments with wildcards. In Advances in Intelligent Data Analysis V (pp. 380–389). LNCS.

[10] Huan, J., Wang, W., & Prins, J. (2003). Efficient mining of frequent subgraphs in the presence of isomorphisms. In Proc. ICDM (pp. 549–552).

[11] Inokuchi, A. (2004). Mining generalized substructures from a set of labeled graphs. In Proc. ICDM'04 (pp. 414–418).

[12] Inokuchi, A., Washio, T., & Motoda, H. (2000). An apriori-based algorithm for mining frequent substructures from graph data. In Proc. PKDD 2000 (pp. 13–23).

[13] Inokuchi, A., Washio, T., Nishimura, K., & Motoda, H. (2002). A fast algorithm for mining frequent connected subgraphs. Research Report, IBM Research, Tokyo.

[14] Jin, X., Zhou, Y., & Mobasher, B. (2004). A unified approach to personalization based on probabilistic latent semantic models of web usage and content. In [21].

[15] Kralisch, A., & Berendt, B. (2004). Cultural determinants of search behaviour on websites. In Proc. IWIPS 2004 (pp. 61–74).

[16] Kralisch, A., Eisend, M., & Berendt, B. (2005). Impact of culture on website navigation behaviour. In Proc. HCI-International 2005.

[17] Kuramochi, M., & Karypis, G. (2001). Frequent subgraph discovery. In Proc. ICDM (pp. 313–320).

[18] McEneaney, J. E. (2001). Graphic and numerical methods to assess navigation in hypertext. Int. J. of Human-Computer Studies, 55, 761–786.

[19] Meinl, Th., Borgelt, Ch., & Berthold, M. R. (2004). Mining fragments with fuzzy chains in molecular databases. In Proc. Workshop W7 on Mining Graphs, Trees and Sequences at ECML/PKDD 2004 (pp. 49–60).

[20] Meo, R., Lanzi, P. L., Matera, M., & Esposito, R. (2004). Integrating web conceptual modeling and web usage mining. In Proc. of the WebKDD Workshop on Web Mining and Web Usage Analysis (pp. 105–115).

[21] Mobasher, B., Anand, S. S., Berendt, B., & Hotho, A. (Eds.) (2004). Semantic Web Personalization. Papers from the AAAI Workshop. Technical Report WS-04-09. Menlo Park, CA: AAAI Press.

[22] Nijssen, S., & Kok, J. N. (2004). A quickstart in frequent structure mining can make a difference. In Proc. SIGKDD'04 (pp. 647–652).

[23] Srikant, R., & Agrawal, R. (1995). Mining generalized association rules. In Proceedings of the 21st International Conference on Very Large Databases (pp. 407–419).

[24] Srikant, R., & Agrawal, R. (1996). Mining sequential patterns: Generalizations and performance improvements. In Proc. EDBT (pp. 3–17).

[25] Tauscher, L., & Greenberg, S. (1997). Revisitation patterns in World Wide Web navigation. In Proc. CHI'97.

[26] Yan, X., & Han, J. (2002a). gSpan: Graph-based substructure pattern mining. In Proc. ICDM (pp. 51–58).

[27] Yan, X., & Han, J. (2002b). gSpan: Graph-based substructure pattern mining. Technical Report UIUCDCS-R-2002-2296, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign.

[28] Zaïane, O. R., Xin, M., & Han, J. (1998). Discovering web access patterns and trends by applying OLAP and data mining technology on web logs. In Proc. ADL'98 (pp. 19–29).

[29] Zaki, M. J. (2002). Efficiently mining trees in a forest. In Proceedings of SIGKDD'02.

[30] Zaki, M. J., Parthasarathy, S., Ogihara, M., & Li, W. (1997). New algorithms for fast discovery of association rules. In Proc. 3rd Intl. Conf. KDD (pp. 283–296). AAAI Press.
