Web Personalization Integrating Content Semantics and Navigational Patterns

Magdalini Eirinaki, Charalampos Lampos, Stratos Paulakis, Michalis Vazirgiannis
Athens University of Economics and Business
Department of Informatics
Patision 76, Athens, 10434, GREECE
(30210) 8203513
{eirinaki, lampos, paulakis, mvazirg}@aueb.gr

ABSTRACT

The amounts of information residing on web sites make users’
navigation a hard task. To address this problem, web sites provide
recommendations to the end users, based on similar users’
navigational patterns mined from past visits. In this paper we
introduce a recommendation method, which integrates usage data
recorded in web logs, and the conceptual relationships between
web documents. In the proposed framework, the usage-oriented
URI representation of web pages and users’ behavior is
augmented with content-based semantics expressed using domain-
ontology terms. Since the number of multilingual web sites is
constantly increasing, we also propose an automatic method for
uniformly characterizing a web site’s documents using a common
vocabulary. Both methods are integrated in the semantic web
personalization system SEWeP.
Categories
H.2.8 [Database Management]: Database Applications - Data
Mining; H.3.5 [Information Storage and Retrieval]: Information
Services - Web-based services
General Terms
Algorithms, Experimentation
Keywords

Semantic Web Personalization, Semantic Web Mining, Web
Content Semantics, Concept Hierarchies
1. INTRODUCTION
Web personalization is defined as any action that adapts the
information or services provided by a web site to the needs of a
user. This is performed by exploiting the knowledge gained from
the users’ navigational behavior and individual interests, as it is
recorded in the web usage logs, in combination with the content
and/or the structure of the web site. A personalized web site may
include new index pages, provide personalized web search results,
or dynamically create recommendations (i.e. page links) to the
users. In this work, we focus on the latter case, that of producing a
set of recommendations that may be of interest to each individual
user (or group of users) in order to assist her navigation.
Several web usage mining techniques have been proposed in the
past to address this issue, by exploiting the usage data residing on
the web logs. As noted in [10], however, usage-based
personalization alone can be problematic when, for example, there
is not enough usage data in order to extract patterns related to
certain categories, or when the site content changes and new
pages are added but are not yet included in the web logs.
Consider, for example, that we want to personalize a web portal
called “Sportal” that specializes in sports activities (we use the
imaginary site “www.sportal.com”). Let us assume that after
applying association rules mining to the web logs of “Sportal”, we
have discovered the following rule:
www.sportal.com/events/ski.html,
www.sportal.com/travel/ski_resorts.html →
www.sportal.com/equipment/ski_boots.html.
Therefore, following the “traditional” usage-based personalization
process, the next time a user navigates through “Sportal” and
visits the first two URIs, the personalized site will dynamically
recommend to him the right hand side (RHS) of the rule, based on
the previous users’ navigational behavior. Assume now that
Sportal’s content is updated and that a new offer for ski
equipment appears in the web page
/equipment/ski_boot_offers.html. Moreover, assume that Sportal
also hosts a service /weather/snowreport.html, about the snow
conditions in several ski resorts. Even though both pages are closely
related to the user’s interests and to what is currently recommended,
they are not included in the derived association rules. This might
happen because the first one is new and does not yet appear often in
the web logs, and the second one is not frequently visited by the
users, since no links exist from the ski pages to it. Finally,
consider the case when the user visits the web pages
www.sportal.com/sports/winter_sports/ski.html,
www.sportal.com/travel/winter/hotels.html.
This visit is semantically similar to the one that is contained in the
rule; the system, however, will not provide the same
recommendations to the user, since it will not identify this
similarity.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
WIDM’04, November 12-13, 2004, Washington, DC, USA.
Copyright 2004 ACM 1-58113-978-0/04/0011…$5.00.

In this paper we address these issues by proposing a
recommendation method that integrates content semantics and
navigational data. In short, our technique uses the terms of a
domain-ontology¹ in order to uniformly characterize both the
content and the users’ navigational patterns, and thus produce a
set of recommendations that are semantically related to the user’s
current visit. Recent studies [3,5,12] have proposed similar
techniques for the same problem, but our work addresses several
of the key issues that arise in practice, such as automatically
characterizing web pages with ontology objects, computing
similarity between domain objects, and “matching” user profiles
during the recommendation phase. Moreover, to the best of our
knowledge, our study is the first to deal with the important
problem of multilingualism, which arises when the content of a
web site appears in more than one language. The methods
proposed in this paper are integrated in the web personalization
system SEWeP [5]. More specifically, our key contributions in
this paper are:
• A novel recommendation method which integrates web
content semantics with users’ navigational behavior. The web
pages are first processed to extract keywords, which are in turn
converted to a set of domain-ontology terms (categories)
through a mapping mechanism. This uniform characterization
enables the categorization of the web pages into semantically
coherent clusters, as well as the semantic enhancement of the
web logs. These two enhanced sources of knowledge are then
used by the proposed method to produce recommendations to
the end user that are semantically relevant to his current
navigational behavior.
• A method for processing multilingual content. All web
documents, without regard to the language they are written in,
should be characterized by a set of terms belonging to a domain-
ontology. Therefore, an intermediate step is needed, where all
documents’ keywords are mapped to a common language. Such
an approach, however, is not straightforward, since the words
should be translated depending on the document’s content. We
propose an automated keyword translation method based on the
document’s context.
• A set of preliminary user-based experiments (blind tests)
which demonstrate the effectiveness of the proposed methods,
when integrated in a web personalization framework.
The remainder of the paper is organized as follows: Section 2
briefly presents SEWeP’s architecture, whereas Sections 3 and 4
describe in more detail the translation and recommendation
methods respectively. In Section 5 we present a set of experiments
that demonstrate the effectiveness of the proposed personalization
framework. Section 6 reviews related work. Finally, in Section 7
we conclude and present our plans for future work.
2. THE SEWeP PERSONALIZATION
SYSTEM
As mentioned earlier, the automated translation algorithm and our
novel recommendation method have been implemented as part of
the SEWeP [5] personalization project. In order to provide some
context for the presentation of our proposed methods, this Section
briefly describes the key components of the SEWeP architecture
and provides some basic definitions that will be used later. We
should stress, however, that our techniques are generic and can be
applied in other system architectures that follow a similar
approach to the web personalization process.

¹ We use the term domain-ontology to refer to a domain-specific
concept hierarchy (i.e. taxonomy). We will use these terms
interchangeably.

SEWeP is a web personalization system based on the semantic
characterization of the web content, and the enhancement of the
web usage logs with these content semantics. The functional
modules of SEWeP, depicted in Figure 1, are briefly described in
the subsequent paragraphs. Due to space restrictions, more details
can be found in [5] and the full version of this paper [6].

Figure 1. SEWeP functional architecture
Content characterization is performed using terms that belong to
a concept hierarchy (taxonomy). This process limits the
vocabulary used to a few words that are part of a hierarchical
structure. Such a representation enables the employment of
similarity measures that incorporate the hierarchical relationships
between the terms. The first step in the content characterization
process involves the keyword extraction from every document d
in the collection. A set of IR techniques is employed, such as
extracting keywords from an anchor-window of links to d
(inlinks), or from documents that are pointed to by d (outlinks), in
addition to the raw term frequency of d itself, based on the
assumption that such documents are descriptive of d’s content and
more objective. Prior to selecting the most frequent ones, all
non-English keywords are translated (we refer to this process in
Section 3). At the end of this phase, each document d is
characterized by a weighted set of keywords², where the weight
represents their frequency.
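As an illustration of the raw term frequency part of this step, the following is a minimal sketch only. The stopword list is a tiny hypothetical stand-in, stemming and the inlink/outlink sources are omitted, and normalizing the weights by the largest count is our assumption rather than something stated above.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list; a real system would use a full list
# and apply stemming (e.g. Porter) before counting.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}

def extract_keywords(text, top_n=7):
    """Return the top_n most frequent non-stopword terms with weights in (0, 1]."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    top = counts.most_common(top_n)
    if not top:
        return {}
    max_count = top[0][1]  # normalize by the most frequent term (assumption)
    return {term: count / max_count for term, count in top}

page = "Ski boots and ski resorts: the snow report for ski resorts in the Alps"
keywords = extract_keywords(page, top_n=3)
print(keywords)
```

The function name `extract_keywords` and the parameter `top_n` are illustrative; the value 7 mirrors the number of keywords per page used in the experiments of Section 5.1.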
In order to assist the remainder of the personalization process (C-logs
creation, semantic document clustering, semantic
recommendations), the n most frequent keywords that were
extracted in the previous phase are mapped to the categories
T = {c_1, …, c_k} of the taxonomy. The taxonomy is given as input to
the system and may be created by a domain-expert, or be a
publicly available concept hierarchy, such as the one provided by
the Open Directory Project³ (the process of defining the taxonomy
is orthogonal to our proposed methods and is outside the scope of
this paper). The mapping is performed using Wordnet [16] as a
thesaurus. The system finds the “closest” (i.e. most similar)
taxonomy term (category) to the keyword through the
mechanisms provided by the thesaurus and the use of a similarity
measure. Since the keywords carry weights according to their
frequency, the categories are also updated with weights, based on
the following formula:

\[ r_i = \frac{\sum_{k_j \to c_i} w_j \cdot s_j}{\sum_{k_j \to c_i} w_j} \]

where w_j is the weight assigned to keyword k_j for document D and
s_j the similarity with which k_j is mapped to c_i. As shown in
previous studies [11], this metric is effective in capturing the
relevance of a category in a given document. At the end of this
process, each document d is represented as a weighted set of the
form D = {(c_i, r_i)}, r_i ∈ [0,1]. The complete details of the content
characterization process can be found in [6].
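To make the weighting concrete, the sketch below computes category weights under the assumption that r_i is the w_j-weighted average of the similarities s_j over the keywords mapped to category c_i. The function name, the input layout and the numbers are hypothetical.

```python
from collections import defaultdict

def category_weights(mappings):
    """mappings: iterable of (w_j, s_j, category) triples, one per mapped keyword.

    Returns, for every category c_i, sum(w_j * s_j) / sum(w_j) over the
    keywords mapped to c_i (assumed reading of the category weight).
    """
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for w, s, category in mappings:
        numerator[category] += w * s
        denominator[category] += w
    return {c: numerator[c] / denominator[c] for c in numerator if denominator[c] > 0}

# Hypothetical keyword-to-category mappings for one document.
mappings = [
    (1.0, 0.9, "skiing"),   # keyword weight 1.0, mapped with similarity 0.9
    (0.5, 0.6, "skiing"),
    (0.4, 1.0, "travel"),
]
weights = category_weights(mappings)
print(weights)
```

Because every s_j lies in [0,1], a weighted average of them also lies in [0,1], which agrees with the stated range of r_i.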
After the content characterization process, all web documents are
semantically annotated with domain-ontology terms. These
content-bearing semantics are then used to categorize them into
semantic document clusters, by grouping together documents that
are characterized by semantically “close” terms, i.e. neighboring
categories in the taxonomy. The generated clusters capture
semantic relationships that may not be obvious at first sight, for
example documents that are not “structurally” close (i.e. under the
same path). After the document clustering, each cluster is labeled
by the categories of the documents it contains. Depending on the
application, the cluster might be labeled by all, or the highest
ranked categories that characterize it.
Additionally, each record in the web log is augmented with the
categories of the corresponding URI, thus leading to the creation
of C-logs (concept-logs). Data mining algorithms are then applied
to this enriched version of web logs, resulting in a set of
recommendations that include thematic categories, in addition to
recommendations including URIs. The recommendation set is
then expanded to contain similar documents, i.e. documents that
belong to the same semantic cluster. The recommendation process
is thoroughly presented in Section 4.
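The creation of C-logs can be pictured with the following minimal sketch. The record fields, the page annotations and the function name `to_c_log` are hypothetical; the actual log format used by the system is not specified here.

```python
# Hypothetical page annotations produced by the content characterization step.
PAGE_CATEGORIES = {
    "/events/ski.html": {"skiing": 0.9, "events": 0.6},
    "/travel/ski_resorts.html": {"skiing": 0.8, "travel": 0.7},
}

def to_c_log(log_records, page_categories):
    """Attach the categories of each requested URI to its log record."""
    c_log = []
    for record in log_records:
        categories = page_categories.get(record["uri"], {})
        # Pages without annotations simply get an empty category list.
        c_log.append({**record, "categories": sorted(categories)})
    return c_log

web_log = [
    {"ip": "10.0.0.1", "time": "2003-01-10T10:00:00", "uri": "/events/ski.html"},
    {"ip": "10.0.0.1", "time": "2003-01-10T10:02:00", "uri": "/travel/ski_resorts.html"},
    {"ip": "10.0.0.1", "time": "2003-01-10T10:05:00", "uri": "/unknown.html"},
]
c_log = to_c_log(web_log, PAGE_CATEGORIES)
for record in c_log:
    print(record["uri"], record["categories"])
```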

² A stopwords’ list and Porter stemming [13] are used to remove common
words and term suffixes respectively.
³ http://www.dmoz.org

As already mentioned, SEWeP exploits the expressive power of
content semantics, which are represented by ontology terms.
Using such a representation, the similarity between documents is
reduced to distance between terms that are part of a hierarchy. The
need for such a similarity measure is encountered throughout the
personalization process, namely content characterization, keyword
translation, document clustering, and recommendations’ creation.
In our approach, based on the findings of [11], we adopt the
Wu&Palmer similarity measure [17] for calculating the distance
between terms that belong to a tree. Intuitively, the Wu&Palmer
distance is proportional to the shortest path that connects two
terms of the ontology.
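For reference, a commonly cited form of the Wu&Palmer similarity for two terms of a tree is 2·depth(lca) / (depth(a) + depth(b)), where lca is the deepest common ancestor of the two terms and depth is counted from the root. The sketch below implements this form on a hypothetical toy taxonomy; the exact variant used in the system is the one given in the Appendix, which may differ in detail.

```python
# Hypothetical toy taxonomy, stored as child -> parent (the root has parent None).
PARENT = {
    "entertainment": None,
    "music": "entertainment",
    "cinema": "entertainment",
    "singer": "music",
    "pop": "music",
    "movie": "cinema",
    "director": "cinema",
}

def path_to_root(term):
    """Return [term, parent, ..., root]."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path

def depth(term):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(term))

def least_common_ancestor(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("terms are not in the same tree")

def wp_sim(a, b):
    """Wu & Palmer similarity: 2 * depth(lca) / (depth(a) + depth(b))."""
    lca = least_common_ancestor(a, b)
    return 2 * depth(lca) / (depth(a) + depth(b))

print(wp_sim("singer", "pop"))    # siblings under "music"
print(wp_sim("singer", "movie"))  # related only through the root
```

Terms that share a deep common ancestor score close to 1, while terms that meet only at the root score low, which is the behaviour the personalization process relies on.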
3. KEYWORD TRANSLATION
As already mentioned, the recommendation process is based on
the characterization of all web documents using a common
representation. Since many web sites contain content written in
more than one language, this raises the issue of mapping
keywords from different languages to the terms of a common
domain-ontology.
Consider, for example, the web site of a Computer Science
department, or of a research group in Greece. This site will
contain information addressed to the students, which will be
written in Greek, research papers, which will be written in
English, and course material, which will be written in both
languages. Since the outcome of the keyword extraction process is
a mixed set of English and Greek words, the translation of all
Greek keywords to English should be performed, prior to
selecting the most frequent ones. By using any dictionary, each
Greek word (after stemming and transforming to the nominative)
will be mapped to a set of English synonyms; the most
appropriate synonym, however, will depend heavily on the
context of the web page’s content. A naive approach would be to
keep all possible translations, or a subset, but this would result in
a high number of keywords and would lead to inaccurate results.
Another approach would be to keep the “first” translation given
by the dictionary, which is the most common one. The “first”
translation, however, is not always the best. For example, the
words “plan”, “schedule” and “program” are some of the
translations of the same Greek word (“πρόγραμμα”); in
the CS context, however, the word “program” is the one that should be
selected.
To address this important issue, we propose to determine the most
precise synonym based on the remaining keywords of the same
web page. The key intuition is that the set of keywords will be
descriptive of the web page’s content, and thus we will arrive at a
good set of synonyms by comparing their semantics. We note that
this context-sensitive automatic translation method could be
applied to any language, provided that a dictionary and its
inflection rules are available. In our system implementation we
applied it for the Greek language.
The translation method is depicted in Figure 2. The input is the set
of English and Greek keywords (En(D) and Gr(D) respectively) of
document D. The output is a set of English keywords K that
“best” characterize the web page. Let En(g) = {English
translations of g, g ∈ Gr(D)} and Sn(g) = {Wordnet senses of
keywords in En(g)}. For every translated word’s sense (as defined
by Wordnet), the algorithm computes the sum of the maximum
similarity between this sense and the senses of the remaining
keywords (let WPsim denote the Wu&Palmer distance between
two senses⁴). Finally, it selects the English translation that has the
sense with maximum score. Empirical results have shown that this
algorithm works well in practice and assigns the most relevant
synonym to each Greek word [8]. Again, the complete details can
be found in the full version of this paper [6].
Figure 2. The translation procedure
Our experiments have shown promising results for the proposed
approach, but several issues remain open. For instance, our
technique makes an implicit assumption of “one sense per
discourse”, i.e., that multiple appearances of the same word will
have the same meaning within a document. This assumption
might not hold in several cases, thus leading to erroneous
translations. Our technique constitutes a first step toward the
automated mapping of keywords to the terms of a common
taxonomy; clearly, more research is required in order to provide a
complete solution.
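The context-sensitive selection described in this section can be sketched as follows. This is an illustration under strong assumptions: a small hypothetical sense tree stands in for Wordnet, the bilingual dictionary and sense inventory are invented for the example, and each English word has a single sense.

```python
# Toy sense taxonomy (child -> parent) standing in for Wordnet; hypothetical.
PARENT = {
    "entity": None,
    "software": "entity",
    "code": "software",
    "computer_program": "software",
    "arrangement": "entity",
    "plan": "arrangement",
    "schedule": "arrangement",
}
# Hypothetical bilingual dictionary and sense inventory.
TRANSLATIONS = {"πρόγραμμα": ["plan", "schedule", "program"]}
SENSES = {
    "plan": ["plan"],
    "schedule": ["schedule"],
    "program": ["computer_program"],
    "code": ["code"],
    "software": ["software"],
}

def ancestors(term):
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path

def wp_sim(a, b):
    """Wu & Palmer similarity on the toy tree (root depth = 1)."""
    path_a, path_b = ancestors(a), ancestors(b)
    lca = next(node for node in path_a if node in path_b)
    return 2 * len(ancestors(lca)) / (len(path_a) + len(path_b))

def senses_of(word):
    """Senses of an English word, or of all translations of a Greek word."""
    if word in TRANSLATIONS:
        return [s for e in TRANSLATIONS[word] for s in SENSES[e]]
    return SENSES[word]

def translate_keywords(greek, english):
    """For each Greek keyword, pick the translation whose sense best fits the context."""
    chosen = []
    for g in greek:
        context = [w for w in english + greek if w != g]
        best_word, best_score = None, float("-inf")
        for e in TRANSLATIONS[g]:
            for s in SENSES[e]:
                # Sum, over the other keywords, of the best similarity to any of their senses.
                score = sum(max(wp_sim(s, t) for t in senses_of(w)) for w in context)
                if score > best_score:
                    best_word, best_score = e, score
        chosen.append(best_word)
    return chosen

result = translate_keywords(["πρόγραμμα"], ["code", "software"])
print(result)
```

With the context keywords “code” and “software”, the sense behind “program” accumulates the highest score, so it is preferred over “plan” and “schedule”.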
4. RECOMMENDATIONS FRAMEWORK
We now turn our attention to the problem of producing
recommendations for a user that navigates a web site. Overall, our
approach is to mine the enhanced web logs (C-logs) and compute
category-based rules and itemsets that capture common
navigational patterns (again, in terms of abstract categories and
not specific URIs). The user’s current navigational behavior is
matched against the mined rules, and a set of semantically related
URIs is recommended. In what follows, we first illustrate the
advantages of our approach with a hypothetical example, and then
discuss in more detail the proposed recommendation method.
4.1 Category-based Recommendations
Assume that a user navigates through a web portal, which serves
as a city guide (containing information on cinemas, movies,
concerts, etc.). Suppose that the user has already visited pages P =
{amusicnewspage.htm, asingerpage.htm, acinemapage.htm}⁵.
The “amusicnewspage.htm” page is characterized by the
taxonomy categories {music, pop}, the “asingerpage.htm” by the
categories {singer, music} and the “acinemapage.htm” is
characterized by the categories {movie, director, oscar}.

⁴ The Wu&Palmer metric is included in the Appendix.
⁵ We assume that all URIs fall under the same domain name, say for
example www.fooportal.com.

We consider three types of recommendations, namely, original,
semantic, and category-based. The system may use
recommendations from one of the aforementioned types, or a
combination of all three of them:
1. Original Recommendations. For each incoming user, a sliding
window of her past n visits is matched to the URI-based
association rules in the database, and the m most similar ones are
selected. The system recommends the URIs included in the rules,
but not visited by the user so far. Returning to our example,
assume that the 3 most similar association rules to P are: i)
amusicnewspage.htm & acinemapage.htm & asingerpage.htm →
aconcertspage.htm, ii) amusicnewspage.htm & asingerpage.htm
& acinemapage.htm → anothermusicpage.htm, and iii)
amusicnewspage.htm & acinemapage.htm →
athirdmusicpage.htm. These association rules are the top-ranked
ones probably because these are permanent pages in the portal
(i.e. do not change their content frequently) and thus many users
have visited them in the past. The system recommends the RHS of
these rules, namely, {aconcertspage.htm, anothermusicpage.htm,
athirdmusicpage.htm}.
2. Semantic Recommendations. For each incoming user, a sliding
window of her past n visits is matched to the URI-based
association rules in the database, and the single most similar one
is selected. The system finds the URIs included in the rule but not
yet visited by the user (let A denote this set) and recommends the m
most similar documents that are in the same semantic cluster as A.
example, based on P the most similar association rule is retrieved:
amusicnewspage.htm & acinemapage.htm & asingerpage.htm →
aconcertspage.htm. The URI not yet visited by the user is
“aconcertspage.htm”. This URI is under the same semantic
cluster (based on the categories that characterize it) with
“anewsoundtrackpage.htm” and “atourannouncementpage.htm”.
These pages are in the same context as the ones contained in the
rule, but may not exist in the rules database because they
were recently added to the portal’s content. The system
recommends both sets: {aconcertspage.htm,
anewsoundtrackpage.htm, atourannouncementpage.htm}.
3. Category-based Recommendations. For each incoming user, a
sliding window of her past n visits is matched to the category-
based association rules in the database, and the most similar is
selected. The system finds the most relevant document cluster
(using similarity between category terms) and recommends the
documents that are not yet visited by the user. Returning to our
example, the user’s navigational behavior is now expressed using
categories: music, singer, movie, oscar. Assume that the most
similar category-based rule is {music, cinema → singer,
soundtrack}. Based on this, the system will recommend URIs that
are contained in the cluster that is most relevant to {singer,
soundtrack}. The most similar semantic cluster is the one labeled
{singer, movie, cd}. The system either provides the categories
linked with relevant pages or the pages under this cluster:
{thesingernewcdpage.htm, asoundtrackpage.htm,
asingerpage.htm, etc.}.
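The first recommendation type can be illustrated with the example rules given above. The sketch assumes that “most similar” means the largest Jaccard overlap between the sliding window and the left-hand side of a rule; the text does not fix this measure, and the fourth rule and the function name are hypothetical.

```python
def original_recommendations(visits, rules, n=3, m=3):
    """Recommend unvisited RHS URIs of the m rules closest to the last n visits."""
    window = set(visits[-n:])

    def overlap(rule):
        lhs, _ = rule
        # Jaccard overlap between the window and the rule's LHS (assumed measure).
        return len(window & lhs) / len(window | lhs)

    ranked = sorted(rules, key=overlap, reverse=True)[:m]
    recommended = []
    for _, rhs in ranked:
        for uri in rhs:
            if uri not in visits and uri not in recommended:
                recommended.append(uri)
    return recommended

rules = [
    ({"amusicnewspage.htm", "acinemapage.htm", "asingerpage.htm"}, ["aconcertspage.htm"]),
    ({"amusicnewspage.htm", "asingerpage.htm", "acinemapage.htm"}, ["anothermusicpage.htm"]),
    ({"amusicnewspage.htm", "acinemapage.htm"}, ["athirdmusicpage.htm"]),
    ({"asportspage.htm"}, ["astadiumpage.htm"]),  # hypothetical unrelated rule
]
visits = ["amusicnewspage.htm", "asingerpage.htm", "acinemapage.htm"]
recommended = original_recommendations(visits, rules)
print(recommended)
```

The semantic and category-based types differ in what happens after the rule is selected: the former expands the result with documents of the same semantic cluster, the latter matches and recommends at the level of categories.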
The first option describes the “straightforward” way of producing
recommendations, simply relying on the usage data of a web site.
The second option is the one that we explored in our previous
work ([5]). More specifically, we proposed extracting semantic
recommendations (2nd option) in order to further enhance the
web personalization process. Those recommendations are in the
same format as the original ones but incorporate knowledge
derived from the content of the web site. In this work, we examine
the third approach, that of extracting category-based
recommendations. The intuition behind category-based
recommendations is the same as with semantic recommendations,
namely, incorporating content and usage data in the
recommendation process. The proposed method, however, takes
this notion one step forward: users’ navigational behavior is now
expressed in a more abstract, yet semantically meaningful way.

Listing for Figure 2 (the translation procedure of Section 3):
Procedure translateW(Gr, En)
  K ← Ø;
  for all g ∈ Gr(D) do
    for all s ∈ Sn(g) do
      score[s] = 0;
      for all w ∈ En(D) ∪ Gr(D) − {g} do
        sim = max(WPsim(s, Sn(w)));
        score[s] += sim;
      done
    done
    s_max = s such that score[s] = max(score[s′]), s′ ∈ Sn(g);
    K ← K ∪ {e}, e ∈ En(g), e contains s_max;
  done

The category-based approach offers several advantages compared
to previously proposed methods. Pattern matching to the current
user’s navigational behavior is no longer exact since it utilizes the
semantic relationships between the categories, as expressed by
their topology in the domain-specific taxonomy. In most
approaches (including collaborative filtering systems as well), the
current user’s behavior is usually expressed by a vector including
the URIs (or web objects) she has previously visited. This
“profile” vector is compared to the ones already stored in the
system’s knowledge base. The distance measures used in such
cases use exact matching techniques for calculating vector
similarity. Thus, a vector is “close” to another, if both have many
equal items. If the vectors have, however, similar but not
identical items, then the similarity between them is small. In our
approach, the current user’s profile is expressed by categories.
The same happens for the navigational patterns’ knowledge base.
Since these categories are terms belonging to the same taxonomy,
the distance between two vectors can be calculated using an
appropriate similarity measure, which also takes into account the
hierarchical structure and relations of the terms, thus producing a
more reliable result (for example, a page characterized by the
words “movie, music” is similar to one characterized by the
words “cinema, singer”).
An additional advantage of our proposed method is that the final
set of recommendations can be organized using meaningful
semantic descriptions of the content. The final set of
recommendations might be either the categories (with links to the
related documents), or the (not yet visited) documents in the
document clusters labeled by these categories. The category-based
recommendation engine is described in more detail in the
following subsection.
4.2 Recommendation Engine
As already mentioned, after the document characterization and
clustering processes have been completed, each document d is
represented by a set of weighted terms (categories) that are part of
the taxonomy: d = {(c_i, r_i)}, c_i ∈ T, r_i ∈ [0,1] (T is the taxonomy,
r_i is c_i’s weight). We now describe how this information is used in
combination with the information residing on C-Logs, in order to
produce a set of category-based association rules and itemsets
which are subsequently matched to each user’s navigational
pattern, expressed in ontology terms.
Navigational patterns. We use an adaptation of the apriori
algorithm [1] to discover frequent itemsets and/or association
rules including categories. We consider that each distinct user
session represents a different transaction. Instead of taking as
input the distinct URIs visited (as items of the transaction), we
replace them with the respective categories. We keep the most
important ones, based on their frequency (since the same category
may characterize more than one document). We then apply the
apriori algorithm using categories as items. We will use S = {I_k}
to denote the final set of frequent itemsets/association rules, where
I_k = {(c_i, w_i)}, c_i ∈ T, w_i ∈ [0,1] (w_i reflects the frequency of c_i).
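A minimal sketch of this step is given below, assuming unweighted itemsets and a plain level-wise (apriori-style) search; it is not the adapted algorithm itself. The page annotations, the sessions, the `top_k` cut-off and the support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical page annotations (URI -> categories).
PAGE_CATEGORIES = {
    "a.htm": ["music", "pop"],
    "b.htm": ["singer", "music"],
    "c.htm": ["movie"],
}

def session_to_transaction(session, top_k=4):
    """Replace the URIs of a session by their top_k most frequent categories."""
    counts = Counter(c for uri in session for c in PAGE_CATEGORIES.get(uri, []))
    return frozenset(c for c, _ in counts.most_common(top_k))

def frequent_itemsets(transactions, min_support):
    """Level-wise search for itemsets whose support is at least min_support."""
    n = len(transactions)
    frequent = {}
    candidates = {frozenset([item]) for t in transactions for item in t}
    while candidates:
        level = {}
        for cand in candidates:
            support = sum(cand <= t for t in transactions) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and prune those
        # that have an infrequent k-subset (the apriori property).
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == len(a) + 1 and all(
                frozenset(sub) in level for sub in combinations(union, len(a))
            ):
                candidates.add(union)
    return frequent

sessions = [["a.htm", "b.htm"], ["a.htm", "c.htm"], ["b.htm"], ["a.htm", "b.htm", "c.htm"]]
transactions = [session_to_transaction(s) for s in sessions]
frequent = frequent_itemsets(transactions, min_support=0.75)
for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```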
Recommendations. The recommendation method takes as input
the user’s current visit, expressed in weighted category terms: CV
= {(c_j, f_j)}, c_j ∈ T, f_j ∈ [0,1] (f_j is the normalized frequency of c_j
in the current visit). The method finds the itemset in S that is most similar
to CV, creates a generalization of it and recommends the
documents (labeled by related categories) belonging to the most
similar document cluster Cl_n ∈ Cl (Cl is the set of document
clusters). To find the similarity between categories we use the
Wu&Palmer metric [17], whereas in order to find similarity
between sets of categories, we use the THESUS metric [15]
(denoted as WPsim and THESIM respectively⁶). This procedure is
shown in Figure 3.
Figure 3. The recommendation method
The same procedure can be run by omitting the weights in one or
all the phases of the algorithm. On the other hand, in case weights
are used, an extension of the apriori algorithm, which incorporates
weights in the association rules mining process, such as [13], can
be used. Let us also stress that even though this description of the
method focuses on sets’ representation (derive frequent itemsets
and use them in the recommendation method), it also works (with
no further modification) using the association rules that can be
derived by those sets. If association rules are derived, then the
user’s activity is matched to the LHS of the rule (step 2), and
recommendations are produced using the RHS of the rule (step 7).
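The recommendation method can be sketched as follows, under several assumptions that should be read as ours: the taxonomy, itemsets and clusters are hypothetical toy data, `set_sim` is a simple symmetric best-match average used as a stand-in for the THESUS metric (it is not that metric), and when several categories generalize to the same ancestor the largest weight is kept.

```python
# Hypothetical toy taxonomy (child -> parent).
PARENT = {
    "portal": None,
    "entertainment": "portal", "sports": "portal",
    "music": "entertainment", "cinema": "entertainment",
    "singer": "music", "cd": "music",
    "movie": "cinema", "oscar": "cinema",
    "football": "sports", "stadium": "sports",
}

def ancestors(term):
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path

def lca(a, b):
    path_b = ancestors(b)
    return next(node for node in ancestors(a) if node in path_b)

def wp_sim(a, b):
    return 2 * len(ancestors(lca(a, b))) / (len(ancestors(a)) + len(ancestors(b)))

def set_sim(xs, ys):
    """Stand-in for THESIM: symmetric average of best-match WPsim values."""
    forward = sum(max(wp_sim(x, y) for y in ys) for x in xs) / len(xs)
    backward = sum(max(wp_sim(x, y) for x in xs) for y in ys) / len(ys)
    return (forward + backward) / 2

def category_rec(cv, itemsets, clusters, visited=()):
    # Step 1: itemset most similar to the current visit.
    best = max(itemsets, key=lambda itemset: set_sim(list(itemset), list(cv)))
    # Steps 2-6: generalize each visited category with its closest itemset category.
    ci = {}
    for c_j, f_j in cv.items():
        c_i = max(best, key=lambda c: wp_sim(c, c_j))
        c_n = lca(c_i, c_j)
        ci[c_n] = max(ci.get(c_n, 0.0), best[c_i], f_j)
    # Step 7: documents of the cluster whose label is closest to the generalization.
    label = max(clusters, key=lambda lab: set_sim(list(lab), list(ci)))
    return [d for d in clusters[label] if d not in visited]

itemsets = [{"music": 0.8, "cinema": 0.6}, {"football": 0.9, "stadium": 0.5}]
clusters = {
    ("singer", "movie", "cd"): ["thesingernewcdpage.htm", "asoundtrackpage.htm", "asingerpage.htm"],
    ("football", "stadium"): ["amatchpage.htm"],
}
cv = {"music": 1.0, "singer": 0.5, "movie": 0.5}
docs = category_rec(cv, itemsets, clusters, visited={"asingerpage.htm"})
print(docs)
```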

⁶ Both similarity measures are included in the Appendix.

Listing for Figure 3 (the recommendation method):
Procedure CategoryRec(CV)
1. I_k = argmax_{I ∈ S} THESIM(I, CV);
2. for all c_j ∈ CV do
3.   c_i = argmax_{c ∈ I_k} WPsim(c, c_j);
4.   c_n = least_common_ancestor(c_i, c_j), r_n = max(r_i, r_j);
5.   CI ← (c_n, r_n);
6. done
7. return D = {d}, {d} ∈ Cl_n, Cl_n = argmax_{Cl_n ∈ Cl} WPsim(Cl_n, CI);

5. EMPIRICAL EVALUATION
So far, we have described the framework for enhancing the
recommendation process through content semantics. In this
section, we present a preliminary experimental study, based on
blind testing with real users, in order to validate the effectiveness
of our approach. The results indicate that the effectiveness of each
recommendation set (namely Original, Semantic, Category),
depends on the context of the visit and the users’ interests. What
is evident, however, is that a hybrid model, incorporating all three
types of recommendations, produces the most effective results.
5.1 Methodology
Data Set. We chose the web logs of the DB-NET web site
(http://www.db-net.aueb.gr). This is the site of a research team,
which hosts various academic pages, such as course information,
research publications, as well as members’ home pages. The two
key advantages of using this data set are that the web site contains
web pages in several formats (such as pdf, html, ppt, doc, etc.),
written both in Greek and English, and that a domain-specific
taxonomy is available (the web administrator created a
concept-hierarchy of 150 categories that describe the site’s content; a
fraction of the taxonomy is included in Figure 4 in the Appendix).
On the other hand, its context is rather narrow, as opposed to web
portals, and its visitors are divided into two main groups: students
and researchers. Therefore, the subsequent analysis (e.g.
association rules) uncovers these trends: visits to course material,
or visits to publications and researcher details. The need for
processing online (up-to-date) content, however, made it
impossible for us to use other publicly available web log sets,
since all of them were collected many years ago and the relevant
sites’ content is no longer available.
To overcome these problems, we collected web logs over a 1-year
period (1/11/02 – 31/10/03). We created two different test sets,
one including all successful hits on the site, and one excluding
hits from within the University (to downgrade student hits). After
preprocessing, the total web logs’ size was approximately 10^5 hits,
including a set of over 67,500 distinct anonymous user sessions
on a total of 357 web pages. The sessionizing was performed
using distinct IP & time limit considerations (setting 20 minutes
as the maximum time between consecutive hits from the same
user).
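The sessionizing rule (same IP, at most 20 minutes between consecutive hits) can be sketched as below. The tuple layout of a hit and the function name are hypothetical, and other preprocessing steps such as filtering unsuccessful requests are omitted.

```python
from datetime import datetime, timedelta

def sessionize(hits, timeout=timedelta(minutes=20)):
    """Group (ip, timestamp, uri) hits into sessions, returned as lists of URIs.

    A new session starts when an IP is seen for the first time or when more
    than `timeout` has passed since its previous hit.
    """
    sessions = []
    last_seen = {}  # ip -> (time of last hit, index of its open session)
    for ip, ts, uri in sorted(hits, key=lambda h: h[1]):
        previous = last_seen.get(ip)
        if previous is None or ts - previous[0] > timeout:
            sessions.append([])
            index = len(sessions) - 1
        else:
            index = previous[1]
        sessions[index].append(uri)
        last_seen[ip] = (ts, index)
    return sessions

hits = [
    ("10.0.0.1", datetime(2003, 1, 10, 10, 0), "/a.html"),
    ("10.0.0.1", datetime(2003, 1, 10, 10, 10), "/b.html"),
    ("10.0.0.2", datetime(2003, 1, 10, 10, 12), "/c.html"),
    ("10.0.0.1", datetime(2003, 1, 10, 10, 45), "/d.html"),  # 35 minutes later
]
sessions = sessionize(hits)
print(sessions)
```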
Keyword Extraction – Category Mapping. We extracted up to 7
keywords from each web page using a combination of all three
methods (raw term frequency, inlinks, outlinks). We then mapped
these keywords to taxonomy categories and kept at most 5 for
each page.
Document Clustering. We used the clustering scheme described in
[15], i.e. DBSCAN clustering algorithm [7] and THESUS
similarity measure for sets of keywords. However, other web
document clustering schemes (algorithm & similarity measure)
may be employed as well.
Association Rules Mining. We created both URI-based and
category-based frequent itemsets and association rules. We
subsequently used the ones over a 40% confidence threshold.
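For clarity, the confidence of a rule LHS → RHS is the fraction of transactions containing the LHS that also contain the RHS. The sketch below filters rules by this value; the transactions and rules are hypothetical, and treating the threshold as inclusive is an assumption.

```python
def confidence(transactions, lhs, rhs):
    """support(LHS and RHS together) / support(LHS); 0.0 if the LHS never occurs."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    lhs_count = sum(lhs <= t for t in transactions)
    both_count = sum((lhs | rhs) <= t for t in transactions)
    return both_count / lhs_count if lhs_count else 0.0

def filter_rules(transactions, rules, min_confidence=0.4):
    """Keep the rules whose confidence reaches the threshold (inclusive, by assumption)."""
    return [(lhs, rhs) for lhs, rhs in rules
            if confidence(transactions, lhs, rhs) >= min_confidence]

# Hypothetical category transactions, one per user session.
transactions = [
    frozenset({"music", "singer"}),
    frozenset({"music", "movie"}),
    frozenset({"music", "singer", "cd"}),
    frozenset({"movie"}),
    frozenset({"music"}),
]
rules = [({"music"}, {"singer"}), ({"music"}, {"cd"})]
kept = filter_rules(transactions, rules)
print(kept)
```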
5.2 Experimental Results
In order to evaluate the effectiveness of the system’s outcome, we
created two different experiments that were presented to 17 blind
testers for evaluation.
For the first experiment, we chose three popular paths followed by
users in the past, each one with different “orientation”; one (A)
containing visits to contextually irrelevant pages (random surfer),
a second (B) including a small path to very specialized pages
(information seeking visitor), and a third one (C) including visits
to top-level, yet research-oriented pages (topic-oriented visitor).
We subsequently created three different sets of recommendations
termed Original, Semantic, and Category (the sets are named after
the respective recommendation methods – see also Section 4.1). We
presented the users with the paths and the three sets (unlabeled) in
random order and asked them to rate them as “indifferent” (i),
“useful” (u) or “very useful” (vu). The outcome is summarized in
Table 1 (each cell lists the percentage of users that gave the
particular rating for the corresponding path):
Table 1. First blind test

              Path A          Path B          Path C
Rec. Set    i    u   vu     i    u   vu     i    u   vu
Original   20   10   70    20   40   40    20   60   20
Semantic   20   70   10     0   60   40     0   30   70
Category    0   70   30    70   10   20    70   10   20

The results of the first experiment show that, depending on the
context and purpose of the visit, users profit from different
sources of recommendations. More specifically, in visit A, both
the Original and Category sets are useful, but Category is
slightly better since it is the one that recommends 3 “hub” pages,
which seems to be the best choice after a “random walk” on the
site. On the other hand, in visits B and C, Semantic performs
better. In visit B, the path was focused on specific pages, and the
same held for the preferred recommendations. In visit C, the
recommendations that were more relevant to the topics previously
visited were preferred.
In a second experiment, we evaluated the performance of a hybrid
method that incorporates all three types of recommendations. For
the same three paths, we created two sets of recommendations,
one based on Original, and a Hybrid one containing the top
recommended URIs from each of the three methods (Original,
Semantic, Category). We then asked the users to rank the two
recommendation sets in terms of their usefulness. The paths are
included in Appendix C. The outcome is shown in Table 2
(again, each cell lists the percentage of users who ranked the
corresponding set first for that path):
Table 2. Second blind test

Rec. Set \ Path     A      B      C
Original           33%    42%    50%
Hybrid             67%    58%    50%

The results of this latter experiment support our intuition that
users benefit from the semantic enhancement of the
recommendations. Again, this depends on each visit’s purpose,
but overall users rate the hybrid SEWeP outcome as equal to or
better than the usage-based one.
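The text states only that the Hybrid set contains the top recommended URIs from each of the three methods; the exact merging rule is not specified. One possible reading is sketched below, assuming each method returns a ranked list and the lists are walked rank by rank with duplicates skipped. The function name, the set size and the sample lists are hypothetical.

```python
def hybrid_recommendations(original, semantic, category, size=3):
    """Merge three ranked URI lists rank by rank, skipping duplicates.

    At each rank the lists are visited in the order Original, Semantic,
    Category, until `size` distinct URIs have been collected.
    """
    ranked_lists = [original, semantic, category]
    merged = []
    rank = 0
    while len(merged) < size and any(rank < len(lst) for lst in ranked_lists):
        for lst in ranked_lists:
            if rank < len(lst) and lst[rank] not in merged:
                merged.append(lst[rank])
                if len(merged) == size:
                    break
        rank += 1
    return merged


# Hypothetical ranked outputs of the three methods
original = ["/pubs.php", "/research.htm", "/courses/courses.htm"]
semantic = ["/pubs.php", "/pubsearch.php"]
category = ["/research.htm", "/people.htm"]

hybrid = hybrid_recommendations(original, semantic, category)
```

Other merging policies (for example, weighting the methods by the kind of visit) would fit the description equally well.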
6. RELATED WORK
Several research studies have focused on web usage mining and
web personalization. Since a more detailed description of related
research efforts is out of the scope of this paper, we refer the
reader to related surveys [4, 5] for an extensive overview of the
most important initiatives. In this paper we focus on systems that
combine usage and content knowledge in order to perform web
mining, (semi)automatically modify a web site, or provide
recommendations, utilizing taxonomies/ontologies.
Middleton et al. [9] explore the use of ontologies in the user
profiling process within recommender systems. Trying to address
the problem of recommending online academic research papers,
they represent the acquired user profiles in terms of a research
paper ontology (is-a hierarchy). Research papers are also classified
using ontological classes. The intuition behind the use of
ontologies is the same as ours, i.e. they use the ontological
relationships between topics of interest to infer other topics not
yet browsed and recommend them to the users. The proposed
hybrid recommender system, however, is based on collaborative
and content-based recommendation techniques, and not on web
usage mining for personalization.
Berendt et al. introduce “service based” concept hierarchies [2],
for analysing the search behaviour of visitors, i.e. “how they
navigate rather than what they retrieve”. Concept hierarchies are
the basic method of grouping web pages together. The site’s
semantics are exploited to semi-automatically generate interesting
queries for usage mining, and to create visualizations of the
resulting usage paths. This work focuses mainly on the creation of
navigational patterns rather than recommendations and the use of
conceptual hierarchies is, as already mentioned, service (and not
usage/content)-based.
The idea of enhancing usage mining by registering the user
behavior in terms of an ontology is described by Oberle et al.
[12]. This framework is based on a (semantic) web site built on an
underlying ontology. The web logs are semantically enriched with
ontology concepts. Data mining may then be performed on these
semantic web logs to extract knowledge about groups of users,
users’ preferences, and rules. This framework is similar to our
approach as far as the enrichment of the web logs is concerned.
Since this process is based on a semantic web knowledge portal,
however, the web content is semantically annotated exploiting the
portal’s inherent RDF annotations, and no further automation is
provided. Moreover, the proposed framework focuses solely on
web mining and thus does not perform any further processing in
order to support web personalization.
In our previous work [5], we introduced the web personalization
system SEWeP. In that system, the web site’s pages are
semantically characterized by categories (concepts), i.e. terms that
belong to a domain-specific taxonomy. Moreover, the web usage
logs are augmented with these semantics. This abstraction favors
the system’s functionality in several ways: document clustering,
web log mining and recommendation processes are no longer based
on exact keyword matching, but on semantic similarity (i.e.
similarity between terms of a concept hierarchy) instead. In
SEWeP, the recommendation set that would be proposed to the
user if traditional usage mining techniques were used is
expanded to contain “semantically similar” content (i.e. web
pages belonging to the same cluster). In that work, however, we
did not fully exploit the knowledge residing in the augmented
usage logs (C-logs), since the semantic “expansion” was
performed after deriving (standard) URI-based association rules
from the usage logs. Moreover, the mid-step of translating the
extracted non-English keywords was performed by a human.
Dai et al. [3] independently proposed a web personalization
framework that incorporates usage profiles and domain
ontologies. The usage profiles can be transformed to “domain-
level” aggregate profiles by representing each page with a set of
related ontology objects. Recommendations can then be generated
by matching the current user’s profile with them. Their work,
however, provides a general framework for personalization
systems. In this paper, we address several of the key issues in
using such a framework, such as characterizing web pages with
ontology objects, addressing multilingualism, computing
similarity between domain objects, and “matching” user profiles
during the recommendation phase.
7. CONCLUSIONS – FUTURE WORK
In this paper we introduce a recommendation method that extends
the notion of URI-based association rules/patterns by producing a
set of category-based rules and itemsets. This method addresses
the problems arising from solely usage-based personalization,
while also incorporating content semantics throughout the
web personalization process. We also present a method for a
context-based mapping of multilingual documents to a common
concept hierarchy. An initial experimental evaluation indicates the
effectiveness of the proposed approach. We plan to focus on the
production of category-based rules, incorporating other parameters
(such as frequency of occurrence), and to carry on with real-time
personalization experiments.
8. ACKNOWLEDGEMENTS
This research work was partially supported by the IST-2001-
32645/DBGLOBE R&D project funded by the European Union.
We would like to thank our blind testers Evrypides, Maria,
Yannis, Iraklis, Apostolos, Murat, Gabriel, Teresa, Christos,
Dimitris, Nikos, Michalis, Alkis, Pavlos, Effie, Stratis and Nikos.
We would also like to thank Alkis Polyzotis for his very useful
comments on earlier drafts of this paper.
9. REFERENCES
[1] R. Agrawal, R. Srikant, Fast Algorithms for Mining
Association Rules, in Proc. of 20th VLDB Conference, 1994
[2] B. Berendt, Understanding Web usage at different levels of
abstraction: coarsening and visualizing sequences, in Proc. of
WEBKDD’01 workshop, 2001
[3] H. Dai, B. Mobasher: Using Ontologies to Discover Domain-
Level Web Usage Profiles, in Proc. of the Intl. Conf. on
Internet Computing 2003 (IC’03), 2003
[4] M. Eirinaki, M. Vazirgiannis, Web Mining for Web
Personalization, in ACM Transactions on Internet
Technology (TOIT), Feb. 2003/ Vol.3, No.1, 1-27
[5] M. Eirinaki, M. Vazirgiannis, I. Varlamis, SEWeP: Using
Site Semantics and a Taxonomy to Enhance the Web
Personalization Process, in Proc. of the 9th SIGKDD Conf.,
2003
[6] M. Eirinaki, M. Vazirgiannis, Semantic Web
Personalization, available at http://www.db-net.aueb.gr
[7] M. Ester, H.P. Kriegel, J. Sander, M. Wimmer and X. Xu,
Incremental Clustering for Mining in a Data Warehousing
Environment, in Proc. of the 24th VLDB Conf., 1998
[8] C. Lampos, M. Eirinaki, D. Jevtuchova, M. Vazirgiannis,
Archiving the Greek Web, in Proc. of IWAW04
[9] S.E. Middleton, N.R. Shadbolt, D.C. De Roure, Ontological
User Profiling in Recommender Systems, ACM Transactions
on Information Systems (TOIS), Jan. 2004/ Vol.22, No. 1,
54-88
[10] B. Mobasher, H. Dai, T. Luo, Y. Sung, J. Zhu, Integrating
Web Usage and Content Mining for More Effective
Personalization, in Proc. of ECWeb, 2000
[11] B. Nguyen, M. Vazirgiannis, I. Varlamis, M. Halkidi.
"Organizing Web Documents into Thematic Subsets using an
Ontology", VLDB journal, vol.12, No.4, 320-332, Nov.
2003
[12] D. Oberle, B. Berendt, A. Hotho, J. Gonzalez, Conceptual User
Tracking, in Proc. of the 1st AWIC Conf., 2003
[13] M. Porter, An algorithm for suffix stripping, Program
(1980)/ Vol. 14, No. 3, 130-137
[14] F. Tao, F. Murtagh, M. Farid, Weighted Association Rule
Mining using Weighted Support and Significance
Framework, in Proc. of the 9th SIGKDD Conf, 2003
[15] I. Varlamis, M. Vazirgiannis, M. Halkidi, B. Nguyen.
THESUS: Effective Thematic Selection And Organization Of
Web Document Collections Based On Link Semantics, in
IEEE TKDE Journal, Vol.16, No.6, June 2004
[16] WordNet, A lexical database for the English language,
http://www.cogsci.princeton.edu/~wn/
[17] Z. Wu, M. Palmer: Verb Semantics and Lexical Selection,
32nd Annual Meetings of the Assoc. for Computational
Linguistics, 1994
APPENDIX A

Figure 4. Fraction of the DB-NET Web site taxonomy
APPENDIX B
Wu&Palmer similarity measure [17]
Given a tree, and two nodes a, b of this tree, find the deepest (in
terms of tree depth) common ancestor c. The similarity is
computed as follows:

\[
WPsim(a, b) = \frac{2 \cdot depth(c)}{depth(a) + depth(b)}
\]
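
A minimal sketch of the Wu & Palmer measure follows, assuming the tree is stored as a child-to-parent map and that depth is counted in nodes, so the root has depth 1 (the text does not fix this convention). The taxonomy and all names are hypothetical.

```python
def ancestors(node, parent):
    """Return the path from `node` up to the root, starting with `node`."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def depth(node, parent):
    """Depth counted in nodes: the root has depth 1 (an assumption)."""
    return len(ancestors(node, parent))


def wp_sim(a, b, parent):
    """Wu & Palmer similarity: 2 * depth(c) / (depth(a) + depth(b))."""
    anc_a = set(ancestors(a, parent))
    # Walking upward from b, the first node that is also an ancestor
    # of a is the deepest common ancestor c.
    c = next(n for n in ancestors(b, parent) if n in anc_a)
    return 2 * depth(c, parent) / (depth(a, parent) + depth(b, parent))


# Hypothetical taxonomy given as a child -> parent map
parent = {
    "research": "root",
    "teaching": "root",
    "databases": "research",
    "data mining": "research",
}

sibling_sim = wp_sim("databases", "data mining", parent)  # share "research"
distant_sim = wp_sim("databases", "teaching", parent)     # share only "root"
```

Under this depth convention a node compared with itself scores 1, and the score decreases as the common ancestor moves toward the root.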

THESUS similarity measure [15]
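Given an ontology T and two sets of weighted terms $A = \{(w_i, k_i)\}$
and $B = \{(v_i, h_i)\}$, with terms $k_i, h_i \in T$ and weights
$w_i, v_i$, their similarity is defined as:

\[
THEsim(A, B) = \frac{1}{2} \left[
  \frac{1}{K} \sum_{i=1}^{|A|} \max_{j \in [1,|B|]} \left( \lambda_{i,j} \times WPsim(k_i, h_j) \right)
  + \frac{1}{H} \sum_{i=1}^{|B|} \max_{j \in [1,|A|]} \left( \mu_{i,j} \times WPsim(h_i, k_j) \right)
\right]
\]

where

\[
\lambda_{i,j} = \frac{w_i + v_j}{2 \times \max(w_i, v_j)}
\]

and

\[
K = \sum_{i=1}^{|A|} \lambda_{i,x(i)}
\]

with

\[
x(i) = x \;\Big|\; \lambda_{i,x} \times WPsim(k_i, h_x) = \max_{j \in [1,|B|]} \left( \lambda_{i,j} \times WPsim(k_i, h_j) \right).
\]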
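
The definition above can be sketched in code as follows. The appendix defines only the λ and K terms explicitly; the sketch assumes that μ and H are their mirror images with the roles of A and B swapped, that all weights are positive, and that depth is counted in nodes (root depth 1). The taxonomy, the weights and all function names are hypothetical.

```python
def ancestors(node, parent):
    """Path from `node` up to the root, starting with `node`."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def wp_sim(a, b, parent):
    """Wu & Palmer similarity with the root at depth 1 (an assumption)."""
    path_a, path_b = ancestors(a, parent), ancestors(b, parent)
    in_a = set(path_a)
    c = next(n for n in path_b if n in in_a)  # deepest common ancestor
    return 2 * len(ancestors(c, parent)) / (len(path_a) + len(path_b))


def lam(w, v):
    """Weight factor (w + v) / (2 * max(w, v)); requires positive weights."""
    return (w + v) / (2 * max(w, v))


def directed(X, Y, parent):
    """One half of THEsim: match every term of X to its best term in Y."""
    total, norm = 0.0, 0.0
    for w, k in X:
        # best (score, lambda) pair over all terms of Y, i.e. j = x(i)
        score, best_lam = max(
            ((lam(w, v) * wp_sim(k, h, parent), lam(w, v)) for v, h in Y),
            key=lambda pair: pair[0],
        )
        total += score
        norm += best_lam  # accumulates K (or H in the mirrored direction)
    return total / norm


def the_sim(A, B, parent):
    """THEsim for two lists of (weight, term) pairs."""
    return 0.5 * (directed(A, B, parent) + directed(B, A, parent))


# Hypothetical taxonomy (child -> parent) and weighted term sets
parent = {"research": "root", "teaching": "root",
          "databases": "research", "data mining": "research"}
A = [(1.0, "databases"), (0.5, "teaching")]
B = [(1.0, "data mining")]

score = the_sim(A, B, parent)
```

When several terms of Y tie for the best score, this sketch takes the first one as x(i); the definition itself does not say how ties are resolved.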
APPENDIX C
This is the 2nd blind test that was given to the users. The two
options include a set of original recommendations (normal font)
and a set of hybrid recommendations (italics).
Visit A
http://www.db-net.aueb.gr/people.htm &
http://www.db-net.aueb.gr/links.htm &
http://www.db-net.aueb.gr/courses/courses.htm ->
1) http://www.db-net.aueb.gr/pubs.php
http://www.db-net.aueb.gr/research.htm
http://www.db-net.aueb.gr/courses/postgrdb/asilomar.html
2) http://www.db-net.aueb.gr/pubs.php
http://www.db-net.aueb.gr/pubsearch.php
http://www.db-net.aueb.gr/research.htm
Visit B
http://www.db-net.aueb.gr/people/michalis.htm &
http://www.db-net.aueb.gr/mhalk/CV_maria.htm ->
1) http://www.db-net.aueb.gr/mhalk/Publ_maria.htm
http://www.db-net.aueb.gr/research.htm
http://www.db-net.aueb.gr/magda/papers/webmining_survey.pdf
2) http://www.db-net.aueb.gr/mhalk/Publ_maria.htm
http://www.db-net.aueb.gr/papers/gr_book/Init_frame.htm
http://www.db-net.aueb.gr/papers/gr_book/Contents.htm
Visit C
http://www.db-net.aueb.gr/index.php &
http://www.db-net.aueb.gr/research.htm &
http://www.db-net.aueb.gr/people.htm ->
1) http://www.db-net.aueb.gr/projects.htm
http://www.db-net.aueb.gr/courses/courses.htm
http://www.db-net.aueb.gr/courses/courses.php?ancid=dm
2) http://www.db-net.aueb.gr/projects.htm
http://www.db-net.aueb.gr/courses/courses.htm
http://www.db-net.aueb.gr/courses/POSTGRDB/ballp.pdf