Mining the Structure of User Activity using Cluster Stability

Jeffrey Heer, Ed H. Chi

Palo Alto Research Center, Inc.

2002.04.13


SIAM Web Analytics Workshop

Motivation


Want to understand the composition of
web user traffic.


What are users’ information goals?


Leads to improved site design, content, and
performance



Strategy: Content, Usage, and Topology

User Session Clustering


Cluster user sessions into common activities
such as product browsing and job seeking.


A number of approaches have been
proposed
([Shahabi97], [Fu99], [Banerjee01], and [Heer01])


These require specifying the number of
clusters in advance or browsing a large
cluster hierarchy.




Can we automatically infer the structure of
user activity?

Overview


System Description


Clustering Method


Stability Analysis


Case Studies


Discussion

System Description


Use web access logs and web site content to
generate a user profile for each site visitor.


How: Build a multi-featured vector space model of
user activity (multi-modal clustering).



Group user profiles into common activities
like “product browsing” and “job seeking”


How: Apply clustering algorithms to user profiles

System Description

[Pipeline diagram: Web Crawl, Access Logs, Document Model, User Sessions, User Profiles, Clustered Profiles]

1. Process Access Logs
2. Crawl Web Site
3. Build Document Model
4. Extract User Sessions
5. Build User Profiles
6. Cluster Profiles

Document Model


The web site is crawled, and the relevant pages
listed in the web logs are retrieved.


Retrieved data is represented as feature
vectors:

Content: TF.IDF weighted keyword vector
URL: Tokenized and TF.IDF weighted
Inlinks: Column vectors in topology matrix
Outlinks: Row vectors in topology matrix

These are concatenated to form a single
multi-modal vector P_d for each document.
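
As a rough illustration of how the four modalities could be concatenated into one vector per document, here is a minimal sketch. The function names, the whitespace and slash tokenizers, the tf * log(N/df) weighting variant, and the convention that links[i][j] == 1 means "document i links to document j" are all assumptions made for the example, not details taken from the talk.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Return one TF.IDF vector per document over a shared, sorted vocabulary.

    Uses tf * log(N / df), one common TF.IDF variant (an assumption here).
    """
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vectors


def document_model(contents, urls, links):
    """Concatenate content, URL, inlink and outlink features into P_d.

    links is a square 0/1 topology matrix; links[i][j] == 1 is assumed to
    mean that document i links to document j.
    """
    content_vecs = tfidf_vectors([c.lower().split() for c in contents])
    url_vecs = tfidf_vectors([u.strip("/").split("/") for u in urls])
    n = len(links)
    model = []
    for d in range(n):
        inlinks = [links[i][d] for i in range(n)]  # column d of the matrix
        outlinks = list(links[d])                  # row d of the matrix
        model.append(content_vecs[d] + url_vecs[d] + inlinks + outlinks)
    return model
```

Keeping the modalities in a fixed order makes it possible to slice the vector back into its parts later, which a per-modality similarity measure needs.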




User Sessions


Sessions are extracted from web logs,
and represented by an attribute vector


For path i = A → B → D, s_i = <1,1,0,1,0>
(For site with 5 documents <A,B,C,D,E>)

Experimented with various weightings for s,
including viewing times and path position.

Viewing times achieved highest accuracy in
empirical studies.

A (10s) → B (20s) → D (15s), s_i = <10,20,0,15,0>
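
The two session encodings above (binary visits and viewing-time weights) can be sketched as follows. The function name and the choice to sum the weights of repeat visits to the same page are assumptions for illustration; the talk does not say how repeat visits were handled.

```python
def session_vector(path, site_documents, weights=None):
    """Build the attribute vector s_i for one session.

    path:           ordered list of visited document ids
    site_documents: all document ids of the site, in vector order
    weights:        optional per-visit weights such as viewing times in
                    seconds; when omitted, a visited page is marked with 1
    """
    index = {doc: pos for pos, doc in enumerate(site_documents)}
    vector = [0] * len(site_documents)
    if weights is None:
        for doc in path:
            vector[index[doc]] = 1
    else:
        for doc, weight in zip(path, weights):
            # Assumption: weights of repeat visits to a page accumulate.
            vector[index[doc]] += weight
    return vector


site = ["A", "B", "C", "D", "E"]
binary = session_vector(["A", "B", "D"], site)
timed = session_vector(["A", "B", "D"], site, weights=[10, 20, 15])
```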


User Profiles



User profiles are created by linearly
combining the document and session models:




$$UP_i = \sum_{d=1}^{N} s_{id} P_d$$
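
The linear combination of a session vector with the document vectors amounts to a weighted sum of the vectors of the visited pages. A minimal sketch, with a hypothetical function name and toy two-dimensional document vectors chosen only for the example:

```python
def user_profile(session, document_model):
    """Weighted sum of document vectors: each P_d is scaled by s_id."""
    dims = len(document_model[0])
    profile = [0.0] * dims
    for s_id, p_d in zip(session, document_model):
        if s_id:  # unvisited documents contribute nothing
            for j in range(dims):
                profile[j] += s_id * p_d[j]
    return profile


# Toy document vectors for a 5-document site (illustrative values only).
docs = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 0]]
profile = user_profile([10, 20, 0, 15, 0], docs)
```

With viewing-time weights, pages the user dwelt on longer pull the profile more strongly toward their content.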

Clustering


Similarity Metric is a weighted cosine
measure:

$$d(UP_i, UP_j) = \sum_{m \in \mathrm{Modalities}} w_m \cos(UP_i^m, UP_j^m), \qquad \sum_m w_m = 1$$

Clustering is then done by recursive bisection, using
K-Means to perform the bisections [Karypis00,
Zhao01]. The corresponding criterion function is:

$$I_2 = \sum_{r=1}^{k} \sum_{UP_i \in S_r} d(UP_i, C_r)$$
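
A small sketch of the weighted cosine measure and of the criterion value for a given set of clusters may help make the notation concrete. It assumes each profile is stored as a dict from modality name to vector, normalizes the weights so that they sum to 1, and treats each cluster's representative as the per-modality mean of its members; these representation choices and all names are assumptions. The recursive K-Means bisection itself is not implemented here.

```python
import math


def cosine(u, v):
    """Plain cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_similarity(up_i, up_j, weights):
    """Weighted sum of per-modality cosines, with weights scaled to sum to 1."""
    total = sum(weights.values())
    return sum((w / total) * cosine(up_i[m], up_j[m])
               for m, w in weights.items())


def centroid(members):
    """Per-modality mean of a non-empty list of profiles (assumed form)."""
    return {m: [sum(col) / len(members)
                for col in zip(*(p[m] for p in members))]
            for m in members[0]}


def criterion(clusters, weights):
    """Sum of each profile's similarity to its own cluster's centroid."""
    return sum(weighted_similarity(p, centroid(c), weights)
               for c in clusters for p in c)
```

A bisection step would try candidate two-way splits of a cluster and keep the one with the highest criterion value.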

[Figure: clustering results view showing the user population breakdown, detailed stats, keywords describing user groups, and frequent documents accessed by group]

Clustering Evaluation


Ran user study on www.xerox.com to evaluate
effectiveness of method [Heer02].


15 tasks, 5 task categories (104 user traces)


Using certain modalities and weighting
schemes we were able to achieve accuracies
as high as 99%!


Found that page content and
page viewing time significantly
contribute to clustering
accuracy.


OK, Great, but…


In real-world applications the number of
clusters is an undetermined variable.


Want a method for automatically choosing the
number of clusters.


After review of literature, decided to apply a
cluster stability technique recently proposed
by [BenHur02].

Measuring Clustering Similarity


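For a given clustering of a data set X, define

$$C_{ij} = \begin{cases} 1 & \text{if } x_i, x_j \text{ are in the same cluster and } i \neq j \\ 0 & \text{otherwise} \end{cases}$$

Two clusterings can then be compared using
a dot product:

$$\langle C^{(1)}, C^{(2)} \rangle = \sum_{i,j} C^{(1)}_{ij} C^{(2)}_{ij}$$

This dot product can be normalized to get a
cosine metric:

$$cor(C^{(1)}, C^{(2)}) = \frac{\langle C^{(1)}, C^{(2)} \rangle}{\sqrt{\langle C^{(1)}, C^{(1)} \rangle \, \langle C^{(2)}, C^{(2)} \rangle}}$$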
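
The co-membership comparison can be computed directly from two label lists over the same points, without materializing the matrices. This is a minimal sketch with hypothetical function names; it uses a quadratic loop for clarity rather than speed.

```python
import math


def comembership_dot(labels_a, labels_b):
    """Count ordered pairs (i, j), i != j, that share a cluster in both."""
    n = len(labels_a)
    return sum(1
               for i in range(n)
               for j in range(n)
               if i != j
               and labels_a[i] == labels_a[j]
               and labels_b[i] == labels_b[j])


def clustering_similarity(labels_a, labels_b):
    """Cosine-normalized dot product of the two co-membership matrices."""
    norm = math.sqrt(comembership_dot(labels_a, labels_a)
                     * comembership_dot(labels_b, labels_b))
    return comembership_dot(labels_a, labels_b) / norm if norm else 0.0
```

Because only co-membership is compared, the measure is unaffected by how the clusters happen to be numbered: [0, 0, 1, 1] and [1, 1, 0, 0] describe the same partition and score 1.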

Cluster Stability


for k = 2 to kmax
    for i = 1 to n
        S_i = Subsample of data set X using sampling ratio f
        C_i = cluster(S_i, k)
    Perform pairwise comparisons of all C_i, generating
    a distribution of similarity values for the current k

Analyze the resulting distributions to
determine the most stable clusterings.
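
The subsampling loop can be sketched in a few lines. This version takes the clustering routine as a parameter, compares each pair of subsample clusterings only on the points both subsamples contain (an assumption; the slide does not spell this out), and uses a hypothetical one-dimensional "widest gap" clusterer purely as a deterministic stand-in for the real method.

```python
import math
import random
from itertools import combinations


def similarity(labels_a, labels_b):
    """Cosine of co-membership matrices, computed from label lists."""
    def dot(x, y):
        n = len(x)
        return sum(1 for i in range(n) for j in range(n)
                   if i != j and x[i] == x[j] and y[i] == y[j])
    norm = math.sqrt(dot(labels_a, labels_a) * dot(labels_b, labels_b))
    return dot(labels_a, labels_b) / norm if norm else 0.0


def stability_distributions(data, cluster_fn, k_values, n=15, f=0.8, seed=0):
    """For each k, return the sorted pairwise similarities of n subsample runs."""
    rng = random.Random(seed)
    size = max(2, int(f * len(data)))
    results = {}
    for k in k_values:
        runs = []
        for _ in range(n):
            idx = sorted(rng.sample(range(len(data)), size))
            labels = cluster_fn([data[i] for i in idx], k)
            runs.append(dict(zip(idx, labels)))  # point index -> cluster label
        sims = []
        for a, b in combinations(runs, 2):
            shared = sorted(a.keys() & b.keys())  # points present in both runs
            sims.append(similarity([a[i] for i in shared],
                                   [b[i] for i in shared]))
        results[k] = sorted(sims)
    return results


def gap_cluster(points, k):
    """Hypothetical stand-in clusterer for 1-D data: cut at the k-1 widest gaps."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    cuts = set(sorted(range(1, len(order)),
                      key=lambda g: points[order[g]] - points[order[g - 1]],
                      reverse=True)[:k - 1])
    labels, current = [0] * len(points), 0
    for rank, i in enumerate(order):
        if rank in cuts:
            current += 1
        labels[i] = current
    return labels


data = [0, 1, 2, 3, 4, 100, 101, 102, 103, 104]
dist = stability_distributions(data, gap_cluster, k_values=[2, 3], n=15, f=0.8)
```

On this toy data the two groups are far apart, so every subsample run at k = 2 recovers the same split and all pairwise similarities equal 1. Passing the clusterer in as a function keeps the stability loop independent of the clustering method being evaluated.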

Example

Stability Analysis


Example using 4 Gaussians [BenHur02]

Graph on right shows plot of the cumulative
similarity distribution
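
A cumulative similarity distribution of this kind is an empirical CDF over the pairwise similarity values collected for one k. A minimal sketch, with a hypothetical function name and made-up similarity values:

```python
def cumulative_distribution(similarities, thresholds):
    """Fraction of similarity values less than or equal to each threshold."""
    n = len(similarities)
    return [sum(1 for s in similarities if s <= t) / n for t in thresholds]


# Illustrative values only, not results from the talk.
curve = cumulative_distribution([0.6, 0.8, 0.9, 1.0], [0.5, 0.8, 1.0])
```

For a stable k most similarity values sit close to 1, so the curve stays low across most of the axis and rises sharply at the right edge; a less stable k produces a curve that climbs earlier.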

Case Study 1


www.xerox.com

[Plot: cumulative distribution of clustering similarity; axes: similarity (0.5 to 1) and cumulative (0 to 1)]

User Study 8/2001; 104 sessions

n = 15, f = 0.8, k = 2 to 10

Case Study 2


guir.berkeley.edu

[Plot: cumulative distribution of clustering similarity; axes: similarity (0.5 to 1) and cumulative (0 to 1)]

Nov. 1-16, 2001; 7700 sessions

n = 30, f = 0.8, k = 2 to 15

Case Study 2


guir.berkeley.edu

[Plot: cumulative distribution of clustering similarity, restricted to k = 3 to 7]

n = 30, f = 0.8, k = 3 to 7

Cluster Contents (guir, k=5)

Cluster 1: DENIM Web Design Tool

Cluster 2: Research projects & publications

Cluster 3: Quiz-Bowl Competition Site

Cluster 4: CSCW (1 project + 1 course)

Cluster 5: Random pubs + project JavaDoc



At higher values of k, more concentrated clusters
appear

Personal pages (faculty, students) cluster emerges

JavaDoc separates into its own cluster

Discussion

Stability method shows some utility, but results
are far from conclusive… perhaps web data
is not particularly structured?


User Goals


Does the user have a specific goal?


Web Site Structure


Does the web site support user goals?


Task Structure


Level of generality

Possible Cases


User has task - Site supports task
    www.xerox.com study

User has task - Site doesn’t support it

User w/o singular goals - Well designed site
    Possibly guir.berkeley.edu

User w/o task - Poorly designed site

The Future…


More actionable empirical data


Need more users over a range of sites


Larger user study already begun


Alternative approaches


Human supervision


Augmented stability metric / criterion function


Other clustering methods
    Fuzzy Clustering

Questions?


Suggestions?