School of Computer and Information Science


2012

Minor Thesis

Analyzing the fragmentation of coselection data due to volatile search results

Nathan Ronald Williams


Abstract

Initial investigations have indicated that coselections are an effective way to cluster web pages under a shared meaning. The idea is that URLs coselected under a search term tend to be the result of the same objective by the user. Though there is some variance, the approach has been shown to be strongly effective at generating sense-singular results given a high enough threshold. While the clusters may be sense-singular, numerous clusters are frequently generated for the same sense. Approximately one sense-singular cluster per sense should be expected, and hence counting clusters would indicate ambiguity in search terms. However, in many cases search terms appear to be ambiguous because they have multiple clusters in the results, even though that should not be the case.

A key factor speculated is the effect of time on the top results, as they are subject to change. This could be causing temporal fragmentation of clusters, since there is only a certain window of opportunity for two URLs to both be high enough in the search results to be selected together: they have to be in the top N (usually 10) results to be coselected. By using the timestamp associated with the data, we aim to uncover the evolution of clusters in time.

The initial proposal of this paper was first to analyse the effect of time and whether it has a large effect on the segregation of clusters. The activity of links and clusters was plotted, with analysis of whether there was enough activity in common between clusters to suggest they had sufficient coselection chance. Expanding on that was a proposal of a potential solution: first performing clustering, and then joining disparate clusters by lowering the threshold for clusters that have few URLs active at the same time.

While results were inconclusive due to a lack of data to collect, it is hoped that the methodologies formed will be relevant to future studies as greater data is collected from various sources.





Table of Contents

1 Introduction
1.1 Motivation
1.2 Research Questions
2 Literature Review
2.1 Background
2.1.1 Clickthrough
2.1.2 Detecting ambiguity
2.2 Related Work
2.2.1 Coselections
2.2.2 Search Engine Results
3 Methodology
3.1 TimeStamping
3.2 Measuring Activity
3.2.1 Loss of activity
3.2.2 Loss of URLs
3.3 Cluster Disparate
4 Results and Discussion
4.1 Analysis of Data
4.2 Measurement of activity
4.3 Measurement of Loss of URLs
4.4 Cluster Disparate
5 Conclusion
6 References
7 Appendix
7.1 Appendix A: Coselection Count of terms with at least one cluster
7.2 Appendix B: Accuracy of cluster disparate on URLs with an existing cluster
7.3 Appendix C: URL distribution





1 Introduction

Trails of data generated from users interacting with search engines provide a significant resource for classifying information on the World Wide Web. The patterns of user behaviour found in the search logs help indicate the context a user is applying to a search term. It has been proposed that this information can aid in ambiguity and synonym detection (Ashman et al 2011), amongst other useful tools.

Initial progress began with clickthrough data, which proved a useful source for clustering resources together (Beeferman & Berger 2000). The process involves gathering URLs selected by users under a search term. Though it was useful to find which search terms had URLs in common, ultimately coselection data would provide a more useful metric for indicating the relevance of URL to URL by wrapping up URLs selected together in the one query. While the exact relationship between URLs can vary depending on user intent, Ashman et al (2011) have found that users generally search on a term with a single semantic purpose in mind. Though users may occasionally choose something irrelevant to an objective, the majority of coselected URLs seem to indicate a strong mutual relevance, much more so than many other selection methodologies.

This process overcomes two significant hurdles to past terminology detection. The first is that the process exploits the important factor that users are making a direct judgement on the information that fulfils their needs for the terms they have specified. By contrast, semantic and lexical analysis has struggled from being unguided and lacking human involvement (Tamir & Rapp 2003). Meanwhile, the oldest method, using human judgement, is very time consuming if a complete picture is to be obtained (Riloff 1993). Coselection overcomes these two hurdles by providing user relevance judgement from an activity people perform ubiquitously in their daily lives.

This thesis proposes to detect ambiguity by counting the number of clusters generated. This method of detecting ambiguity first involves using coselections as a similarity measure to aggregate semantically-similar collections of URLs. This is created by first forming a term graph of URLs for each term, where edge weights indicate how many times a URL is selected in common with another URL. Clusters are then formed by aggregating URLs that are coselected together regularly enough to indicate they are of the same sense. Each cluster should therefore represent part or all of a sense a search term can be used in.
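The term graph and threshold-based aggregation described above can be illustrated with a minimal sketch. It is not the clustering procedure used in the work this thesis builds on (Section 2.2.1 notes that DBSCAN was used there); a plain connected-components grouping over sufficiently heavy edges is swapped in to keep the example short. The function names, the example URLs and the threshold value are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_term_graph(sessions):
    """Count, for one search term, how often each pair of URLs is coselected."""
    weights = Counter()
    for urls in sessions:
        # Every unordered pair of distinct URLs chosen in one session
        # counts as one coselection (one unit of edge weight).
        for a, b in combinations(sorted(set(urls)), 2):
            weights[(a, b)] += 1
    return weights


def cluster_by_threshold(weights, threshold):
    """Group URLs joined by edges of weight >= threshold (connected components).

    Assumption: this is a simple stand-in for the real clustering algorithm.
    """
    neighbours = {}
    for (a, b), weight in weights.items():
        if weight >= threshold:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    clusters, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            url = stack.pop()
            if url in component:
                continue
            component.add(url)
            stack.extend(neighbours[url] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Hypothetical sessions for a single search term.
sessions = [
    ["castle.example/a", "castle.example/b"],
    ["castle.example/a", "castle.example/b"],
    ["person.example/x", "person.example/y"],
    ["person.example/x", "person.example/y"],
    ["castle.example/a", "person.example/x"],  # a single stray coselection
]
graph = build_term_graph(sessions)
clusters = cluster_by_threshold(graph, threshold=2)
print(len(clusters), "clusters at threshold 2")
```

With the threshold at 2 the stray coselection is ignored and two clusters remain; lowering it to 1 merges everything into one cluster, which is the trade-off between sense singularity and fragmentation that the thesis examines.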






Figure 1-1 Term graph for "pernstejn" with vertices corresponding to Web resources (Ashman et al 2011)

Experiments so far indicate that clusters can be successfully resolved to sense singularity; however, there are currently also many clusters for a single sense. This research aims to address issues whose resolution can reduce the number of clusters to something more meaningful. One major issue that appears to be creating more clusters is the effect of volatile top search results over time. If changes happen too abruptly, old URLs cannot be coselected with new URLs, which creates division in clusters. The true extent of this effect will be measured, and possible solutions to reduce the effect will be researched.

1.1 Motivation

Too many clusters for the one meaning results in a large number of disparate chunks of data that are difficult to analyse; however, the fact that they were still mostly sense-singular indicated a solid platform to build on. Ideally, by aggregating those semantically-similar clusters, it will become feasible to judge whether the term is ambiguous or not by counting the number of clusters. It was postulated in Ashman et al (2011) that more clusters indicated more potential ambiguity, although the data they investigated did not confirm or deny this.

Ashman et al (2011) speculated that more data would bring together bigger clusters. Though such a quantity of data would likely be met by major search engines for common terms, it is not available to current research. An issue speculated to be breaking up the results is the effect of volatile top search results. Since users mainly select from the first page (Jansen & Spink et al 2006), if there are no smooth, gradual transitions over time, there will be breaks in clusters due to a lack of coselections to join them together. To find the effect of time, this paper proposes to timestamp the data to uncover the evolution of clusters in time, thereby discovering whether there are broken links between large clusters of the same meaning at a certain period in time.

This paper therefore aims to discover the effect this phenomenon has on the clustering and look into
possible solutions for overcoming this problem.

1.2 Research Questions

QUESTION 1

Does the effect of volatile top results create separation in clusters?

If users mostly select only from the first page of results (Jansen & Spink et al 2006), we expect to find that rapid change means the opportunity for coselections to be created drops significantly as soon as one of the participating pair leaves the first page.

QUESTION 1.1

Can clusters separated by time be brought together and retain sense-singularity?

We plan to compare the outputs of standard clustering against aggregation of temporally-separate clusters to see whether it marks an improvement on the whole-dataset cluster aggregation process.

QUESTION 2

Given a set of these aggregated clusters, is it feasible to determine via cluster cardinality
whether or not a given term is ambiguous?





We postulate that an improved clustering mechanism will successfully aggregate at least some formerly-distinct clusters with the same semantics. This should result in a significant correlation between the number of clusters for a term and a ground-truthed value for the ambiguity of that term.
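As a rough illustration of what such a correlation check might look like, the sketch below computes a Pearson correlation coefficient between cluster counts and ground-truthed sense counts. The choice of Pearson's coefficient is an assumption (the thesis does not name a statistic at this point), and the numbers are invented purely to exercise the function; they are not results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Assumes both lists contain at least two distinct values, otherwise the
    denominator would be zero.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Invented illustration data: one entry per search term.
cluster_counts = [1, 1, 2, 3, 4]  # clusters found for each term
sense_counts = [1, 1, 2, 2, 4]    # hand-judged number of senses per term
r = pearson(cluster_counts, sense_counts)
print(round(r, 2))
```

A value of r close to 1 would support counting clusters as an ambiguity signal, while a value near 0 would suggest that fragmentation or noise still dominates the cluster count.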

2 Literature Review

2.1 Background

2.1.1 Clickthrough

Clickthrough data provides a key resource on how search engines are used. It records a history of a user's interactions; however, it does not resolutely state what the user's intentions are or how successful they were. Nevertheless, it has been speculated that, in the great quantities that are possible on the World Wide Web, a good analysis of the data can be even more accurate than using explicit feedback from the user (Dou et al 2008). As a result, the data has been a key area of interest since search engines became a widespread means to browse the internet.

Applications

The first recorded use was when Lieberman (1995) applied the resource to dynamic personalized tools for browsing the web in an application called Letizia. Since then it has proven popular for a wide range of different applications, of which one of the most attractive has been to improve the output of the search rankings. Many researchers have tackled this issue (Joachims 2002)(Agichtein et al 2006)(Carterette & Jones 2007)(Gao et al 2009)(Dupret & Liao 2010), all keen to increase the accuracy of ranking automation over a universal range of knowledge that is so large that it cannot be ranked completely manually. This field has been of great value in creating accurate interpretations of clickthrough data that are otherwise muddied by fuzzy interpretations of user decisions.

Numerous other components of search engines have also benefited from feeding back search log data. Sun et al (2005) found they could improve the captions of search results by determining significant words in regularly clicked documents. That captions featuring the terms of the query are more helpful is also supported by Clarke et al (2007), who performed a wide analysis of what makes a good caption, supported by the clickthrough data they applied.

More broadly, the information collated can deconstruct how search engines are used. Pass et al (2006) derived a wide set of statistics that help create a picture of user behavior, while Ashkan et al (2008) went into more detail to uncover the intent of a user's search as transactional, navigational or commercial, which was proposed to assist associated advertisements.

Clickthrough data also has value as a means to extract meaningful associations between URLs and categorize the broad range of information on the Internet. For instance, Xu et al (2009) performed Named Entity Mining on the data, compiling the links from specific search terms into various information types. Another idea has been to cluster resources together where a number of links are commonly selected for multiple search terms (Beeferman and Berger 2000). This idea of clustering links into useful groups is a core foundation of the work of Ashman et al (2011), which is the groundwork this paper is derived from.

Clustering

The idea of clustering URLs and search terms was first initiated by Beeferman and Berger (2000). Search terms that had many of the same URLs selected were grouped together along with their associated URLs. This resulted in clumps of information sharing common ground. Its main limitation was that it required top search results to have the same URLs available for multiple terms, but it was nonetheless capable of grouping a vast array of terms for many data sets. A proposed application for the results was to provide search term recommendations.

As a direct expansion to Beeferman and Berger, Chan et al (2004) noted the algorithm was subject to noise, as a link with only very few clicks in common could be included. They proposed a cut-off proportional to how many times the links have been chosen under those terms against the overall number of document selections for those terms. This effectively eliminated noise as the major source of error.
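One loose reading of such a proportional cut-off is sketched below. It is not the formula from Chan et al (2004), which is not reproduced in this thesis; the ratio definition, the function names, the example click counts and the default cut-off of 0.2 are all assumptions chosen for illustration.

```python
def shared_click_ratio(clicks_a, clicks_b):
    """Fraction of all clicks under two terms that landed on URLs common to both.

    clicks_a, clicks_b: dicts mapping URL -> number of clicks under that term.
    """
    shared_urls = set(clicks_a) & set(clicks_b)
    shared = sum(clicks_a[url] + clicks_b[url] for url in shared_urls)
    total = sum(clicks_a.values()) + sum(clicks_b.values())
    return shared / total if total else 0.0


def should_link(clicks_a, clicks_b, cutoff=0.2):
    """Link two terms only if enough of their clicks fall on shared URLs."""
    return shared_click_ratio(clicks_a, clicks_b) >= cutoff


# Hypothetical click counts for three search terms.
jaguar_car = {"cars.example": 40, "wiki.example/jaguar": 2}
jaguar_cat = {"zoo.example": 30, "wiki.example/jaguar": 3}
big_cat = {"zoo.example": 25, "wiki.example/cats": 5}

print(should_link(jaguar_car, jaguar_cat))  # few shared clicks
print(should_link(jaguar_cat, big_cat))     # many shared clicks
```

Under these invented counts the first pair shares only 5 of 75 clicks and is rejected as noise, while the second pair shares 55 of 63 clicks and is kept.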

Further, by considering each user's actions separately, Leung et al (2008) found they could disambiguate a user's intentions. If a user's choices fit the mould of a certain group of users that chose the fruit apple, they could receive results on that specific option over Apple computers, which fit a different group of users.

Similarly, Gao et al (2010) suggest search terms that are clustered together can be considered synonyms. To do so they followed a particularly novel approach where queries are considered similar if the titles of their main URLs regularly selected in clickthrough data feature the same bi-phrases. This helped extend Beeferman and Berger's concept by being able to cluster multiple search terms even if they do not feature common URLs in the top search results, while also aiding a common language metric between search terms and titles/documents.

A unique form of clustering was further introduced by Ashman et al (2011), which utilized coselections as a similarity metric between URLs. The binding together under the one search sense was much stronger in this scenario than that of using the entire search history of individual users as a similarity metric. Greater detail can be found in 2.2.1 Coselections.

Accuracy

A key factor in clustering information sources is the presumption that documents selected tend to be relevant to the user's original search sense. It has been speculated that the usefulness of abstracts may be an issue, but they were verified to be helpful at a rate of 82.6% (Joachims et al 2007). Moreover, the larger issues found are the user's decision making and their tendency to browse, which account for the total accuracy of documents being relevant to the search sense being only 52% (Scholer et al 2008). Nevertheless, Dou et al (2008) assert that a good analysis of the data, combined with the huge quantities possible on the World Wide Web, may produce better results than direct user feedback.





One of the most common methodologies to smooth out the results is the idea that the higher the document in the search results, the more likely it is to be picked. The main factor is the trust bias the user has in the search engine that the higher results are more accurate. Therefore the quality of the retrieval system is a big factor in influencing which documents are selected. Further, Granka et al (2004) verified through eye tracking that users have a tendency to analyze results from top to bottom most of the time. As an extension of these ideas, Joachims et al (2002) created a methodology for analysis that asserted any document selected is more relevant than all the documents above it that were not selected by the user in the one session. This represents the main methodology that has been extended upon to determine the relevance of a document to the search term.
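The "selected beats unselected results above it" rule can be written down in a few lines. The sketch below only illustrates the rule as stated in the paragraph above; the function name and the placeholder result identifiers are assumptions.

```python
def preference_pairs(ranking, clicked):
    """Derive relative relevance judgements from one search session.

    ranking: list of result identifiers in the order they were displayed.
    clicked: the results the user selected.
    Returns (preferred, less_preferred) pairs: each clicked result is taken
    to be more relevant than every unclicked result ranked above it.
    """
    clicked = set(clicked)
    pairs = []
    for position, result in enumerate(ranking):
        if result not in clicked:
            continue
        for skipped in ranking[:position]:
            if skipped not in clicked:
                pairs.append((result, skipped))
    return pairs


ranking = ["r1", "r2", "r3", "r4", "r5"]
pairs = preference_pairs(ranking, clicked=["r3", "r5"])
for preferred, less_preferred in pairs:
    print(preferred, ">", less_preferred)
```

Note that the rule says nothing about two clicked results relative to each other, nor about results ranked below the last click.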

Another consideration suggests that image search may be substantially more accurate than text search. The theory is that images are a more direct description than captions are, which was proposed to lessen the problem of bad user judgment. These results were found to be extraordinarily accurate, at about 88% (Smith & Ashman 2009). However, recent research has suggested that text searches are in fact just as good, and the lesser quantity of image search data is a challenge to resolve.

2.1.2 Detecting ambiguity

In 2007 Ashman et al proposed a Global Perpetual Dictionary of everything. A key component was that search log data would be able to run an automatic scan for ambiguous terms without the need for human involvement. This research represents a shift towards involving users' implicit judgments that are carried out through a ubiquitous activity, rather than analyzing the structure of a discourse.

Disambiguation

Disambiguation has been of interest to the field of computing since its early days in the 1950s. The first identified field for disambiguation was in the context of machine translation of languages. Ambiguous terms were one of the key constraints to otherwise providing one-to-one translation. The early philosophy of the task was that one word at a time cannot determine the meaning of an ambiguous term, but given the context it can be resolved (Weaver 1955). Thus disambiguation involved forming a methodology of picking the right meaning of each ambiguous term based on what a sentence implied.

A key aspect of disambiguation is that it relies on a resource for the representations of each meaning in ambiguous terms. The enormous knowledge base required across the entire dictionary has been a significant hurdle to assigning a word its implied sense from a sentence (Gale et al 1992). One popular source has been Machine-Readable Dictionaries (MRDs), but so far the automatic extraction of large knowledge bases from them has not been successfully accomplished (Amsler 1980). WordNet is the only MRD that is widely available today and is limited by its hand creation. Finding the ambiguous term senses would be the first step towards a comprehensive MRD, as is the case with WordNet, which uses a synset tree of words to represent the meanings (Tamir & Rapp 2003). Ashman et al's (2007) proposal represents a way of discovering ambiguous terms and potentially their meanings in a more comprehensive fashion by making valid interpretations on the way they are used in an implicit activity.

Machine Readable Dictionaries






Machine readable dictionaries are currently the most common resource for the structure of word senses. WordNet, the most fully formed today, features words grouped into synsets of the same sense, referencing other synsets with key relationships. Ambiguity is represented where a term is found in multiple synsets (senses). The biggest difference between senses is that each should reference a different set of hypernyms, implying they are a different kind of entity.


Nevertheless, using MRDs as a resource for ambiguity has been limited by the unclearly defined boundaries of senses. There have been complaints that MRDs often make unnecessary and difficult "forced-choices" (Dolan 1994). Attempts have been made to address this, such as clustering with the aid of a thesaurus to help eliminate distinctions that are unnecessarily fine-grained (Chen et al 1998).

These tough distinctions made in an MRD can lead to too many unimportant senses, which clutter up tasks such as disambiguation (McCarthy et al 2004). It has been suggested that ranking of sense relevance can therefore be of value to distinguish which are most useful. These sorts of findings pose challenges to clustering coselection data, as it is not clear whether users will mainly coselect on major senses or if minor ones will be distinguished. It is presumed that, most likely, the minor senses will build into major ones by the nature of users selecting even small similarities, but the nature of the boundaries needs to be investigated.

Traditionally all the crafting of MRDs has been done manually, by hand and research, but increasingly there have been ways of finding relationships through more automated forms. A common methodology for determining hypernyms has been to look through discourse for commonly occurring patterns:

"Bruises[,] wounds[,] broken bones [or other] injuries . . .": where the nouns are implied to be a type of injury (Hearst 1992)

"Boeing[, a] defense contractor": where "defense contractor" is an appositive of Boeing (Caraballo 1999).

By finding hypernyms this way, nouns sharing the same hypernyms likely indicate a shared sense, allowing somewhat of a synset to form (Caraballo 1999). In another scenario, senses and their descriptions can be found more directly using a complex set of trigger words, though this is limited to specialized topics of content by its less generic nature (Riloff 1993). Meanwhile, Agirre et al (2000) have used the World Wide Web to enrich and refine the current content in WordNet. Despite these efforts, the complete automation of further MRD construction is still a long way off in breadth and accuracy, beyond providing possible guides. This paper embarks on adding to the knowledge of ambiguous terms, with the clusters of coselected URLs ideally representing the major senses.

Word Sense Induction

The most direct field of finding ambiguity is word sense induction, since the best way to find ambiguous terms is to find the distinct senses. In word sense induction, these senses are typically found by assigning words commonly used with the target word to clusters, which are formed by their usage in discourse (Pantel and Lin 2002) (Dorrow & Widdows 2003). Thus far an accuracy of 72% that a cluster correlates to a correct sense has been achieved.

Increasingly, the World Wide Web represents a large-scale resource that is easy to access for gathering word senses. The use of Google to mine the web for senses was suggested by Tamir & Rapp (2003), who were inspired by the work of both Gale et al (1992) and Yarkowsky (1995), which suggests that ambiguous words only occur in one sense in a given document and that words close to a term give some indication of the sense of the target word. This leads to the assumption that a good indicator of ambiguity is when two words each commonly occur with the target word but rarely occur together with the target word. Ultimately, finding two words that represent two different meanings of an ambiguous term was successful, but often the association of the words to the meaning is weak since regularly used words are rare.
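The co-occurrence indicator described above can be sketched as a simple count over documents. This is only an illustration of the idea, not the procedure of Tamir & Rapp (2003): it works at whole-document level rather than on words close to the target, and the function names, the example documents and both thresholds are assumptions.

```python
def ambiguity_evidence(documents, target, word_a, word_b):
    """Count documents containing the target with word_a, with word_b, and with both.

    All words are expected in lower case; documents are split on whitespace.
    """
    with_a = with_b = with_both = 0
    for text in documents:
        words = set(text.lower().split())
        if target not in words:
            continue
        has_a, has_b = word_a in words, word_b in words
        with_a += has_a
        with_b += has_b
        with_both += has_a and has_b
    return with_a, with_b, with_both


def looks_ambiguous(with_a, with_b, with_both, min_support=2, max_overlap=0.2):
    """Both words accompany the target often, yet rarely appear together with it."""
    if min(with_a, with_b) < min_support:
        return False
    return with_both / min(with_a, with_b) <= max_overlap


# Hypothetical miniature corpus.
documents = [
    "palm tree on the beach",
    "the palm tree needs water",
    "palm hand reading",
    "lines on the palm of the hand",
    "tree surgeon",
]
counts = ambiguity_evidence(documents, "palm", "tree", "hand")
print(counts, looks_ambiguous(*counts))
```

Here "tree" and "hand" each accompany "palm" in two documents but never in the same one, which is the pattern taken to suggest two separate senses.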

In spite of these efforts, the biggest difficulty so far is the need to narrow down terms and search senses; the processes so far are too complex to be applied across a full breadth of terms. Since clustering coselection search data does not require analysis of masses of discourse in the same way, the use of implicit judgments as a one-to-one relationship provides a way to streamline the complexity of such a task.

Disambiguation in IR

Disambiguation has also featured in the context of Information Retrieval. The philosophy has been that the results provided by IR can be more successful when all resources are cleared of ambiguity. The first step involves replacing all the ambiguous terms in every discourse with words that correlate to the more specific meaning implied by their context (Voorhees 1993) (Sussna 1993). However, this proved to be a futile attempt, as it only produced worse search results than the original unedited resource.

Conversely, Sanderson (1997) measured the effect of increased ambiguity within discourses on IR. The results found that, by appending random words to those in the discourse, the added ambiguity did not hinder the success of an IR system as much as expected. Disambiguation of resources was therefore a challenge offering only small gains, and a difficult one due to the low accuracy of current disambiguators.

A knock-on effect found by Krovetz & Croft (1992) on the weakness of disambiguation in IR related to the way search terms were used. They found two major challenges that were causing ineffective results:

1. Most ambiguous terms have a dominant meaning, so most results feature the resources that most users were searching for.

2. Search terms often overcome the need for disambiguation by collocation: the more words found in the search term, the more likely other words will imply their meaning, like a sentence would. With all search terms being searched on, more searches are quickly funneled into the context of them all.

These two examples highlight that queries need to be short for ambiguity to be an issue in both the term and the discourse. Further, by the nature of a search engine, small search terms can overcome their ambiguity by being applied back in with added words for more specific results.





2.2 Related Work

2.2.1 Coselections

A special use of clickthrough data emerged in collecting it as coselections, where multiple URLs are chosen in the one session (Smith et al 2009). It provides a direct similarity metric between URL and URL under the assumption that users usually select with a specific search sense in mind. By extension of this idea, each separate cluster should ideally be unambiguous and represent an individual meaning for the search terms.

Through the DBSCAN algorithm, Ashman et al (2011) successfully reduced clusters to a single sense; however, multiple clusters for the same sense regularly emerged. A major factor in the lack of accuracy was the small amount of data available. Additionally, it was speculated that changes in top results would cause fragmentation in clusters, since two URLs need to be there at the same time to have a good chance of being coselected. This paper aims to address this issue by discovering how significantly time is affecting clusters and to provide some potential solutions to overcome it.

Caon et al (2012) further expanded on this work by utilizing a cluster-by-overlap method to link similar clusters over different search terms. These relationships should indicate that two terms with a cluster in common have a similar usage, which suggests they are synonyms. Through tweaking the DBSCAN algorithm parameters, these results were successfully resolved to a large number of positives that were all verified in accuracy.

2.2.2 Search Engine Results

For two results to be coselected together, their position in the results is important. Users tend to view only the first page of results about 85.2% of the time, a figure which has been increasing over the years as search engine accuracy and the trust invested in them by users have grown (Silversteen et al 1999) (Jansen & Spink 2006). URLs therefore have a much smaller chance of being coselected if they are not both on the first page.


This issue becomes more apparent with top results frequently evolving. New pages are regularly being added, and the measurements used to rank them also fluctuate. With this evolution, many URLs have a limited time to appear together on the first page before they lose their main chance of being coselected, while many others will never spend time together. Often results will show very little consistency over time and can jump up and down with very little to suggest a major change occurred (Bar-Ilan 1999). As is the case with Google, there are many different measurements used for rankings, such as anchor pages that link to the main page. These anchor pages add to a page's significance but also feature additional descriptions helpful for its relevance (Brin & Page 1998). With so many variables, results can change for many different reasons.



Nevertheless, Bar-Ilan et al (2006) plotted changes at daily intervals and found Google to be mostly stable, with small changes happening often but usually only in increments and rarely making big jumps (see Figure 2-1). The effect these changes have on clusters can mainly be reduced to how often URLs occur together on the first page, irrespective of how many times each has gone in and out, as this indicates their coselection chance. It is important to analyze the effect this has on the construction of clusters.
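If daily snapshots of the result list were available, the coselection chance of a pair of URLs could be approximated by counting the days on which both sit on the first page. The sketch below assumes such snapshots exist, which is not the case for the data used in this thesis; the function name, the snapshot contents and the page size are illustrative assumptions.

```python
def days_together(snapshots, url_a, url_b, top_n=10):
    """Count snapshots in which both URLs appear within the first top_n results.

    snapshots: list of rankings, one per day, each a list of URLs in rank order.
    """
    return sum(
        1
        for ranking in snapshots
        if url_a in ranking[:top_n] and url_b in ranking[:top_n]
    )


# Hypothetical rankings for one query over four days; the "first page" is
# shrunk to three results to keep the example readable.
snapshots = [
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["c", "d", "a", "b"],
    ["d", "e", "c", "a"],
]
print(days_together(snapshots, "a", "b", top_n=3))
print(days_together(snapshots, "b", "d", top_n=3))
```

In this invented example "a" and "b" share the first page on two days, whereas "b" and "d" never do, so no coselection could join them however relevant they are to each other.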







Figure 2-1 The top-ten results of the query "DNA evidence" on Google (Bar-Ilan et al 2006)


An additional challenge posed to clustering coselections is that the vast majority of search terms are used very few times. Silversteen et al (1999) found that, out of a very large set of 154 million queries, only 13.4% occurred more than 3 times. By contrast, the top 100 query terms tend to be used as much as 20% of the time (Spink et al 2002). Good results are therefore constrained to a smaller set of terms, as many do not even have a chance of being clustered, let alone having sufficient detail to be successful. For broader results, clustering needs to be as efficient as possible and make the best use of as little data as possible.

3 Methodology

3.1 Time Stamping

In order to analyze when clusters are broken due to volatile top search results, each coselection was
time stamped. All the data from the log files was reprocessed, since the previously extracted
coselections did not contain the time information. Without the old parser, this meant repeating the
whole extraction. The log files were extracted from a server for a workgroup of computers where users
are more likely to issue similar search queries. Out of all the interactions possible through the server,
queries from Google, which is by far the most popular search engine (ebiz 2012), are extracted. Within
each transaction exists a time stamp which is used as a record of when each interaction occurred. This
allows tracking the evolution of clusters in time to determine how clusters have been broken. Though it
is impossible to backtrack to say when and where the URLs were part of the top results, the activity of
the URLs provides some observations.







Figure 3-1 An example of server log data

3.2 Measuring Activity

The primary methodology is formed from the conjecture that URLs forming part of a coselection are in
or very close to the top search results and, when no longer selected, may have fallen outside. The major
concern is how disparate clusters are, which is measured by how often URLs selected in one cluster are
selected at the same time as URLs in another. By measuring disparateness this way, the activity of any
two clusters can be correlated to one of three main outcomes:



Completely disparate clusters: These clusters were likely impossible to join by the coselection
metric as they have not had two URLs selected in the same time frame. Too many of these
indicate volatility of results is causing disruption in the clustering.

Slightly disparate clusters: These clusters still have some URLs selected at the same time but
only very rarely. Therefore they have less chance of gathering enough coselections to join and,
with less chance, a smaller epsilon value may be more appropriate.

Non-disparate clusters: These clusters frequently overlap in activity but, due to few coselections
in common, they are two distinct clusters. This is the ideal scenario, which likely indicates a
strong correlation to clustering success. If multiple clusters still represent the same meaning in
this scenario, then it is a limitation of clustering coselections itself. Furthermore, we speculate
that in this scenario the more coselections there are, the more likely the clustering is to be
accurate, so how often two non-disparate clusters represent the same meaning should be measured
against the coselection size.

Since clusters representing the same meaning are the main factor being evaluated, individual URLs are
of less significance; however, the same measure can be applied to them, particularly for small amounts
of data, to indicate where they are situated.
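The three outcomes above can be illustrated with a minimal sketch. It assumes each cluster is represented as a mapping from URL to the set of time iterations in which that URL was selected; the names `Cluster`, `active_iterations`, `classify_pair` and the cut-off `rare_max` are hypothetical and chosen only for illustration, and the activity data is made up.

```python
from typing import Dict, Set

# Assumed representation: {url: set of iteration indices in which it was selected}.
Cluster = Dict[str, Set[int]]


def active_iterations(cluster: Cluster) -> Set[int]:
    """Iterations in which at least one URL of the cluster was selected."""
    active: Set[int] = set()
    for iterations in cluster.values():
        active |= iterations
    return active


def classify_pair(a: Cluster, b: Cluster, rare_max: int = 2) -> str:
    """Label a pair of clusters by the number of iterations they share.

    rare_max is a hypothetical cut-off for "only very rarely".
    """
    shared = len(active_iterations(a) & active_iterations(b))
    if shared == 0:
        return "completely disparate"
    if shared <= rare_max:
        return "slightly disparate"
    return "non-disparate"


# Made-up activity: cluster1 is active early and late, cluster2 in between.
cluster1 = {"url1-1": {1, 2, 13, 14}, "url1-2": {2, 3, 14}}
cluster2 = {"url2-1": {4, 6, 7, 8, 9, 10}, "url2-2": {4, 5, 8}}
label = classify_pair(cluster1, cluster2)
print(label)
```

Counting shared iterations at the cluster level, rather than per URL pair, keeps the label independent of cluster size; whether that is the right granularity is a judgement call.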

3.2.1 Loss of activity

Since many URLs are active on and off over a period of time, a measurement of complete inactivity
would likely have to provide leeway of a small time period, allowing a URL to chain together
activity over some time. Scenarios where there has been regular activity on a URL followed by a
complete halt provide our biggest indication that a popular URL has dropped off for good.
This is an alternative to disparateness that is slightly more clear-cut: the URL has likely fallen
outside of the top results. The number of URLs completely segregated between two clusters then
provides an idea of how many potential joins were greatly affected by volatile results.
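A minimal sketch of this drop-off test follows. It assumes a URL's activity is given as the iteration indices in which it was selected; the function name `dropped_off` and the parameters `leeway` and `min_active` are hypothetical, and their default values are placeholders rather than values taken from the study.

```python
from typing import Iterable


def dropped_off(active: Iterable[int], last_iteration: int,
                leeway: int = 2, min_active: int = 3) -> bool:
    """True if a URL showed regular activity followed by a complete halt.

    leeway: inactive iterations tolerated inside a run of activity (assumed).
    min_active: active iterations needed to call the activity regular (assumed).
    """
    iterations = sorted(set(active))
    if len(iterations) < min_active:
        return False  # too little activity to be called regular
    # Regular activity: no silent stretch inside the run exceeds the leeway.
    gaps = [later - earlier - 1 for earlier, later in zip(iterations, iterations[1:])]
    if any(gap > leeway for gap in gaps):
        return False
    # Complete halt: silence since the last selection exceeds the leeway.
    return last_iteration - iterations[-1] > leeway


print(dropped_off([1, 2, 4, 5], last_iteration=12))
```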

3.2.2 Loss of URLs

An overall measurement of the loss of URLs may also be formed for search terms with a lot of activity,
which may provide an average indicator of how often URLs drop out of the top search results. To
analyze the results, information is gathered in time fragments that may be analyzed weekly, fortnightly
or monthly. For each segment of time there should be a fairly similar number of URLs chosen, mainly
those in the top results but also a small number outside. As time increases, the total number of URLs
selected over all time increases while the number selected specifically in each time period should
remain roughly the same. This provides a key correlation to how many URLs drop outside the top
results. Additionally, in rare cases some URLs will come in and out of the top results; when they arrive
back, they will also be added, provided they reached the threshold of inactive iterations to be
considered completely inactive. An average number of new URLs per iteration over all the significant
search terms would then determine how often top results change.
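The counting described above could be sketched as follows, assuming the selections for one search term are already grouped into a list of URL sets, one set per iteration. The function name `new_urls_per_iteration` and the parameter `inactive_threshold` are hypothetical, as is the sample data.

```python
from typing import Dict, List, Set


def new_urls_per_iteration(iterations: List[Set[str]],
                           inactive_threshold: int = 3) -> List[int]:
    """Count the URLs that are new in each iteration.

    A URL counts as new if it was never selected before, or if it returns
    after at least inactive_threshold silent iterations (assumed threshold).
    """
    last_seen: Dict[str, int] = {}
    counts: List[int] = []
    for t, urls in enumerate(iterations):
        new = 0
        for url in urls:
            previous = last_seen.get(url)
            if previous is None or t - previous - 1 >= inactive_threshold:
                new += 1
            last_seen[url] = t
        counts.append(new)
    return counts


# Made-up example: "b" disappears after iteration 0 and returns in iteration 5.
history = [{"a", "b"}, {"a", "c"}, {"a"}, {"a"}, {"a"}, {"a", "b"}]
counts = new_urls_per_iteration(history)
# The first iteration is skipped in the average because every URL is new there.
average_new = sum(counts[1:]) / len(counts[1:])
print(counts, average_new)
```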

3.3 Cluster Disparate

In order to improve clusters, a cluster disparate function has been proposed to find clusters that are
rarely active at the same time and allow a looser epsilon value. To determine disparateness, a timeline
of URL usage is required where an individual iteration lasts a small period of time like a fortnight or a
month, the exact time being determined by how often URLs tend to drop off the top results, as discussed
earlier. Each pair of clusters that has few iterations selected in common is considered disparate. Since
these clusters have much less chance to gather the coselections necessary to join, the epsilon value
would therefore be reduced to make clustering easier.
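A timeline of URL usage of this kind might be built as in the sketch below. It assumes monthly iterations and selections supplied as `(timestamp, url)` pairs; the helper names `month_index` and `build_timeline` and the sample records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def month_index(ts: datetime, start: datetime) -> int:
    """Number of calendar months between start and ts (0 for the first month)."""
    return (ts.year - start.year) * 12 + (ts.month - start.month)


def build_timeline(selections: Iterable[Tuple[datetime, str]],
                   start: datetime) -> Dict[str, Dict[int, int]]:
    """Map each URL to {iteration index: number of selections in that month}."""
    timeline: Dict[str, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for ts, url in selections:
        timeline[url][month_index(ts, start)] += 1
    return {url: dict(counts) for url, counts in timeline.items()}


# Made-up selection records for two URLs.
records = [
    (datetime(2009, 1, 5), "url-a"),
    (datetime(2009, 1, 20), "url-a"),
    (datetime(2009, 3, 2), "url-a"),
    (datetime(2010, 2, 1), "url-b"),
]
timeline = build_timeline(records, start=datetime(2009, 1, 1))
print(timeline)
```

A fortnightly iteration would only need a different index function, for example one based on the number of days since `start` divided by 14.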


Figure 3-2 An example of disparateness

To measure whether a pair is in fact disparate, there are two major cut offs. The first aggregates,
over every iteration, the number of URLs active in whichever of the two clusters has fewer URLs active
in that iteration.

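Measuring Disparateness (number of times URLs are selected)

Completely Disparate Clusters (no coselections possible), iterations in months

| Cluster | URL | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster1 | url1-1 | 3 | 4 | - | - | - | - | - | - | - | - | - | - | 2 | 1 | - | - | - | - | - |
| Cluster1 | url1-2 | - | 4 | 1 | - | - | - | - | - | - | - | - | - | - | 3 | - | - | - | - | - |
| Cluster2 | url2-1 | - | - | - | 4 | - | 6 | 4 | 1 | 2 | 3 | - | - | - | - | - | - | - | - | - |
| Cluster2 | url2-2 | - | - | - | 3 | 2 | - | - | 3 | - | - | - | - | - | - | - | - | - | - | - |
| Cluster2 | url2-3 | - | - | - | - | 1 | 7 | - | 1 | 4 | 5 | - | - | - | - | - | - | - | - | - |

Disparate Clusters (lower epsilon may be helpful), iterations in months

| Cluster | URL | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster1 | url1-1 | 3 | 4 | - | - | - | - | - | - | - | - | 2 | - | 2 | 1 | - | - | - | - | - |
| Cluster1 | url1-2 | - | 4 | 1 | - | - | - | - | - | - | - | - | 1 | - | 3 | - | - | - | - | - |
| Cluster2 | url2-1 | - | - | - | 4 | - | 6 | 4 | 1 | 2 | 3 | 1 | - | - | - | - | - | - | - | - |
| Cluster2 | url2-2 | - | - | - | 3 | 2 | - | - | 3 | - | - | - | 2 | - | - | - | - | - | - | - |
| Cluster2 | url2-3 | - | - | - | - | 1 | 7 | - | 1 | 4 | 5 | - | - | - | - | - | - | - | - | - |
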




The reason not every URL pair is measured is that doing so would increase the count by multiples, which
is not proportional to the coselection chance, as that depends on the number of URLs available to
coselect. This measurement is referred to as α and is the maximum cut off possible for a pair of
clusters to be considered disparate.
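As a sketch of how α could be computed, the following assumes each cluster is a mapping from URL to the set of iterations in which it was selected. The function names `urls_active` and `alpha` are hypothetical, and the activity data is made up in the style of Figure 3-2 rather than taken from the study.

```python
from typing import Dict, Set

Cluster = Dict[str, Set[int]]  # assumed: {url: iterations in which it was selected}


def urls_active(cluster: Cluster, t: int) -> int:
    """Number of URLs of the cluster selected at least once in iteration t."""
    return sum(1 for iterations in cluster.values() if t in iterations)


def alpha(a: Cluster, b: Cluster, total_iterations: int) -> int:
    """Sum, over all iterations, of the smaller of the two active-URL counts."""
    return sum(
        min(urls_active(a, t), urls_active(b, t))
        for t in range(total_iterations + 1)
    )


# Made-up monthly activity; the clusters only overlap in months 11 and 12.
cluster1 = {"url1-1": {1, 2, 11, 13, 14}, "url1-2": {2, 3, 12, 14}}
cluster2 = {
    "url2-1": {4, 6, 7, 8, 9, 10, 11},
    "url2-2": {4, 5, 8, 12},
    "url2-3": {5, 6, 8, 9, 10},
}
value = alpha(cluster1, cluster2, total_iterations=19)
print(value)
```

Taking the minimum per iteration means an iteration with one active URL on one side contributes at most 1, however many URLs are active on the other side.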

The second measurement is a finer grained cut off that is proportional to the total number of URLs
featured in the smaller cluster. It applies only to small clusters that require a smaller cut off than
the maximum possible, since disparateness is less meaningful for smaller clusters, which are much more
likely to have less activity than larger ones. The measurement finds the amount of α per URL in the
smaller cluster and determines the rate at which the cut off needs to increase, since the more URLs
there are, the less need there is to make the distinction.

If a pair of clusters passes these two checks, they are determined to be disparate. In this scenario a
smaller epsilon threshold for joining may be considered appropriate. The key is determining what values
constitute disparateness and whether the lower epsilon value remains valid, that is, whether it avoids
joining clusters that are not of the same sense.
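One possible reading of the two checks is sketched below. The text does not fix the cut-off values, so `alpha_max` and `alpha_per_url` are hypothetical placeholders, as are the function names; only the epsilon values 3 and 2 come from the study.

```python
def is_disparate(alpha: int, smaller_cluster_urls: int,
                 alpha_max: int = 6, alpha_per_url: float = 2.0) -> bool:
    """Apply both cut offs to a pair of clusters.

    alpha_max: assumed maximum alpha for any pair to count as disparate.
    alpha_per_url: assumed rate at which the finer cut off grows with the
    number of URLs in the smaller cluster.
    """
    passes_maximum = alpha <= alpha_max
    passes_proportional = alpha <= alpha_per_url * smaller_cluster_urls
    return passes_maximum and passes_proportional


def joining_epsilon(disparate: bool, epsilon: int = 3, relaxed_epsilon: int = 2) -> int:
    """Use the lower epsilon only for pairs judged disparate."""
    return relaxed_epsilon if disparate else epsilon


print(joining_epsilon(is_disparate(alpha=4, smaller_cluster_urls=2)))
```

With these placeholder values the proportional cut off is the binding one for clusters of up to three URLs, after which the maximum takes over.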


4 Results and Discussion

4.1 Analysis of Data

Unfortunately the server log data did not produce as many coselections as was expected; only
approximately 1 Google URL request was found per 750 lines. The vast number of interactions through
the server due to various applications and other tasks unexpectedly dominated web requests. As a
result, for the barest minimum of epsilon 3 and minimum nodes 2, only one term formed 2 clusters and
55 formed just one cluster. In most of these cases the coselection information was far from sufficient
to make valid judgements, at most only just meeting the epsilon and/or minimum nodes criteria.

The primary issue seems to be the quantity of searches available. The data set only uncovered a total
of 57,920 searches, and as a result the proportion of unique searches that occurred at least twice was
very small, 11.6%. This contrasts with the findings of Silverstein et al (1999), who had a data set
approximately 10,000 times larger, with 36.3% of unique searches occurring more than once. We
speculated that by having the data come from a select workgroup of computers with users having
similar tasks to perform, the repetition of results should be much higher for the amount of data
available. While this may still be the case, 57,920 searches are still insufficient to gain enough
repetition.

The challenge posed by such a small amount of data becomes accentuated when using coselections.
Coselections only account for 23.9% of total searches. With search terms rarely occurring multiple
times, the chance that a repeat occurrence is also a coselection is smaller still.

Definition of α (Section 3.3):

$$\alpha = \sum_{t_i = 0}^{\mathrm{totalTimeIterations}} \min\bigl(\mathrm{clusterA.URLsActive}(t_i),\ \mathrm{clusterB.URLsActive}(t_i)\bigr)$$





| Coselection Searches | Count |
|---|---|
| Total (non unique) | 13858 |
| Unique | 12545 |
| With at least two occurrences | 739 (5.89%) |
| With at least three occurrences | 191 (1.52%) |
| With at least four occurrences | 98 (0.78%) |

Table 4-1 Coselection data collected

| Searches | Count |
|---|---|
| Total (non unique) | 57920 |
| Unique | 39525 |
| With at least two occurrences | 4582 (11.59%) |
| With at least three occurrences | 1845 (4.67%) |
| With at least four occurrences | 1115 (2.82%) |

Table 4-2 Search Data collected
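The repetition figures in Tables 4-1 and 4-2 are proportions of unique search terms that occur at least a given number of times. A minimal sketch of that calculation follows; the function name `repetition_profile` and the sample queries are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def repetition_profile(queries: Iterable[str],
                       thresholds: Tuple[int, ...] = (2, 3, 4)) -> Dict[int, float]:
    """Fraction of unique queries occurring at least k times, for each k."""
    counts = Counter(queries)
    unique = len(counts)
    if unique == 0:
        return {k: 0.0 for k in thresholds}
    return {
        k: sum(1 for occurrences in counts.values() if occurrences >= k) / unique
        for k in thresholds
    }


# Made-up query log with four unique terms.
sample = ["pydoc", "pydoc", "pydoc", "sdl", "sdl", "hotmail", "pegi"]
profile = repetition_profile(sample)
print(profile)
```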

Coselections are also hampered by the necessity for the same two URLs to be selected in two different
searches in order for the relationship to gain greater significance. Just having one of those URLs in a
second coselection search does not raise the weight of any of those relationships above one. This
becomes an issue as out of the top 10 results there are 45 coselection relationships possible. The
problem posed is somewhat offset by large coselections that cover many of the relationships, as well as
by the tendency of users to select from the top and for the most relevant URLs to be selected more
often. However, it remains particularly significant since 60.0% of coselections only feature 2 URLs,
and the lack of data does not lessen its influence.
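The difference between coselection edges and coselection weights can be shown with a short sketch. It assumes each coselection search is given as a list of selected URLs; the function name `edge_weights` and the sample searches are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb
from typing import Iterable, List, Tuple


def edge_weights(coselections: Iterable[List[str]]) -> "Counter[Tuple[str, str]]":
    """Count how many searches coselected each unordered pair of URLs."""
    weights: "Counter[Tuple[str, str]]" = Counter()
    for urls in coselections:
        for pair in combinations(sorted(set(urls)), 2):
            weights[pair] += 1
    return weights


# Number of possible relationships among the top 10 results.
possible_pairs = comb(10, 2)

# Made-up searches: only the pair (a, b) is coselected twice.
searches = [["a", "b"], ["a", "b", "c"], ["c", "d"]]
weights = edge_weights(searches)
edges = len(weights)                  # distinct relationships
total_weight = sum(weights.values())  # relationships counted with repetition
print(possible_pairs, edges, total_weight)
```

In this toy example only one of the four edges reaches a weight above 1, even though three searches contributed, which mirrors how slowly weights accumulate when most coselections contain just two URLs.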

| Coselection relationships | Count |
|---|---|
| Coselection searches | 13858 |
| Aggregate of URLs in each coselection query | 38908 |
| Total coselection edges | 50712 |
| Total coselection weights | 51528 |

Table 4-3 Coselection Relationships Data

Since coselection relationships grow triangularly with the number of URLs in a query, often the terms
with the highest aggregate coselection-weights were those with many coselection relationships of a
weight of just 1 each, caused by a single query with many URLs. These results typically embodied a
type of content like “luxury holiday” (406 coselections, 28 URLs in 1 search) or “military aircraft map
textures” (351 coselections, 26 URLs in 1 search), where a user selects a category, as with a browse
function, and multiple items are compared to find the most appropriate. In such instances a single
coselection search would far exceed the top 10 results typical of a first page.

Very few examples featured very high weights. One of the best results was “free sound effects” with
125 coselection-weights, but it had 97 edges between 23 URLs. That leaves only 28 coselections among
97 edges to increase their weights above 1. Rarely were these proportions exceeded, but with more data
such occurrences are expected to be more regular and bigger in size, as the effect of one or two users
clicking many links is outweighed by the majority.





By far the biggest activity over an extended period of time occurred on “teeside” and “teeside
university”; however, very few of these searches included coselections. Approximately 3-4 links had
consistent activity over 3 years, with other links being used only on occasion. This is likely due to
the navigational sense of the query, where the user only needs to find one of the 3-4 major portals for
Teesside University.

4.2 Measurement of activity

Little could be deduced about the activity due to such a small result set. With only one search term
featuring more than one cluster, nothing was conclusive on cluster disparateness or URL segregation.

The search term with 2 clusters indicated disparateness may be a factor, as it only measured an α of 4
with 2 URLs in the smallest cluster. Conversely, that may not indicate much, as there were not any
months where the less active cluster was active at a different time.

URLs individually provided more significant findings, as they are always expected to be low on data:
as coselections grow, they would otherwise already be in clusters. Remarkably, nearly all URLs were
active at the same time as another URL, indicating a coselection chance with the URLs featured, though
this rarely meant a coselection did occur.

For terms where a cluster was found, only 54 out of 432 URLs did not have a month active in common
with a cluster and only 15 were active more than 2 times with the same cluster. Most of the URLs
therefore fall into the slightly disparate range where a lower epsilon value may be valid, though this
disparateness may not be a great indication, as most of these URLs were never active at times the
clusters weren’t.

| Cluster Disparate on URLs (α <= 2) | Count |
|---|---|
| Search Terms with a cluster | 56 |
| Potential URLs | 363 |
| URLs with > 2 months in common | 15 |
| URLs with no months in common | 54 |

Table 4-4 Disparateness of URLs to existing cluster


4.3 Measurement of Loss of URLs

Gathering the average loss of URLs in the way described requires an enormous amount of data for a
select few search terms. This requirement could not be met.

4.4 Cluster Disparate

Since nothing of influence could be found on the disparateness of clusters, findings are yet to show
its effectiveness. The one search term with more than one cluster, “free sound effects”, did highlight
some potential in merging clusters, since the two clusters were distinctly of the same meaning and the
measurement of α was only 4 for 2 URLs in the smallest cluster. By lowering the epsilon from 3 to 2, a
join was found between the 2 clusters, which would indicate success of the function; however, many
more results are needed to find its true significance.





With 56 search terms featuring at least one cluster, we proceeded to trial cluster disparate on
merging individual URLs with an existing cluster, to gauge its accuracy where there are small amounts
of coselections. In actual results, the disparate function proved to be of little significance, as out
of 432 URLs only 15 were discarded for having more than 2 months in common with the cluster. In spite
of this, only 1 false positive URL was added to a cluster compared with 47 true positives.

| Cluster Disparate on URLs (α <= 2, epsilon = 2) | Count |
|---|---|
| True positive URLs added to a cluster | 47 |
| False positive URLs added to a cluster | 1 |
| Unknown pages added to a cluster | 1 |
| Search Terms with a cluster | 56 |

Table 4-5 Accuracy of Cluster Disparate Function on URLs

Such results would seem to suggest the strength of clustering coselections is so great that a lower
epsilon of 2 rather than 3 may be sufficient for joining URLs irrespective of disparateness. However,
these results may not be indicative of broader findings due to 3 key reasons:

Most of the queries featured were unambiguous; therefore all URLs are likely to point to the
same meaning regardless.

Most queries that are ambiguous tend to show the dominant meaning in most of the top
results, therefore the other meanings rarely get clicked on (Krovetz & Croft 1992).

Google is renowned for its URLs consistently being strongly relevant to the search term, so
completely irrelevant URLs are rarely a factor, let alone coselected multiple times.

For these same reasons, none of the 15 URLs discarded for having too many months in common with a
cluster were of a different search sense. Moreover, it is apparent that most searches aren’t
coselections, so many of the URLs with the most activity are still yet to have a solid chance at
joining other URLs. It may then be applicable to scale the months-in-common value in proportion to the
overall number of coselections in the search term, allowing for a bigger margin where there are fewer
relationships.

5 Conclusion

The biggest challenge posed for future study is to gather many more coselections. Even with data being
collected from the same workgroup of computers, the amount of repetition of results was not great
enough to form significant clusters regularly. With coselections accounting for less than a quarter of
searches, the chance a search term will feature coselections on multiple occasions is even smaller.

Nevertheless, the methodologies of measuring disparateness and segregation based on URL activity
appear to be sound as a means to determine coselection chance between two clusters. These are key
indicators of the extent to which volatility of results may be causing different clusters to form for
the same sense.





With coselections being difficult to accumulate, clustering needs to improve in effectiveness even as
the data set grows. A cluster disparate function was suggested for drawing together clusters that were
fragmented by evolution in time, though no conclusions could be drawn on its effectiveness. While
drawing URLs onto a cluster through this methodology appeared successful, this was offset by major
limitations in the data set available.

The ideal that coselections can determine ambiguity from cluster cardinality remains elusive, and it
appears difficult to attain full accuracy using coselections as the lone similarity metric due to
sparsity of activity.



6 References

Agichtein, E, Brill, E, and Dumais, S, 2006, Improving web search ranking by incorporating user behavior information, Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, ACM.

Amsler, R, 1980, The Structure of Merriam-Webster Pocket Dictionary, Ph. D. thesis, University of Texas at Austin, Austin.

Ankerst, M, Breunig, M, Kriegel, H & Sander, J, 1999, OPTICS: Ordering points to identify the clustering structure, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pp 49-60.

Ashkan, A, Clarke, C, Agichtein, E, & Guo, Q, 2008, Characterizing query intent from sponsored search clickthrough data, In SIGIR Workshop.

Ashman, H, Antunovic, M, Chaprasit, S, Smith, G & Truran, M, 2011, Implicit association via crowd-sourced coselection, Proc. Hypertext 2011, June 2011, 7-16, ACM.

Ashman, H, Zhou, D, Goulding, J, Brailsford, T, & Truran, M, 2007, The Global Perpetual Dictionary of Everything, Proc. Ausweb, http://ausweb.scu.edu.au/aw07/papers/refereed/ashman/paper.html.

Beeferman, D & Berger, A, 2000, Agglomerative clustering of a search engine query log, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pp 407-416.

Birant, D & Kut, A, 2007, ST-DBSCAN: An algorithm for clustering spatial-temporal data, Data & Knowledge Engineering, Vol. 60, Jan 2007, pp 208-221.

Caon, G, Antunovic, M, Truran, M & Ashman, H, 2012, Finding synonyms and other semantically-similar terms from coselection data, UniSA, SA.





Carterette, B, & Jones, R, 2007, Evaluating search engines by modeling the relationship between relevance and clicks, Computer Science Department Faculty Publication Series.

Chan, W, Leung, W & Lee, D, 2004, Clustering search engine query log containing noisy clickthroughs, 2004 International Symposium on Applications and the Internet.

Chen, J, & Chang, J, 1998, Topical clustering of MRD senses based on information retrieval techniques, Computational Linguistics.

Clarke, C, Agichtein, E, Dumais, S, & White, R, 2007, The influence of caption features on clickthrough patterns in web search, In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 135-142, ACM.

Dorow, B & Widdows D, 2003, Discovering corpus-specific word senses, Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics, Vol. 2, Stroudsburg, PA, pp 79-82.

Dou, Z, Ruihua, S, Xiaojie, Y, & Ji-Rong W, Are click-through data adequate for learning web search rankings?, In Proceeding of the 17th ACM conference on Information and knowledge management, pp. 73-82, ACM.

Dupret, G, & Ciya, L, 2010, A model to estimate intrinsic document relevance from the clickthrough logs of a web search engine, Proceedings of the third ACM international conference on Web search and data mining, ACM.

Gale, W, Church, K & Yarowsky, D, 1992, A Method for Disambiguating Word Senses in a Large Corpus, Computers and the Humanities, 26, pp 415-439.

Gao, J, Wei, Y, Xiao, L, Kefeng, D, and Jian-Yun, N, 2009, Smoothing clickthrough data for web search ranking, In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pp. 355-362, ACM.

Gao, J, Xiaodong H, & Jian-Yun, N, 2010, Clickthrough-based translation models for web search: from word models to phrase models, Proceedings of the 19th ACM international conference on Information and knowledge management, ACM.

Granka, L, Joachims, T, & Gay, G, 2004, Eye-tracking analysis of user behavior in WWW search, Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, ACM.

Guha, S, Rastogi, R & Shim, K, 1998, CURE: an efficient clustering algorithm for large databases, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pp 73-84.

Joachims, T, 2002, Optimizing search engines using clickthrough data, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM.





Joachims, T, Granka, L, Pan, B, Hembrooke, H, Radlinski, F, and Gay, G, 2007, Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search, ACM Trans. Inf. Syst., vol. 25, pp 7.

Karypis, G, Eui-Hong, H & Kumar, V, 1999, Chameleon: hierarchical clustering using dynamic modeling, Computer, Vol. 32, pp 68-75.

Leung, K, Wilfred N, & Dik L, 2008, Personalized concept-based clustering of search engine queries, Knowledge and Data Engineering, IEEE Transactions.

Lieberman, H, 1995, Letizia: An agent that assists web browsing, International Joint Conference on Artificial Intelligence, Vol. 14, Lawrence Erlbaum Associates Ltd.

Pantel, P & Lin, D, 2002, Discovering word senses from text, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp 613-619.

Pass, G, Abdur, C, & Cayley, T, 2006, A picture of search, Proceedings of the 1st international conference on Scalable information systems.

Riloff, E, 1993, Automatically constructing a dictionary for information extraction tasks, Proceedings of the National Conference on Artificial Intelligence, John Wiley & Sons Ltd.

Scholer, F, Shokouhi, M, Billerbeck, B & Turpin, A, Using Clicks as Implicit Judgments: Expectations Versus Observations, Advances in Information Retrieval, 2008, pp 28-39.

Smith, G, & Ashman, H, 2009, "Evaluating implicit judgements from image search interactions."

Smith, G, Brailsford, T, Donner, C, Hooijmaijers, D, Truran, M, Goulding, J, & Ashman, H, 2005, Generating unambiguous URL clusters from web search, Proceedings of the 2009 workshop on Web Search Click Data, pp. 28-34, ACM.

Sun, J, Hua-Jun, Z, Liu, H, Lu, Y, Chen, Z, 2005, CubeSVD: a novel approach to personalized Web search, Proceedings of the 14th international conference on World Wide Web, pp 382-390, ACM.

Tamir, R, & Rapp, R, 2003, Mining the Web to discover the meanings of an ambiguous word, Data Mining, 3rd IEEE International Conference on, 19-22 Nov. 2003, pp 645-648.

Voorhees, E, 1993, Using WordNet to disambiguate word senses for text retrieval, Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, ACM.

Weaver, W, 1955, Translation, Machine Translation of Languages, John Wiley & Sons, pp 15-23.

Xu, G, Yang, Y, & Li, H, 2009, Named entity mining from click-through data using weakly supervised latent dirichlet allocation, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM.





Yarowsky, D, 1995, Unsupervised word sense disambiguation rivaling supervised methods, ACL ’95 Proceedings of the 33rd annual meeting on Association for Computational Linguistics, pp 189-196.

Zhang, T, Ramakrishnan, R, Livny, M, 1996, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, pp 103-114.








7 Appendix

7.1 Appendix A: Coselection Count of terms with at least one cluster

Note: Words are in most logical order but order is unimportant in clustering

| Most clusters (not above) | Coselection total weights | Coselection edges | URLs | Clusters |
|---|---|---|---|---|
| data protection act | 385 | 340 | 14 | 1 |
| c++ connect 4 | 294 | 237 | 5 | 1 |
| free sound effects | 125 | 97 | 6 | 2 |
| teeside | 25 | 16 | 7 | 1 |
| python time difference | 37 | 26 | 3 | 1 |
| pydoc | 22 | 17 | 3 | 1 |
| teeside university | 38 | 26 | 7 | 1 |
| bridge transporter | 10 | 8 | 3 | 1 |
| wet n wild | 3 | 1 | 3 | 1 |
| python dictionary | 96 | 87 | 3 | 1 |
| spaceship .wav | 6 | 5 | 4 | 1 |
| c++ | 11 | 6 | 5 | 1 |
| xsi tutorials | 25 | 17 | 3 | 1 |
| teeside internet | 13 | 10 | 4 | 1 |
| sdl_close | 10 | 8 | 3 | 1 |
| python string contains | 11 | 8 | 3 | 1 |
| sci entertainment quote | 12 | 7 | 3 | 1 |
| fur affinity | 9 | 4 | 4 | 1 |
| sound effects | 119 | 114 | 3 | 1 |
| gp2x | 11 | 7 | 3 | 1 |
| game piracy | 17 | 15 | 3 | 1 |
| games age rating | 17 | 13 | 3 | 1 |
| “cavazza marc” or “marc cavazza” | 7 | 4 | 4 | 1 |
| games piracy | 78 | 69 | 3 | 1 |
| set xsl variable | 12 | 10 | 3 | 1 |
| reference to undefined | 9 | 6 | 3 | 1 |
| smashing magazine | 15 | 9 | 3 | 1 |
| pakistan news | 68 | 12 | 17 | 1 |
| boom toon tutorials | 26 | 16 | 3 | 1 |
| pro gaming teams | 22 | 18 | 3 | 1 |
| textures | 23 | 16 | 4 | 1 |
| don’t stop me now midis | 52 | 47 | 4 | 1 |
| pegi | 12 | 7 | 5 | 1 |
| c++ ternary operator | 24 | 14 | 6 | 1 |
| pound euro | 3 | 1 | 3 | 1 |
| avfc | 6 | 1 | 6 | 1 |
| c++ string | 15 | 8 | 6 | 1 |
| imbd | 4 | 2 | 3 | 1 |
| wwe spoilers | 6 | 4 | 3 | 1 |
| tees.ac.uk | 47 | 44 | 4 | 1 |
| teeside uni | 19 | 16 | 3 | 1 |
| messenger web | 14 | 10 | 4 | 1 |
| photo portfolio | 43 | 33 | 3 | 1 |
| hotmail | 45 | 41 | 3 | 1 |
| 1998 data protection act | 24 | 13 | 5 | 1 |
| linux commands | 9 | 5 | 3 | 1 |
| zero punctuation | 8 | 6 | 3 | 1 |
| blackboard tees | 5 | 3 | 3 | 1 |
| sdl sound | 9 | 6 | 3 | 1 |
| teeside uni | 13 | 8 | 4 | 1 |
| free textures | 24 | 20 | 3 | 1 |
| sdl_mustlock | 24 | 20 | 3 | 1 |
| c++ random int | 24 | 14 | 3 | 1 |
| arm.linux.rules | 5 | 3 | 3 | 1 |
| sdl | 9 | 6 | 3 | 1 |
| SUM | 2000 | 1561 | 227 | 56 |


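7.2 Appendix B: Accuracy of cluster disparate on URLs with an existing cluster

| Search Term | Amount of positive URLs | Amount of false positive URLs | Amount of borderline / unknown URLs | Potential URLs | URLs with too many months in common | URLs with no months in common |
|---|---|---|---|---|---|---|
| free sound effect | 2 | 0 | 0 | 16 | 0 | 0 |
| teeside | 1 | 0 | 0 | 9 | 2 | 0 |
| python time difference | 4 | 0 | 0 | 6 | 0 | 0 |
| pydoc | 2 | 0 | 0 | 5 | 0 | 1 |
| teeside university | 1 | 0 | 0 | 17 | 2 | 0 |
| bridge transporter | 0 | 0 | 0 | 6 | 0 | 0 |
| wet n wild | 0 | 0 | 0 | 0 | 0 | 0 |
| python dictionary | 3 | 0 | 0 | 24 | 0 | 0 |
| spaceship .wav | 0 | 0 | 0 | 2 | 0 | 0 |
| c++ | 1 | 0 | 0 | 4 | 1 | 0 |
| xsi tutorials | 1 | 0 | 0 | 9 | 0 | 0 |
| teeside intranet | 0 | 0 | 0 | 9 | 0 | 0 |
| sdl_close | 2 | 0 | 0 | 3 | 0 | 1 |
| python string contains | 1 | 0 | 0 | 4 | 0 | 0 |
| sci entertainment quote | 2 | 0 | 0 | 5 | 0 | 1 |
| fur affinity | 0 | 0 | 0 | 1 | 1 | 0 |
| sound effects | 1 | 0 | 0 | 25 | 0 | 0 |
| gp2x | 1 | 0 | 0 | 4 | 1 | 0 |
| game piracy | 0 | 0 | 0 | 5 | 0 | 0 |
| games age ratings | 2 | 0 | 0 | 6 | 0 | 0 |
| sdl_init | 0 | 0 | 0 | 3 | 0 | 3 |
| “cavazza marc” or “marc cavazza” | 0 | 0 | 0 | 3 | 0 | 1 |
| games piracy | 1 | 0 | 0 | 16 | 0 | 0 |
| xsl set variable | 0 | 0 | 0 | 4 | 0 | 0 |
| reference to undefined | 1 | 0 | 0 | 0 | 0 | 0 |
| smashing magazine | 0 | 0 | 0 | 2 | 0 | 0 |
| pakistan news | 0 | 0 | 0 | 1 | 0 | 0 |
| boom toon tutorials | 1 | 0 | 0 | 6 | 0 | 2 |
| data protection act | 2 | 0 | 0 | 17 | 0 | 20 |
| pro gaming teams | 1 | 0 | 0 | 8 | 0 | 0 |
| textures | 1 | 0 | 0 | 7 | 0 | 0 |
| don’t stop me now midis | 0 | 0 | 0 | 10 | 0 | 0 |
| pegi | 0 | 0 | 0 | 5 | 0 | 0 |
| c++ ternary operator | 1 | 0 | 0 | 4 | 0 | 0 |
| c++ connect 4 | 6 | 0 | 0 | 23 | 1 | 0 |
| euro pound | 0 | 0 | 0 | 0 | 0 | 0 |
| avfc | 0 | 0 | 0 | 0 | 0 | 1 |
| c++ string | 1 | 0 | 0 | 6 | 0 | 0 |
| imbd | 0 | 0 | 0 | 2 | 0 | 0 |
| wwe spoilers | 0 | 0 | 0 | 2 | 0 | 0 |
| tees.ac.uk | 0 | 0 | 0 | 14 | 0 | 0 |
| teeside uni | 1 | 0 | 0 | 9 | 0 | 0 |
| web messenger | 0 | 0 | 0 | 9 | 2 | 0 |
| photo portfolio | 2 | 0 | 0 | 8 | 0 | 0 |
| hotmail | 1 | 0 | 0 | 4 | 2 | 15 |
| data protection act 1998 | 0 | 0 | 0 | 2 | 0 | 0 |
| linux commands | 0 | 0 | 0 | 1 | 1 | 2 |
| zero punctuation | 0 | 0 | 0 | 4 | 0 | 4 |
| tees blackboard | 0 | 0 | 0 | 1 | 1 | 1 |
| sound sdl | 1 | 0 | 0 | 3 | 0 | 0 |
| teeside uni | 0 | 0 | 0 | 7 | 1 | 0 |
| free textures | 2 | 0 | 0 | 8 | 0 | 0 |
| sdl_mustlock | 1 | 0 | 1 | 7 | 0 | 0 |
| c++ random int | 0 | 0 | 0 | 2 | 0 | 0 |
| arm.linux.rules | 0 | 0 | 0 | 1 | 0 | 2 |
| sdl | 0 | 1 | 0 | 4 | 0 | 0 |
| SUM | 47 | 1 | 1 | 363 | 15 | 54 |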


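7.3 Appendix C: URL distribution

For links not in a cluster of (3 epsilon, 2 minimum nodes)

| Most coselections | No. of links | No. never active with other links | No. active with other links in only 1 month iteration | No. active with other links in only 2 month iterations | No. active with other links in more than 2 month iterations | No. active in only 1 month iteration | No. active in only 2 month iterations | No. active in more than 2 month iterations |
|---|---|---|---|---|---|---|---|---|
| data protection act | 42 | 0 | 34 | 2 | 1 | 34 | 2 | 1 |
| c++ connect 4 | 33 | 0 | 23 | 0 | 1 | 23 | 0 | 1 |
| free sound effects | 23 | 0 | 15 | 1 | 0 | 15 | 1 | 0 |
| teeside | 14 | 0 | 7 | 2 | 2 | 7 | 2 | 2 |
| python time difference | 8 | 0 | 2 | 4 | 0 | 2 | 4 | 0 |
| pydoc | 8 | 0 | 4 | 2 | 0 | 4 | 2 | 0 |
| teeside university | 22 | 0 | 17 | 0 | 2 | 17 | 0 | 2 |
| bridge transporter | 8 | 0 | 6 | 0 | 0 | 6 | 0 | 0 |
| wet n wild | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| python dictionary | 27 | 0 | 21 | 3 | 0 | 19 | 4 | 0 |
| spaceship .wav | 4 | 0 | 2 | 0 | 0 | 2 | 0 | 0 |
| c++ | 7 | 0 | 4 | 1 | 0 | 4 | 0 | 1 |
| xsi tutorials | 13 | 0 | 7 | 3 | 0 | 7 | 3 | 0 |
| teeside internet | 11 | 0 | 8 | 1 | 0 | 8 | 1 | 0 |
| sdl_close | 6 | 0 | 4 | 0 | 0 | 4 | 0 | 0 |
| python string contains | 6 | 0 | 3 | 1 | 0 | 3 | 1 | 0 |
| sci entertainment quote | 8 | 1 | 5 | 0 | 0 | 5 | 1 | 0 |
| fur affinity | 4 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| sound effects | 28 | 0 | 20 | 5 | 1 | 20 | 5 | 1 |
| gp2x | 7 | 1 | 2 | 1 | 1 | 3 | 1 | 1 |
| game piracy | 7 | 0 | 5 | 0 | 0 | 5 | 0 | 0 |
| games age rating | 8 | 0 | 4 | 2 | 0 | 4 | 2 | 0 |
| “cavazza marc” or “marc cavazza” | 4 | 1 | 3 | 0 | 0 | 3 | 1 | 0 |
| games piracy | 18 | 0 | 16 | 0 | 0 | 16 | 0 | 0 |
| set xsl variable | 6 | 0 | 3 | 1 | 0 | 3 | 1 | 0 |
| reference to undefined | 4 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| smashing magazine | 5 | 0 | 2 | 0 | 0 | 2 | 0 | 0 |
| pakistan news | 6 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| boom toon tutorials | 11 | 1 | 7 | 0 | 0 | 8 | 0 | 0 |
| pro gaming teams | 10 | 0 | 8 | 0 | 0 | 8 | 0 | 0 |
| textures | 10 | 0 | 5 | 2 | 0 | 5 | 2 | 0 |
| don’t stop me now midis | 13 | 0 | 10 | 0 | 0 | 10 | 0 | 0 |
| pegi | 7 | 0 | 4 | 1 | 0 | 4 | 1 | 0 |
| c++ ternary operator | 7 | 0 | 4 | 0 | 0 | 4 | 0 | 0 |
| pound euro | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| avfc | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| c++ string | 8 | 0 | 5 | 1 | 0 | 5 | 1 | 0 |
| imbd | 4 | 0 | 2 | 0 | 0 | 2 | 0 | 0 |
| wwe spoilers | 4 | 0 | 2 | 0 | 0 | 2 | 0 | 0 |
| tees.ac.uk | 16 | 0 | 13 | 1 | 0 | 13 | 0 | 1 |
| teeside uni | 12 | 0 | 9 | 0 | 1 | 9 | 0 | 1 |
| messenger web | 13 | 1 | 8 | 0 | 2 | 9 | 0 | 2 |
| photo portfolio | 10 | 0 | 8 | 0 | 0 | 8 | 0 | 0 |
| hotmail | 23 | 1 | 14 | 1 | 5 | 15 | 1 | 5 |
| 1998 data protection act | 6 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| linux commands | 6 | 1 | 1 | 0 | 2 | 2 | 0 | 1 |
| zero punctuation | 10 | 0 | 7 | 0 | 1 | 7 | 0 | 1 |
| blackboard tees | 5 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| sdl sound | 5 | 0 | 2 | 1 | 0 | 2 | 1 | 0 |
| teeside uni | 11 | 0 | 6 | 1 | 1 | 6 | 1 | 1 |
| free textures | 10 | 0 | 6 | 0 | 2 | 6 | 0 | 2 |
| sdl_mustlock | 9 | 0 | 6 | 1 | 0 | 6 | 1 | 0 |
| c++ random int | 6 | 0 | 2 | 0 | 0 | 2 | 0 | 0 |
| arm.linux.rules | 5 | 0 | 3 | 0 | 0 | 3 | 0 | 0 |
| sdl | 6 | 0 | 3 | 1 | 0 | 3 | 1 | 0 |
| SUM | 572 | 8 | 357 | 42 | 24 | 360 | 43 | 26 |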