

A Novel Approach to Link Generation

Shawn Jiang

A dissertation submitted in partial fulfillment of the requirements for the degree of Bachelor of Science (Honours) in Computer Science, The University of Auckland, 2010.

Abstract


People create hyperlinks inconsistently, and the information that the hyperlinks point to may not be useful to users who are looking for specific information. One of the most effective ways to resolve this issue is to use an automated link generator to fix these broken links. In this dissertation, we introduce an approach to generate hyperlinks automatically based on semantic relatedness between anchor terms and documents. We have used Wikipedia as the word sense disambiguation corpus because Wikipedia contains a vast amount of information and a tight structure, which makes it a promising source for extracting semantic relationships. In addition, the hundreds of thousands of manually created hyperlinks in the corpus provide us with the ground truth for evaluation. We show that our approach, based on Semantic Related Difference, outperforms TF-IDF (Term Frequency-Inverse Document Frequency).

































1. Introduction

One of the most used features of the World Wide Web (WWW) is the hyperlink. Hyperlinks are everywhere on the Web. Web pages contain hyperlinks to reference other material, enabling people to follow links between pieces of information. However, Green [1] stated in his paper that people create hyperlinks inconsistently, and consequently the information that the hyperlinks point to may not be useful to users who are looking for specific information. He also claimed that using computer programs to generate hyperlinks automatically would mitigate the issue.


In this dissertation, we propose a new method to generate hyperlinks automatically from an orphan document¹ to appropriate documents based on lexical and semantic relatedness. We have chosen Wikipedia as the example corpus because of its size and structure. Wikipedia is the largest and most comprehensive free online encyclopedic knowledge base on the WWW. It contains 16 million articles (more than 3.3 million in English) contributed by active volunteers around the world [2]. Every article contains manually created hyperlinks that connect the most important terms to other pages, enabling readers to navigate from one article to another. This provides readers with a quick way of accessing relevant information. Currently, Wikipedia contributors create links manually following the Wikipedia "Manual of Style". The process of creating these hyperlinks is time consuming and tedious for the contributors. It would be useful if there were an automated mechanism that generated the hyperlinks for the contributors, so that they could put their effort into more important things such as writing better article content.


Figure 1 shows our automated link generation flow chart. Generating links from an orphan document involves two steps: anchor detection and word disambiguation [3]. Due to time limitations and our interest in word disambiguation, our work focused on the word disambiguation step.

¹ An orphan document is an unstructured document that does not have incoming or outgoing hyperlinks and is not related to any other documents.

[Figure 1 shows the processing stages: Input Text, Sanitizing, Clean Text, Keyword Extraction, Text with Keywords, Word Disambiguation (TF-IDF and SRD, Semantic Related Difference), Output document with hyperlinks.]

Figure 1: Flow chart for automated link generation


One approach is based on a well-known information retrieval algorithm, Term Frequency-Inverse Document Frequency (TF-IDF) [4]. To get a better understanding of how the algorithm works, we implemented the algorithm and its related indices. We then devised a better link generation technique based on TF-IDF, called Semantic Related Difference (SRD).


The report is organized as follows. In Section 2, we describe the corpus we used and why we chose it. We discuss related work in Section 3. In Section 4 we describe a technique that has been used previously, TF-IDF, and explain how the indices are created. In Section 5, we present our approach with an example. We then evaluate our approach and discuss the results in Section 6. Finally, in Section 7, the report concludes with the strengths and limitations of our approach and outlines possible future improvements.


2. The Wikipedia Corpus

For our experiments, we have used the English collection of the Wikipedia XML corpus [5] created by Ludovic Denoyer and Patrick Gallinari. The corpus is based on the Wikipedia 2006 dump. It is about 4600 MB in size and consists of 659,388 UTF-8 encoded XML files. Each file corresponds to a Wikipedia article, and the filenames correspond to the unique IDs of the documents (for example, 12345.xml). On average, the size of one article is 7261 bytes. The corpus only includes Wikipedia articles, so documents such as "Talk" and "Template" pages are not included.


Wikipedia contributors use "Wikitext" (the markup language used in Wikipedia for editing articles) when they write articles. For example, ==Section 1== represents a heading with the text Section 1. However, Wikitext has some drawbacks. According to the W3C XML standard [6], characters like = cannot be used as the start character of tag names, and Wikitext is not self-explanatory. Therefore, to avoid these problems, the corpus replaces Wikitext with corresponding meaningful XML tags.


The following fragment, taken from document 60871.xml, gives an example of what an article looks like in the corpus. We are mostly interested in the collectionlink tag in every document, because a collectionlink tag represents a hyperlink created by a human. The value between the <collectionlink> and </collectionlink> tags is the term that Wikipedia contributors have chosen to create a link from, that is, the anchor term. The value of the attribute xlink:href is the destination document to which the anchor term links. For example, in the document fragment below, the word light has been linked to document 17939.xml. (For a complete document sample, please refer to Appendix I.)


<?xml version="1.0" encoding="UTF-8"?>
<article>
  <name id="60871">Luminescence</name>
  <conversionwarning>0</conversionwarning>
  <body>
    <emph3>Luminescence</emph3> is
    <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
        xlink:href="17939.xml">light</collectionlink>
    not generated by high temperatures alone.
    <p>It is different from
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="213835.xml">incandescence</collectionlink>,
      so it usually occurs at low temperatures. Examples: fluorescence,
      bioluminescence and phosphorescence.
    </p><p>
      Luminescence can be caused by chemical or biochemical changes, electrical
      energy, subatomic motions, reactions in crystals, or stimulation of an
      atomic system.
    </p><p>
      The following kinds of luminescence are known to exist:
    </p><normallist>
      <item>
        <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
            xlink:href="291436.xml">Chemoluminescence</collectionlink>
        (including
        <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
            xlink:href="203711.xml">Bioluminescence</collectionlink>)
      </item>
      ………………………
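
To give a concrete sense of how these human-created links can be read programmatically, the following is a minimal sketch (not part of the original system) that extracts anchor terms and their destination documents from one corpus file using Python's standard ElementTree parser; the file path is only an example.

    import xml.etree.ElementTree as ET

    XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

    def extract_links(path):
        """Return (anchor term, destination file) pairs from one corpus article."""
        tree = ET.parse(path)
        links = []
        # collectionlink elements represent the manually created hyperlinks
        for link in tree.iter("collectionlink"):
            anchor = "".join(link.itertext()).strip()   # text between the tags
            target = link.get(XLINK_HREF)               # e.g. "17939.xml"
            links.append((anchor, target))
        return links

    if __name__ == "__main__":
        for anchor, target in extract_links("60871.xml"):
            print(f"{anchor!r} -> {target}")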


Since there are 659,388 XML files in the corpus, it is inefficient to use the operating system's file system to keep track of all of them. To be able to search files more efficiently, we have created a database that indexes all the files in the corpus.


3. Related Work

A lot of research has been carried out into link generation [14, 15, 16, 17, 18]. In this section we discuss only the work most relevant to this dissertation.


Wilkinson and Smeaton [7] pointed out that "To create static links between semantically related text, we can simply calculate the similarity between all pairs of information, and then insert links between those that are most similar." Many link generation approaches therefore work by comparing the semantic relatedness between the input text and a target corpus.


Strube and Ponzetto [8] developed a technique called WikiRelate! to compute semantic relatedness using Wikipedia. Their main focus was to show that Wikipedia can be used as a reliable knowledge base for adding semantic relatedness information to natural language processing applications. They compared Wikipedia to the WordNet corpus on various benchmark datasets. WikiRelate! performs best when evaluated on the largest dataset.



A system called Wikify! was developed by Mihalcea and Csomai [3]. It was the first system to do automated link generation using Wikipedia as a destination for links [9]. The algorithm generates links in two phases. The first phase is called keyword extraction. In this phase, the algorithm detects and picks the most important keywords in the given text. It first extracts candidate terms by selecting all possible n-grams from the input text, then ranks them by one of the three approaches Mihalcea and Csomai investigated, namely TF-IDF, χ², and keyphraseness. The best performing approach was keyphraseness. The keyphraseness of a term is calculated as the number of Wikipedia articles that have already selected the term as an anchor term divided by the number of articles that contain it. The selected keywords are the anchor terms that are sent to the second phase, word sense disambiguation. Since a word can have different meanings in different contexts, disambiguation involves calculating the weight of the words surrounding potential anchor terms and ensures the anchor terms are linked to the most appropriate articles. For each ambiguous term, the algorithm compares a training example from Wikipedia to feature vectors of the term and its local context.


Milne and Witten [9] developed a machine-learning approach that is quite similar to the Wikify! system. The approach again uses two phases to generate a link; however, the order of the phases is reversed. Milne and Witten's approach performs the disambiguation phase first and uses the results to inform keyword extraction. In the disambiguation phase, the algorithm calculates the commonness of a term (the number of times the term is used as a link destination) and its relatedness, which measures the semantic similarity of two Wikipedia pages by comparing their incoming and outgoing links. The algorithm used to calculate the relatedness is based on the Normalized Google Distance [10], which calculates the semantic relatedness of words using the Google search engine. The algorithm then passes the results of the disambiguation to the second phase, link detection. In the link detection phase, the result of the disambiguation is used in a machine-learning approach to filter out n-gram anchor terms in the input document.


Both Mihalcea and Csomai's Wikify! system and Milne and Witten's approach achieve high accuracy. However, both approaches have shortcomings: they require knowledge of existing links and a great deal of preprocessing. The techniques used in both approaches depend on the presence of existing links in the corpus. They take a document corpus and determine which words are often used to link to a relevant related document; where these link-words are found in other documents in the corpus, duplicate links are created. Neither approach can conceive new links; links can only be duplicated from existing ones, which means the approaches cannot be used with orphan documents, which have no pre-existing links. Both approaches are therefore specific to Wikipedia, leveraging Wikipedia's existing hyperlink structure.


4. Background

In this section we discuss the relevant background for this project: the TF-IDF metric, the text pre-processing requirements, and the indices we use.

4.1 TF-IDF

TF-IDF [4, 11] is a classical metric that has been widely used in the information retrieval field. It represents a weight that indicates the importance of a word to a document in a corpus [11]. The idea of the TF-IDF algorithm is that if a word or phrase appears many times in a document but few times in the other documents of a corpus, then the word or phrase has a high weight and is therefore more important. As its name implies, the weight is calculated from two factors, term frequency and inverse document frequency. Term frequency is defined as how many times a term t appears in a document d. It is calculated by

    tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}        (1)

where n_{i,j} is the number of times the query term t_i occurs in document d_j, and the denominator is the sum over all terms in document d_j, that is, the total number of terms in document d_j, |d_j| [11]. A document that contains more occurrences of the term is considered more relevant than others. For example, a document d1 that contains five occurrences of the term is more relevant than a document d2 with only one occurrence. However, relevance does not increase proportionally with term frequency: document d1 is not five times more relevant than document d2.


There is a good chance that common words appear frequently, and those common words would then receive a very high term frequency weight. The second factor, IDF, is introduced to offset the weight of those common words. If |D| is the total number of documents in the corpus and |{d : t_i ∈ d}| is the number of documents in which the term t_i appears, the IDF weight is calculated by:

    idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|}        (2)

The overall TF-IDF weight of a word or phrase is the product of its TF weight and IDF weight:

    tfidf_{i,j} = tf_{i,j} \times idf_i        (3)

A high TF-IDF weight for a term is achieved by a high term frequency and a low document frequency. Thus common terms (which usually have both a high term frequency and a high document frequency) receive a low TF-IDF rank.

For example, consider the word apple appearing five times in a document containing a total of 100 words. From equation (1), the term frequency of apple is 5/100 = 0.05. Assume we have 1 million documents in the corpus and apple appears in one hundred of them. Then the inverse document frequency is log(1,000,000/100) = 4. The TF-IDF weight is 0.05 × 4 = 0.2.
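
As an illustration of equations (1)-(3), the following is a minimal Python sketch (not part of the original implementation; the function names are ours) that reproduces the apple example above.

    import math

    def tf(term_count, doc_length):
        """Equation (1): occurrences of the term divided by total terms in the document."""
        return term_count / doc_length

    def idf(total_docs, docs_with_term):
        """Equation (2): log of total documents over documents containing the term."""
        return math.log10(total_docs / docs_with_term)

    def tfidf(term_count, doc_length, total_docs, docs_with_term):
        """Equation (3): product of term frequency and inverse document frequency."""
        return tf(term_count, doc_length) * idf(total_docs, docs_with_term)

    # The "apple" example from the text: 5 occurrences in a 100-word document,
    # 1,000,000 documents in the corpus, 100 of which contain "apple".
    print(tfidf(5, 100, 1_000_000, 100))   # 0.05 * 4 = 0.2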


In addition to TF and IDF, a third weighting factor, document length, is also useful for calculating term weights in some cases [12]. Longer documents are more likely to be retrieved than shorter ones because they have larger term sets and hence are more likely to match the query terms. The document length factor is usually used as a normalization factor to compensate for the variation in document lengths. However, we have not used this factor in our algorithm.


Although TF-IDF is an efficient algorithm for finding relevant information in a corpus, such a simple algorithm has limitations. One of the biggest problems with TF-IDF is that it does not deal well with polysemes (ambiguous words). TF-IDF might retrieve documents about a word with a completely different meaning from the one the user was looking for. For example, if the user queries the word "dish", it has several meanings and TF-IDF does not know whether the user is asking about a dish as food, a dish used for satellite transmission, or Dish, a town in Texas. Therefore, to get more accurate results in these cases, we need to alter the TF-IDF algorithm so that it is based not only on the importance of a term but also takes into account its nearby words and phrases and makes sure they are semantically related.


4.2 Text Pre-processing / Sanitizing

Sanitizing the input text involves word stemming and stop-word removal. Word stemming reduces inflected words to a common root form so that, for example, "connected", "connecting" and "connection" are all treated as the same term; we use Porter's stemming algorithm for this. Stop-word removal discards extremely common words such as "the", "a", "of" and "is", because they appear in almost every document and therefore carry little weight for distinguishing documents.
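
A minimal sketch of this sanitizing step, assuming the NLTK library (the original implementation may differ):

    import re
    from nltk.stem import PorterStemmer
    from nltk.corpus import stopwords   # requires: nltk.download("stopwords")

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))

    def sanitize(text):
        """Lower-case, keep alphabetic tokens, drop stop words, stem the rest."""
        tokens = re.findall(r"[a-z]+", text.lower())
        return [stemmer.stem(t) for t in tokens if t not in stop_words]

    # Tokens are lower-cased, stop words removed, and the remaining words stemmed.
    print(sanitize("Luminescence is light not generated by high temperatures alone."))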


4.3 Indexing

Indices play an important role in information retrieval. Internet search engines have "spiders" that crawl the World Wide Web to gather and parse information [13]. The purpose of the spiders is to create indices for the Web, which enable the search engines to respond to queries promptly. Just like the search engines, we need index files for our algorithm to make searching more efficient. Without indices, we would have to scan every document in the corpus every time we query a term. The rest of this section describes how the index files are structured and how they are created.


As described in Section 4.1, in order to find a term's TF-IDF value, we need the following four values:

(1) the number of times the term occurs in the document;
(2) the number of words in the document;
(3) the number of documents in the corpus;
(4) the number of documents in the corpus that contain the term.

With these values, we can easily calculate the TF-IDF for any term in any document in a corpus. Since the number of documents in the corpus is known, we already have value (3). However, we would still need to compute (1), (2) and (4) every time we query a term. Therefore we store these values in our index files. When we calculate the TF-IDF value for a term, we can look them up in the indices and do the calculation efficiently.


There are three index files used in our algorithm, namely corpus.terms.index, corpus.docs.index and corpus.terms.subindex. In the corpus.terms.index file, each line corresponds to a unique term in the corpus; there are 1,929,347 indexed terms and hence 1,929,347 lines. Each line starts with the term itself, followed by the number of documents in the corpus containing the term, followed by a repeating pattern of two tokens: the first token is a cumulative document ID (the difference from the previous document ID) and the second token is the number of times the term appears in that document. For example, Figure 2 shows how the word arpanet appears in the corpus.terms.index file.


    arpanet 89 1712 1 676 2 214 2 1198 2 124 1 298 2 400 6 607 4 1337 28 414 1 141 1 332 3
    109 1 110 2 3303 2 739 1 756 4 2149 1 1028 5 49 1 160 2 718 1 70 1 54 2 2261 3 814 1
    2217 2 23 1 1719 1 598 2 107 1 964 1 526 1 2887 1 480 3 4250 1 1967 1 712 2 1225 2 925 1
    821 3 256 1 5082 1 2699 6 8393 1 33449 3 2146 1 2702 1 11324 1 14357 1 1202 2 204 62
    408 1 11518 2 222 1 2697 1 10146 1 6038 4 1039 3 10826 1 1488 1 3217 1 792 1 7283 1
    3585 1 1703 3 4182 1 793 1 8543 1 13097 2 20320 1 6993 1 27430 2 40094 1 21191 1
    27994 2 3490 1 19932 1 38769 1 6377 3 4204 1 55014 3 24517 1 77194 1 24313 1 2803 1
    1188 1 5162 1 30704 2

Figure 2: The word arpanet in the corpus.terms.index file

This line means the term arpanet appears in 89 documents in the Wikipedia corpus. The first document containing arpanet has ID 1712, and arpanet appears only once in that document. The second document containing arpanet has ID 2388 (1712 + 676), where arpanet occurs twice. The pattern continues to the end of the line. From corpus.terms.index we obtain values (1) and (4) for calculating TF-IDF.

The second index file is corpus.docs.index. In this file, each line corresponds to a document in the corpus (an article in the Wikipedia collection). There are 659,388 articles in the corpus and hence 659,388 lines in this file. The format of each line is

    Wikipedia_Corpus_FileName␣Number_Of_Words␣Position_Offset (in bytes)

where ␣ represents a space character. For example, the following line means there are 105 words in document 657439.xml and the first word starts at byte offset 672118412:

    wiki0657439 105 672118412

From corpus.docs.index we obtain value (2).
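
A minimal sketch of how these two index formats can be read back (our own illustration; the actual index code may differ). Note that the document IDs in corpus.terms.index are stored as cumulative differences and must therefore be summed while reading:

    def parse_terms_index_line(line):
        """Parse one corpus.terms.index line: term, document count, (doc_id, occurrences) pairs."""
        tokens = line.split()
        term, doc_count = tokens[0], int(tokens[1])
        postings, doc_id = [], 0
        for delta, count in zip(tokens[2::2], tokens[3::2]):
            doc_id += int(delta)                # IDs are delta-encoded (cumulative)
            postings.append((doc_id, int(count)))
        return term, doc_count, postings

    def parse_docs_index_line(line):
        """Parse one corpus.docs.index line: file name, word count, byte offset."""
        name, n_words, offset = line.split()
        return name, int(n_words), int(offset)

    term, df, postings = parse_terms_index_line("arpanet 89 1712 1 676 2 214 2")
    print(term, df, postings)        # arpanet 89 [(1712, 1), (2388, 2), (2602, 2)]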


From the corpus.terms.index and corpus.docs.index files we have all four values needed to calculate any term's TF-IDF weight in the corpus. The third index file, corpus.terms.subindex, is created to optimize search efficiency; it is used to quickly locate a term in the large corpus.terms.index file. As mentioned earlier, the corpus.terms.index file contains 1,929,347 lines, so scanning it directly may not perform well. The corpus.terms.subindex is an inverted index that helps us find the term in corpus.terms.index; it can be considered an index of the corpus.terms.index file (a two-level index). Although this file does not provide any values directly to the calculation, it is an important part of speeding up the algorithm. For the best performance, we have encoded both corpus.docs.index and corpus.terms.subindex in binary format. Having these indices enables us to calculate the TF-IDF weight for any term efficiently.


5. Our Approach

This section describes the automated link generation algorithm that we propose, which we call Semantic Related Difference (SRD). The SRD algorithm calculates the semantic relatedness of an anchor term (a candidate word or phrase that will be linked to a document) and a target document (the destination of a link), based on the words around the anchor term in the input document and in the target document.


Our approach is based on TF-IDF and uses the same set of indices created for the TF-IDF algorithm. The SRD algorithm takes the top k documents returned by the TF-IDF algorithm and returns the document with the highest SRD weight; we then link the anchor term to that document. Formally, the SRD weight is calculated by the following equation:

    SRD(a, b) = \sum_{i=1}^{n} \frac{\log(|A \cap B_i|)}{\log(|A|)}        (4)

where a is the input document containing the query term, b is a target document from the set of top documents returned by TF-IDF, and n is the number of times the anchor term occurs in document b. A is the set of unique terms within a boundary x of the anchor term in document a, and B_i is the set of unique terms within the boundary x of the i-th occurrence of the anchor term in document b. The boundary x is calculated as x = min(N_a, N_bi) × y%, where N_a is the number of unique terms in document a, N_bi is the number of unique terms in document b_i, and y% is an arbitrary factor between 0.2 and 0.6.

For example, if the number of unique terms in document a is 40, the number of unique terms in document b_1 is 60, and y% is 0.2, then x = 40 × 0.2 = 8. Therefore A is the collection of the 8 unique terms to the left of the anchor term and the 8 unique terms to its right. |A| denotes the number of terms in set A and |A ∩ B_i| denotes the number of terms in the intersection of sets A and B_i. The closer the value of |A ∩ B_i| is to |A|, the more closely the anchor term in a is semantically related to document b. Since the maximum of |A ∩ B_i| is |A|, the maximum value of log(|A ∩ B_i|)/log(|A|) is 1, meaning the anchor term has the same set of unique words within distance x in documents a and b, and therefore they are maximally semantically related.
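
The following is a minimal sketch of equation (4) in Python (our own illustration, with helper names of our choosing; the real implementation works on the pre-built indices rather than raw token lists):

    import math

    def context_window(tokens, position, x):
        """Unique terms within x tokens on each side of the given position."""
        return set(tokens[max(0, position - x):position] + tokens[position + 1:position + 1 + x])

    def srd(tokens_a, tokens_b, anchor, y=0.6):
        """Equation (4): sum over occurrences of the anchor in b of log|A ∩ B_i| / log|A|."""
        x = int(min(len(set(tokens_a)), len(set(tokens_b))) * y)
        # A: context of the anchor term in the input document a
        A = context_window(tokens_a, tokens_a.index(anchor), x)
        weight = 0.0
        for i, token in enumerate(tokens_b):
            if token == anchor:                       # i-th occurrence of the anchor in b
                B_i = context_window(tokens_b, i, x)
                overlap = len(A & B_i)
                if overlap > 1 and len(A) > 1:        # guard against log(0) and log(1) in the denominator
                    weight += math.log(overlap) / math.log(len(A))
        return weight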


The following article, taken from the Wikipedia corpus, is used as an example to illustrate the algorithm described above:

    The German Renaissance, whose influence originated in Italy, started spreading among
    German thinkers in the 15th and 16th century. This was a result of German artists who
    had travelled to Italy to learn more and become inspired by the Renaissance movement.
    Many areas of the arts and sciences were influenced, such as the spread of humanism, to
    the various German states and principalities. There were many advances made in the
    development of new techniques in the fields of architecture, the arts and the sciences.
    By far the most famous German Renaissance-era artist is Albrecht Dürer who is
    well-known for his woodcuts, printmaking and drawings.











In the above article, there are seven links that have been created by Wikipedia contributors. They are human-created links and have been edited many times by many users; hence we consider both the anchor terms and the target documents to be correctly selected.



Our algorithm first sanitizes the input document to strip out stop words and special characters. After text sanitizing, 54 words remain, 45 of them unique. We calculate the TF-IDF weight of each of the 54 words and store each word's top five documents. For instance, the top five documents ranked by TF-IDF for the word humanism are 2841635.xml, 49443.xml, 311263.xml, 14224.xml and 1260600.xml.


Based on the top five documents returned by TF-IDF, we then calculate their SRD weights. For each target document, we need to calculate the value x, the number of unique terms we check on each side of every anchor term. For instance, if x is 10, we check 10 unique words to the left of the anchor term and 10 unique words to its right. To calculate x, we take the number of unique terms in the input text and in each candidate target document, then multiply the minimum of the two by the factor y. In the example above, the number of unique words in the input document is 45 and in the top-ranked TF-IDF document 2841635.xml it is 20, so x = min(45, 20) × 0.6 = 12.


Therefore, we check 12 unique words on each side of the anchor term "humanism" in both the input document and document 2841635.xml. From equation (4), A is the set of unique words within 12 words on each side of the anchor term in the input document and B is the corresponding set of unique terms in document 2841635.xml, so we can calculate the SRD weight between the input document and document 2841635.xml. Similarly, we obtain SRD weights for all top five documents ranked by TF-IDF. Table 1 shows their SRD weights for the term "humanism" and the ranking comparison between TF-IDF and SRD.


    Document        SRD Weight           TF-IDF Rank   SRD Rank
    2841635.xml     0.925512852639037    1             5
    49443.xml       6.24200799178871     2             2
    311263.xml      2.54011903613068     3             3
    14224.xml       20.7760583886198     4             1
    1260600.xml     1.14964684018376     5             4

Table 1: Top five documents for the term "humanism" returned by TF-IDF and their SRD weights


Using the TF-IDF algorithm, we would link the term "humanism" to document 2841635.xml. However, the result is the complete opposite for our SRD algorithm: document 2841635.xml has the lowest SRD weight among the top five TF-IDF ranks. The highest SRD weight belongs to 14224.xml, which was ranked fourth by TF-IDF. Document 14224.xml is also the one that a human linked to "humanism".
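
Putting the two stages together, the overall link-generation step for a single anchor term can be sketched as follows (our own illustration; tfidf_top_k and srd stand for the TF-IDF ranking and the equation (4) computation described above, and are assumed helpers):

    def generate_link(anchor, input_tokens, corpus, k=5, y=0.6):
        """Link an anchor term to the candidate document with the highest SRD weight.

        corpus maps document IDs to token lists; tfidf_top_k and srd are assumed to
        implement the TF-IDF ranking and equation (4) described above.
        """
        candidates = tfidf_top_k(anchor, corpus, k)          # e.g. ["2841635.xml", ...]
        scores = {doc_id: srd(input_tokens, corpus[doc_id], anchor, y)
                  for doc_id in candidates}
        return max(scores, key=scores.get)                   # destination of the new hyperlink

    # For the example above this would return "14224.xml" for the anchor "humanism".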


6. Evaluation

Wikipedia articles are manually edited by Wikipedia contributors, and every link has been carefully chosen to point to the best destination. We therefore believe that the existing hyperlinks in Wikipedia articles point to the most appropriate articles and provide us with the ground truth for evaluation. Accordingly, we compare our approach with TF-IDF in terms of the percentage of generated links that match the human-created links.


We randomly selected 30 articles from the Wikipedia 2007 English corpus. Since we have not included techniques for keyword extraction and gathering n-grams, we evaluate performance only on single-word anchor terms in the selected documents; anchor terms consisting of multiple words or phrases are simply ignored during evaluation. We believe that with a proper n-gram and keyword extraction algorithm applied, the results would not be dissimilar.
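
A minimal sketch of this per-document evaluation (our own illustration; human_links and linker are assumed helpers for reading the existing collectionlink anchors of a document and for running either link-generation algorithm):

    def match_rate(doc_id, corpus, linker):
        """Fraction of single-word human anchors whose generated target matches the human one."""
        gold = {anchor: target for anchor, target in human_links(doc_id)
                if " " not in anchor}                 # multi-word anchors are ignored
        if not gold:
            return 0.0
        hits = sum(1 for anchor, target in gold.items()
                   if linker(anchor, corpus[doc_id]) == target)
        return hits / len(gold)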

Each article was run through both the TF-IDF and the SRD algorithms. The results are shown in Table 2.




    Document        Hyperlinks     TF-IDF matches   TF-IDF % match   SRD matches   SRD % match
    424499.xml          19                2             10.53%            6           31.58%
    1027282.xml         14                0              0%               1            7.14%
    66610.xml            8                2             25%               3           37.5%
    1823934.xml         44                3              6.82%            9           20.45%
    1045640.xml          6                0              0%               2           33.33%
    1525528.xml         13                0              0%               2           15%
    390568.xml           5                0              0%               1           20%
    2160116.xml          5                0              0%               1           20%
    2200414.xml          2                1             50%               1           50%
    1046012.xml          8                0              0%               1           12.5%
    75649.xml           56                3              5.36%            8           14.29%
    447039.xml          15                2             13.33%            1            6.67%
    11012.xml           31                0              0%               2            6.45%
    1952531.xml          8                0              0%               1           12.5%
    2342457.xml          6                1             16.67%            1           16.67%
    1797224.xml         12                2             16.67%            3           25%
    650350.xml           8                0              0%               2           25%
    2475753.xml          4                1             50%               1           50%
    83048.xml           19               10             52.63%           12           63.16%
    425375.xml          25                4             16%               7           46.67%
    1741615.xml         15                1              6.67%            4           26.67%
    35791.xml           12                2             16.67%            6           50%
    2331851.xml          5                2             40%               1           20%
    66203.xml           41                6             14.29%            9           21.95%
    2245149.xml          7                1              0%               1           14.29%
    2035790.xml          5                2             40%               2           40%
    849246.xml          13                0              0%               2           15.38%
    266302.xml           4                1             25%               3           75%
    3147718.xml          6                0              0%               1           16.67%
    1025371.xml         14                1              7.14%            2           14.29%
    Total / Average    430               47             10.93%           96           22.33%

Table 2: Results of the TF-IDF and SRD algorithms against thirty randomly selected Wikipedia documents. For each document, the table shows the number of human-created hyperlinks and, for each algorithm, the number and percentage of generated links that match the human-created ones.


Table 2 clearly shows that SRD outperforms the TF-IDF algorithm in most cases. On average, 10.93% of the links generated by TF-IDF match the links created by Wikipedia contributors, whereas SRD achieves 22.33%, more than twice the ratio of TF-IDF.



Since
SRD

is based on
TF
-
IDF
, it
was

interesting to
find

that two of the documents had
better results using
TF
-
IDF

than
SRD
.

For example,
Berlinerisch

and
Brandenburgisch

are two types of dialects spoken in Germany and the words originated from Deutsch.
The links for these two terms match human
generated links using
TF
-
IDF but not
using
our
SRD

algorithm
.

This indicates that for terms that do not
natively
suit the
context (for example, foreign words or transliteration words)
, calculating the weight
based on term frequency and document frequency gives better results than semantic
relatedness.


7. Conclusion and Future Work

Link generation has been a popular research topic in the information retrieval field for many years. In this dissertation, we investigated Semantic Related Difference, an approach to calculating the semantic relatedness between anchor terms and documents, which can be used to generate hyperlinks automatically from unstructured input text. Our measure outperforms TF-IDF when evaluated against 30 randomly selected articles from the Wikipedia corpus.


Although the experiments were carried out on the Wikipedia corpus, our approach is not limited to any particular corpus; the SRD measure can be applied to any corpora. This is one of the key differences between our approach and the Wikify! system and Milne and Witten's approach, both of which are specific to Wikipedia and require heavy preprocessing to obtain prior knowledge about the Wikipedia corpus.


Also, while the experiments used the English Wikipedia as the corpus, SRD is in theory language independent. The German-word example in Section 6 indicates that the algorithm does not perform well when multiple languages are mixed in one context. It would be interesting to see the results of applying SRD to other languages.


As mentioned before, this project not only benefits Wikipedia contributors by saving their time, but can also be used to fix broken hyperlinks on web pages by finding appropriate pages for the text to link to.


Our future work includes incorporating a keyword extraction algorithm, which involves preprocessing the input text (tokenizing, tagging, noun chunking, gathering n-grams, etc.). We will also focus on applying SRD to various corpora (i.e. not just Wikipedia) for further investigation.





References


[1] Green, S. J. (1998). Automated link generation: Can we do better than term repetition? In Proceedings of the 7th International World-Wide Web Conference, 1998.

[2] Wikipedia. (n.d.). In Wikipedia. Retrieved from http://en.wikipedia.org/wiki/Wikipedia

[3] Mihalcea, R. and Csomai, A. (2007). Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM'07), Lisbon, Portugal, pp. 233-242.

[4] Salton, G. and Buckley, C. (1987). Term weighting approaches in automatic text retrieval. Tech Report 87-881, Department of Computer Science, Cornell University.

[5] Denoyer, L. and Gallinari, P. (2006). The Wikipedia XML Corpus. SIGIR Forum, 40(1), June 2006, 64-69.

[6] W3C. (2010). Extensible Markup Language (XML) 1.0 (Fifth Edition). Retrieved from http://www.w3.org/TR/REC-xml/

[7] Wilkinson, R. and Smeaton, A. F. (1999). Automatic link generation. ACM Computing Surveys (CSUR), 1999.

[8] Strube, M. and Ponzetto, S. P. (2006). WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the American Association for Artificial Intelligence, Boston, MA, 2006.

[9] Milne, D. and Witten, I. H. (2008). Learning to link with Wikipedia. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM'2008), Napa Valley, CA.

[10] Cilibrasi, R. L. and Vitanyi, P. M. B. (2007). The Google Similarity Distance. IEEE Transactions on Knowledge and Data Engineering, 19(3), 370-383.

[11] TF-IDF. (n.d.). In Wikipedia. Retrieved from http://en.wikipedia.org/wiki/Tfidf

[12] Singhal, A. (2001). Modern information retrieval: a brief overview. IEEE Data Engineering Bulletin, Special Issue on Text and Databases, 24(4), Dec. 2001.

[13] Chen, H., Chung, Y. M. and Ramsey, M. (1997). Intelligent spider for Internet searching.

[14] Fahmy, E. and Barnard, D. T. (1990). Adding hypertext links to an archive of documents. The Canadian Journal of Information Science, 15(3), 25-41.

[15] Tebbutt, J. (1999). User evaluation of automatically generated semantic hypertext links in a heavily used procedural manual. Information Processing and Management, 35(1), 1-18.

[16] Green, S. J. (1997). Automatically generating hypertext by computing semantic similarity. University of Toronto.

[17] Allan, J. (1997). Building hypertext using information retrieval. Information Processing and Management, 33(2), 145-159.

[18] Agosti, M. and Allan, J. (1997). Methods and tools for the construction of hypertext. Information Processing and Management, 33(2), 129-271.









Appendix I - An example document in the Wikipedia corpus


<?xml version="1.0" encoding="utf-8"?>
<article>
  <name id="17941">Liquid</name>
  <conversionwarning>2</conversionwarning>
  <body>
    <template name="otheruses"></template>
    <figure>
      <image xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="../pictures/WaterAndFlourSuspensionLiquid.jpg" id="-1"
          xlink:actuate="onLoad" xlink:show="embed">WaterAndFlourSuspensionLiquid.jpg</image>
      <caption>A liquid will assume the shape of its container.</caption>
    </figure>
    <p>A <emph3>liquid</emph3> (a
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="23637.xml">phase of matter</collectionlink>) is a
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="1083844.xml">fluid</collectionlink> whose
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="32498.xml">volume</collectionlink> is fixed under conditions of constant
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="30357.xml">temperature</collectionlink> and
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="23619.xml">pressure</collectionlink>; and, whose shape is usually
      determined by the container it fills. Furthermore, liquids exert pressure on the
      sides of a container as well as on anything within the liquid itself; this pressure
      is transmitted undiminished in all directions.</p>
    If a liquid is at rest in a uniform gravitational field, the
    <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
        xlink:href="23619.xml">pressure</collectionlink> <math>p</math> at any point is given by
    <indentation1><math>p=\rho gz</math></indentation1>
    <p>where <math>\rho</math> is the
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="8429.xml">density</collectionlink> of the liquid (assumed constant) and
      <math>z</math> is the depth of the point below the surface. Note that this formula
      assumes that the pressure <emph3>at</emph3> the free surface is zero,
      <emph2>relative</emph2> to the surface level.</p>
    <p>Liquids have traits of
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="113302.xml">surface tension</collectionlink> and
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="1119771.xml">capillarity</collectionlink>; they generally expand when
      heated, and contract when cooled. Objects immersed in liquids are subject to the
      phenomenon of
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="245982.xml">buoyancy</collectionlink>.</p>
    <p>Liquids at their respective
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="748873.xml">boiling point</collectionlink> change to
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="12377.xml">gas</collectionlink>es, and at their
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="40283.xml">freezing point</collectionlink>s, change to a
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="27726.xml">solid</collectionlink>s. Via
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="213614.xml">fractional distillation</collectionlink>, liquids can be
      separated from one another as they
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="10303.xml">vaporise</collectionlink> at their own individual boiling
      points. Cohesion between
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="19555.xml">molecule</collectionlink>s of liquid is insufficient to
      prevent those at free surface from
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="10303.xml">evaporating</collectionlink>.</p>
    <p>It should be noted that
      <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
          xlink:href="12581.xml">glass</collectionlink> at normal temperatures is
      <emph2>not</emph2> a "supercooled liquid", but a solid. See the article on glass
      for more details.</p>
    <section>
      <title>See also</title>
      <normallist>
        <item><collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
            xlink:href="1314474.xml">List of phases of matter</collectionlink></item>
        <item><unknownlink src="ripple">ripple</unknownlink></item>
        <item><collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
            xlink:href="37379.xml">specific gravity</collectionlink></item>
        <item><collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
            xlink:href="1538336.xml">liquid dancing</collectionlink></item>
      </normallist>
      <template name="template phase of matter"></template>
      <languagelink lang="ar">سائل</languagelink>
      <languagelink lang="bg">Течност</languagelink>
      <languagelink lang="ca">Líquid</languagelink>
      <languagelink lang="cs">Kapalina</languagelink>
      <languagelink lang="de">Flüssigkeit</languagelink>
      <languagelink lang="es">Líquido</languagelink>
      <languagelink lang="eo">Likvaĵo</languagelink>
      <languagelink lang="fr">Liquide</languagelink>
      <languagelink lang="ko">액체</languagelink>
      <languagelink lang="io">Liquido</languagelink>
      <languagelink lang="id">Cairan</languagelink>
      <languagelink lang="is">Vökvi</languagelink>
      <languagelink lang="it">Liquido</languagelink>
      <languagelink lang="he">נוזל</languagelink>
      <languagelink lang="ms">Cecair</languagelink>
      <languagelink lang="nl">Vloeistof</languagelink>
      <languagelink lang="ja">液体</languagelink>
      <languagelink lang="nn">Væske</languagelink>
      <languagelink lang="pl">Płyn</languagelink>
      <languagelink lang="pt">Líquido</languagelink>
      <languagelink lang="ru">Жидкость</languagelink>
      <unknownlink src="simple:Liquid">simple:Liquid</unknownlink>
      <languagelink lang="sl">Kapljevina</languagelink>
      <languagelink lang="sv">Vätska</languagelink>
      <languagelink lang="zh">液体</languagelink>
    </section>&lt;/&gt;</body>
</article>