
A Novel Approach to Link Generation

Shawn Jiang

A dissertation submitted in partial fulfillment of the requirements for the degree of Bachelor of Science (Honours) in Computer Science, The University of Auckland, 2010.

Abstract

People create hyperlinks inconsistently, and the information that the hyperlinks point to may not be useful to users who are looking for specific information. One of the most effective ways to resolve this issue is to use an automated link generator to fix these broken links. In this dissertation, we introduce an approach to generate hyperlinks automatically based on the semantic relatedness between anchor terms and documents. We have used Wikipedia as the word sense disambiguation corpus because Wikipedia contains a vast amount of information and a tight structure, which makes it a promising source for extracting semantic relationships. Also, the hundreds of thousands of manually created hyperlinks in the corpus provide us with the ground truth for evaluation. We show that our approach, based on Semantic Related Difference, outperforms TF-IDF (Term Frequency-Inverse Document Frequency).
1. Introduction

One of the most used features of the World Wide Web (WWW) is hyperlinks. Hyperlinks are everywhere on the Web. Web pages contain hyperlinks to reference other material, enabling people to follow links between pieces of information. However, Green [1] stated in his paper that people create hyperlinks inconsistently, and consequently the information that the hyperlinks point to may not be useful to users who are looking for specific information. He also claimed that using computer programs to generate hyperlinks automatically would mitigate the issue.

In this dissertation, we propose a new method to generate hyperlinks automatically from an orphan document¹ to appropriate documents based on lexis and semantic relatedness. We have chosen Wikipedia as the example corpus because of its size and structure. Wikipedia is the largest and most comprehensive free online encyclopedia on the WWW. It contains 16 million articles (more than 3.3 million in English) contributed by active volunteers around the world [2]. Every article contains manually created hyperlinks that connect the most important terms to other pages, enabling readers to navigate from one article to another. This provides readers with a quick way of accessing relevant information. Currently, Wikipedia contributors create links manually following the Wikipedia "manual of style". The process of creating these hyperlinks is time consuming and tedious for the contributors. It would be useful if there were an automated mechanism that generated the hyperlinks for the contributors, so that they could put their effort into more important things, such as writing better article content.

Figure 1 shows the automated link generation flow chart. Generating links from an orphan document involves two steps: anchor detection and word disambiguation [3]. Due to time limitations and our interest in word disambiguation, our work focuses on the word disambiguation part.


Figure 1: Flow chart for automated link generation
[Stages shown in the figure: Input Text → Sanitizing → Clean Text → Keyword Extraction → Text with Keywords → Word Disambiguation (TF-IDF, SRD (Semantic Related Difference)) → Output document with hyperlinks]


One approach is based on the well-known information retrieval algorithm Term Frequency-Inverse Document Frequency (TF-IDF) [4]. To get a better understanding of how the algorithm works, we have implemented the algorithm and the related indices. We then devised a better link generation technique based on TF-IDF, called Semantic Related Difference (SRD).

¹ An orphan document is an unstructured document that does not have incoming or outgoing hyperlinks and is not related to any other documents.

The report is organized as follows. In Section 2, we discuss the corpus we used and why we chose it. We discuss related work in Section 3. In Section 4, we describe a technique that has been used previously, TF-IDF, and explain how the indices are created. In Section 5, we present our approach by giving an example. We then evaluate our approach and discuss the results in Section 6. Finally, in Section 7, the paper concludes with the strengths and limitations of our approach and outlines possible future improvements.


2. The Wikipedia Corpus

For our experiments, we have used the English collection of the Wikipedia XML corpus [5] created by Ludovic Denoyer and Patrick Gallinari. The corpus is based on the Wikipedia 2006 dump. It is about 4600 MB in size and consists of 659,388 UTF-8 encoded XML files. Each file corresponds to a Wikipedia article. The filenames correspond to the unique IDs of the documents (for example, 12345.xml). On average, the size of an article is 7261 bytes. The corpus only includes Wikipedia articles, so documents such as Talks, Templates, etc. are not included.

Wikipedia contributors use "Wikitext" (the markup language used in Wikipedia for editing articles) when they write articles. For example, ==Section 1== represents a heading with the text "Section 1". However, Wikitext has some drawbacks. According to the W3C XML standard [6], characters like = cannot be used as the start character in tag names; also, Wikitext is not self-explanatory. The corpus has therefore replaced Wikitext with corresponding meaningful XML tags.


The following fragment, taken from document 60871.xml, gives an example of what an article looks like in the corpus. We are mostly interested in the collectionlink tag in every document, because a collectionlink tag represents a hyperlink created by a human. The value between the <collectionlink> and </collectionlink> tags is the term that Wikipedia contributors have chosen to create a link, that is, the anchor term. The value of the attribute xlink:href is the destination document to which the anchor term links. For example, in the document fragment below, the word light has been linked to document 17939.xml. (For a complete document sample, please refer to Appendix I.)


<?xml version="1.0" encoding="UTF-8"?>
<article>
<name id="60871">Luminescence</name>
<conversionwarning>0</conversionwarning>
<body>
<emph3>Luminescence</emph3> is <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="17939.xml">light</collectionlink> not generated by high temperatures alone.
<p>It is different from <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="213835.xml">incandescence</collectionlink>, so it usually occurs at low temperatures. Examples: fluorescence, bioluminescence and phosphorescence.
</p><p>Luminescence can be caused by chemical or biochemical changes, electrical energy, subatomic motions, reactions in crystals, or stimulation of an atomic system.
</p><p>The following kinds of luminescence are known to exist:
</p><normallist>
<item>
<collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="291436.xml">Chemoluminescence</collectionlink> (including <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="203711.xml">Bioluminescence</collectionlink>)
</item>
………………………
……………………….
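
To make the role of the collectionlink tag concrete, the following is a minimal sketch (ours, not part of the original implementation) showing how the human-created anchor terms and their destination files could be pulled out of a corpus document with Python's standard ElementTree parser. The file path in the usage comment is only an example.

    # Sketch: extract human-created links (anchor term -> destination file)
    # from one Wikipedia XML corpus document, assuming the layout shown above.
    import xml.etree.ElementTree as ET

    XLINK = "{http://www.w3.org/1999/xlink}"

    def extract_links(path):
        """Return (anchor_term, destination_file) pairs from one corpus article."""
        tree = ET.parse(path)
        links = []
        for link in tree.iter("collectionlink"):
            anchor = (link.text or "").strip()      # e.g. "light"
            target = link.get(XLINK + "href")       # e.g. "17939.xml"
            if anchor and target:
                links.append((anchor, target))
        return links

    # Example usage (hypothetical path into the corpus):
    # extract_links("corpus/60871.xml")
    # -> [('light', '17939.xml'), ('incandescence', '213835.xml'), ...]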


Since there are 659,388 XML files in the corpus, it is inefficient to use the operating system's file system to keep track of all the files. In order to search files more efficiently, we have created a database to index all the files in the corpus.
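
As an illustration of what such a database index might look like, here is a minimal sketch using SQLite; the table name, columns and file naming convention are our assumptions for the example rather than the schema actually used in this project.

    # Sketch: index every corpus file (document ID, path, size) in SQLite so that
    # documents can be located without walking the file system on each query.
    import os
    import sqlite3

    def build_file_index(corpus_dir, db_path="corpus_index.db"):
        conn = sqlite3.connect(db_path)
        conn.execute("""CREATE TABLE IF NOT EXISTS documents (
                            doc_id INTEGER PRIMARY KEY,   -- e.g. 12345 for 12345.xml
                            path   TEXT NOT NULL,
                            size   INTEGER)""")
        rows = []
        for name in os.listdir(corpus_dir):
            if name.endswith(".xml"):
                full = os.path.join(corpus_dir, name)
                rows.append((int(name[:-4]), full, os.path.getsize(full)))
        conn.executemany("INSERT OR REPLACE INTO documents VALUES (?, ?, ?)", rows)
        conn.commit()
        conn.close()

    # build_file_index("corpus/")
    # then: SELECT path FROM documents WHERE doc_id = 60871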


3. Related Work

A lot of research has been carried out into link generation [14, 15, 16, 17, 18, 19]. Wilkinson and Smeaton [7] pointed out that to create static links between semantically related text, we can simply calculate the similarity between all pairs of information, and then insert links between those that are most similar. Many link generation approaches work by comparing the semantic relatedness between the input text and a target corpus.

Strube and Ponzetto [8] developed a technique called WikiRelate! to compute semantic relatedness using Wikipedia. Their main focus was to show that Wikipedia can be used as a reliable knowledge base for adding semantic relatedness information to natural language processing applications. They compared Wikipedia to WordNet on various benchmarking datasets. WikiRelate! performs best when evaluated on the largest dataset.

A system called Wikify! was developed by Mihalcea and Csomai [3]. It was the first system to do automated link generation using Wikipedia as a destination for links [9]. The algorithm generates links in two phases. The first phase is called keyword extraction. In this phase, the algorithm detects and picks the most important keywords in the given text. The algorithm first extracts candidate terms by selecting all possible n-grams from the input text, then ranks them by one of the three approaches Mihalcea and Csomai investigated, namely TF-IDF, χ², and Keyphraseness. The best performing approach was Keyphraseness. The Keyphraseness of a term is calculated as the number of Wikipedia articles that have already selected the term as an anchor term divided by the number of articles that contain it.
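
Written as a formula (our paraphrase of the definition in [3]):

    \mathrm{Keyphraseness}(t) \approx \frac{|D_{key}(t)|}{|D_t|}

where |D_key(t)| is the number of Wikipedia articles in which t has already been chosen as an anchor term and |D_t| is the number of articles that contain t at all.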
The selected keywords are the anchor terms that are sent to the second phase, word sense disambiguation. Since a word can have different meanings in different contexts, disambiguation involves calculating the weight of the words surrounding potential anchor terms and ensures that each anchor term is linked to the most appropriate article. For each ambiguous term, the algorithm compares a training example from Wikipedia to feature vectors of the term and its local context.

Milne and Witten [9] developed a machine-learning approach that is quite similar to the Wikify! system. The approach again uses two phases to generate a link; however, the order of the phases is reversed. Milne and Witten's approach does the disambiguation phase first, and the results are used to inform keyword extraction. In the disambiguation phase, the algorithm calculates the commonness of a term (the number of times the term is used as a link destination) and its relatedness, which measures the semantic similarity of two Wikipedia pages by comparing their incoming and outgoing links. The algorithm used to calculate the relatedness is based on the Normalized Google Distance [10], which calculates the semantic relatedness of words using the Google search engine. The algorithm then passes the results of the disambiguation to the second phase, link detection. In the link detection phase, the result of the disambiguation is used in a machine learning approach to filter out n-gram anchor terms in the input document.

Both Mihalcea and Csomai's Wikify! system and Milne and Witten's approach achieved high accuracy. However, both approaches require a lot of preprocessing work. Also, their approaches are specific to Wikipedia, leveraging Wikipedia's existing hyperlink structure.


4. Background

4.1 TF-IDF

TF-IDF [4, 11] is a classical metric that has been widely used in the information retrieval field; it represents a weight that indicates the importance of a word to a document in a corpus [11]. The idea of the TF-IDF algorithm is that if a word or phrase appears more times in a document but fewer times in other documents in a corpus, then the word or phrase has a high weight and hence is more important. As its name implies, the weight is calculated from two factors, term frequency and inverse document frequency. Term frequency is defined as how many times a term t appears in a document d. It is calculated by

tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}    (1)

where n_{i,j} is the number of times the query term t_i occurred in document d_j, and the denominator is the sum over all terms in document d_j, that is, the total number of terms in the document, |d_j| [11]. A document that contains more occurrences of the term is more relevant than others. For example, a document d1 that contains five occurrences of the term is more relevant than a document d2 with only one occurrence. However, relevance does not increase proportionally with the term frequency. This means that document d1 is not five times more relevant than document d2.

There is a good chance that common words appear frequently, and then those common words have a very high weight in the term frequency factor. The second factor, IDF, is introduced to offset the weight of those common words. If |D| is the total number of documents in the corpus and |{d : t_i ∈ d}| is the number of documents where the term t_i appears, the IDF weight can be calculated by:

idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|}    (2)

The overall TF-IDF weight of a word or phrase is the product of its TF weight and IDF weight.

tfidf_{i,j} = tf_{i,j} \times idf_i    (3)

A high TF-IDF weight of a term is achieved by a high term frequency and a low document frequency. Thus common terms (which usually have high term frequency and high document frequency) have a low TF-IDF rank.

For example, consider the word apple, which appears five times in a document containing a total of 100 words. From equation (1), the term frequency of apple is 5/100 = 0.05. Assume we have 1 million documents in the corpus and apple appears in one hundred of them. Then the inverse document frequency is calculated as log(1000000/100) = 4. The TF-IDF weight is 0.05 × 4 = 0.2.
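
The following short sketch (ours, not the dissertation's implementation) reproduces equations (1) to (3) and the apple example above; a base-10 logarithm is used so the numbers match the worked example.

    # Sketch of equations (1)-(3): TF, IDF and TF-IDF, reproducing the "apple" example.
    import math

    def tf(term_count_in_doc, total_terms_in_doc):
        # Equation (1): n_{i,j} / sum_k n_{k,j}
        return term_count_in_doc / total_terms_in_doc

    def idf(total_docs, docs_containing_term):
        # Equation (2): log(|D| / |{d : t_i in d}|), base 10 to match the example
        return math.log10(total_docs / docs_containing_term)

    def tf_idf(term_count_in_doc, total_terms_in_doc, total_docs, docs_containing_term):
        # Equation (3): tf * idf
        return tf(term_count_in_doc, total_terms_in_doc) * idf(total_docs, docs_containing_term)

    # "apple" appears 5 times in a 100-word document; the corpus has 1,000,000
    # documents and 100 of them contain "apple".
    print(tf_idf(5, 100, 1_000_000, 100))   # 0.05 * 4 = 0.2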


In addition to TF and IDF, a third weighting factor, document length, is also useful for calculating term weights in some cases [12]. Longer documents are more likely to be retrieved than shorter ones, as they have larger term sets and hence are more likely to match the query terms. The document length factor is usually used as a normalization factor to compensate for the variation in document lengths. However, we have not used this factor in our algorithm.


Although TF-IDF is an efficient algorithm for finding relevant information in a corpus, there are some limitations to such a simple algorithm. One of the biggest problems of TF-IDF is that it does not deal with polysemants (ambiguous words) very well. TF-IDF might search for a word which has a completely different meaning from the original word the user was looking for. For example, if the user was querying the word "dish", there are several meanings of it, and TF-IDF does not know whether the user was querying about a dish as food, a dish used for satellite transmission, or Dish, a town in Texas. Therefore, to get more accurate results in these cases, we need some alterations to the TF-IDF algorithm so that it is not only based on the importance of a term but also takes into account its nearby words and phrases, and makes sure they are semantically related.


4.2 Preprocessing


4.3 Indexing

Indices play an important role in Information Retrieval. Internet search engines have "spiders" that crawl around the World Wide Web to gather and parse information [13]. The purpose of having spiders is to create indices for the web, which enable search engines to respond to queries promptly. Just like the search engines, we need index files for our algorithm as well, to make our search more efficient. Without indices, we would have to scan every document in the corpus every time we query a term. The rest of this section describes how the index files are structured and how they are created.

As described in Section 4.1, in order to find a term's TF-IDF value, we need the following four values:

(1) Number of times the term occurs in the document
(2) Number of words in the document
(3) Number of documents in the corpus
(4) Number of documents containing the term in the corpus

With these values, we can easily calculate the TF-IDF for any term in any document in a corpus. Since the number of documents in the corpus is known, we already have value (3). However, we would still need to calculate (1), (2) and (4) every time we query a term. Therefore we store these values in our index files. When we calculate the TF-IDF value for a term, we can look them up in the indices and do the calculation efficiently.

There are three index files used in our algorithm, namely corpus.terms.index, corpus.docs.index and corpus.terms.subindex.

In the corpus.terms.index file, each line corresponds to a unique term in the corpus. There are 1,929,347 indexed terms and hence 1,929,347 lines. Each line starts with the term itself, followed by the number of documents containing the term in the corpus, followed by a pattern of two tokens: the first token is a cumulative sequential document ID and the second token is the number of times the term appears in that document. For example, the word arpanet has the line

arpanet 89 1712 1 676 2 214 2 1198 2 124 1 298 2 400 6 607 4 1337 28 414 1 141 1 332 3 109 1 110 2 3303 2 739 1 756 4 2149 1 1028 5 49 1 160 2 718 1 70 1 54 2 2261 3 814 1 2217 2 23 1 1719 1 598 2 107 1 964 1 526 1 2887 1 480 3 4250 1 1967 1 712 2 1225 2 925 1 821 3 256 1 5082 1 2699 6 8393 1 33449 3 2146 1 2702 1 11324 1 14357 1 1202 2 204 62 408 1 11518 2 222 1 2697 1 10146 1 6038 4 1039 3 10826 1 1488 1 3217 1 792 1 7283 1 3585 1 1703 3 4182 1 793 1 8543 1 13097 2 20320 1 6993 1 27430 2 40094 1 21191 1 27994 2 3490 1 19932 1 38769 1 6377 3 4204 1 55014 3 24517 1 77194 1 24313 1 2803 1 1188 1 5162 1 30704 2
This means the term arpanet appears in 89 documents in the Wikipedia corpus. The first document containing arpanet has ID 1712, and arpanet appears only once in that document. The second document that contains arpanet has ID 2388 (1712 + 676), where arpanet occurred twice. The pattern continues to the end of the line. From corpus.terms.index, we have values (1) and (4) needed to calculate TF-IDF.

The second index file is corpus.docs.index. In this file, each line corresponds to a document in the corpus (an article in the Wikipedia collection). There are 659,388 articles in the corpus and hence 659,388 lines in this file. The format of each line is

Wikipedia_Corpus_FileName␣Number_Of_Words␣Position_Offset (in bytes)²

For example, the following line means there are 105 words in document 657439.xml and the first word starts at byte offset 672118412:

wiki0657439 105 672118412

From corpus.docs.index, we have the value of (2).




² ␣ represents a space character.


From the corpus.terms.index and corpus.docs.index files, we have all four values we need to calculate any term's TF-IDF weight in the corpus. The third index file, corpus.terms.subindex, is created to optimize efficiency during searching. It is used to quickly locate a term in the large corpus.terms.index file. As mentioned earlier, the corpus.terms.index file contains 1,929,347 lines, and scanning it directly may not perform very well. The corpus.terms.subindex is an inverted index that helps us find the term in corpus.terms.index; it can be considered an index of the corpus.terms.index file (a two-level index). Although this file does not provide any values directly to the calculation, it is an important part of speeding up the algorithm. For the best performance, we have encoded both corpus.docs.index and corpus.terms.subindex in binary format. Having these indices enables us to calculate the TF-IDF weight for any term in the most efficient way.
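
To illustrate how the index described above can be read back, here is a minimal sketch (our own, using simple in-memory structures rather than the binary sub-index actually used) that parses one corpus.terms.index line and computes a TF-IDF weight from it together with the word count from corpus.docs.index.

    # Sketch: parse one corpus.terms.index line (term, document frequency, then pairs
    # of cumulative document ID and within-document count) and compute TF-IDF for a
    # given document; words_in_doc comes from corpus.docs.index.
    import math

    TOTAL_DOCS = 659_388   # value (3): number of documents in the corpus

    def parse_terms_index_line(line):
        parts = line.split()
        term, doc_freq = parts[0], int(parts[1])     # value (4)
        postings, doc_id = {}, 0
        for gap, count in zip(parts[2::2], parts[3::2]):
            doc_id += int(gap)                       # IDs are stored as cumulative gaps
            postings[doc_id] = int(count)            # value (1) for that document
        return term, doc_freq, postings

    def tf_idf_from_index(line, doc_id, words_in_doc):
        term, doc_freq, postings = parse_terms_index_line(line)
        tf = postings.get(doc_id, 0) / words_in_doc  # equation (1)
        idf = math.log10(TOTAL_DOCS / doc_freq)      # equation (2)
        return tf * idf                              # equation (3)

    # usage: tf_idf_from_index(arpanet_line, doc_id=2388, words_in_doc=...)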


5. Our Approach

This section describes the automated link generation algorithm that we propose, which we call Semantic Related Difference (SRD). The SRD algorithm calculates the semantic relatedness of the anchor term (a candidate word or phrase that will be linked to a document) and a target document (the destination of a link), based on the words around the anchor term in the input document and in the target document. Our approach is based on TF-IDF and uses the same set of indices that were created for the TF-IDF algorithm. The SRD algorithm returns the document with the highest SRD weight among the top k documents returned by the TF-IDF algorithm. We then link the anchor term to the document with the highest SRD weight.

Formally, the SRD weight is calculated using the following equation:

SRD(a, b) = \frac{1}{n} \sum_{i=1}^{n} \frac{\log(|A \cap B_i|)}{\log(|A|)}    (4)

where a is the input document which contains the query term, b is a target document in the set of top documents returned by TF-IDF, and n is the number of times the anchor term occurs in document b. A is the set of unique terms within a boundary x of the anchor term in document a, and B_i is the set of unique terms within the boundary x of the i-th occurrence of the anchor term in document b. The boundary x is calculated by

x = \min(N_a, N_b) \times y\%

where N_a is the total number of unique terms in document a, N_b is the total number of unique terms in document b, and y% is an arbitrary factor between 0.2 and 0.6.

|A| represents the number of terms in set A, and |A ∩ B| represents the number of terms in the intersection of A and B. The closer the value of |A ∩ B| is to |A|, the more closely the anchor term in a is semantically related to document b. Since the maximum of |A ∩ B| is |A|, the maximum value of log(|A ∩ B_i|)/log(|A|) is 1, meaning the anchor term has the same set of unique words within distance x in documents a and b, and therefore they are the most semantically related. We divide the sum of the values of log(|A ∩ B_i|)/log(|A|) by n to get the average semantic relatedness weight of document a and document b.
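
The following is a compact sketch of equation (4) in Python (our illustration of the measure, not the original implementation). Tokenization and stop-word removal are assumed to have been done already, y is given as a fraction rather than a percentage, and the context window counts tokens rather than unique terms, which is a simplification of the definition above.

    # Sketch of equation (4): average, over the n occurrences of the anchor term in
    # target document b, of log|A ∩ B_i| / log|A|, using a window of x terms on
    # each side of the anchor term.
    import math

    def context_set(tokens, position, x):
        """Unique terms within x positions on each side of tokens[position]."""
        window = tokens[max(0, position - x):position] + tokens[position + 1:position + 1 + x]
        return set(window)

    def srd(doc_a, doc_b, anchor, y=0.6):
        # boundary x = min(N_a, N_b) * y, where N_a and N_b count unique terms
        x = int(min(len(set(doc_a)), len(set(doc_b))) * y)
        A = context_set(doc_a, doc_a.index(anchor), x)   # context in the input document
        scores = []
        for i, token in enumerate(doc_b):
            if token == anchor:                          # one B_i per occurrence in b
                B_i = context_set(doc_b, i, x)
                overlap = len(A & B_i)
                if overlap > 0 and len(A) > 1:           # avoid log(0) and division by log(1)
                    scores.append(math.log(overlap) / math.log(len(A)))
                else:
                    scores.append(0.0)
        return sum(scores) / len(scores) if scores else 0.0

    # doc_a and doc_b are lists of sanitized tokens; anchor is e.g. "humanism".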


The following article was taken from the Wikipedia corpus, and we use it as an example to illustrate the above algorithm.

    The German Renaissance, whose influence originated in Italy, started spreading among German thinkers in the 15th and 16th century. This was a result of German artists who had travelled to Italy to learn more and become inspired by the Renaissance movement. Many areas of the arts and sciences were influenced, such as the spread of humanism, to the various German states and principalities. There were many advances made in the development of new techniques in the fields of architecture, the arts and the sciences. By far the most famous German Renaissance-era artist is Albrecht Dürer who is well-known for his woodcuts, printmaking and drawings.
In the above article, there are seven links that have been created by Wikipedia contributors. They are human created links and have been edited many times by many users; hence we consider both the anchor terms and the target documents to be selected correctly.


Our algorithm first sanitizes the input document to strip out stop-words and special characters. After text sanitizing, there are 54 words left, with 45 unique words. We calculate each of the 54 words' TF-IDF weight and store each word's top five documents. For instance, the top five documents ranked by TF-IDF for the word humanism are: 2841635.xml, 49443.xml, 311263.xml, 14224.xml and 1260600.xml.

Based on the top five documents returned by TF-IDF, we then calculate their SRD weights. For each target document, we need to calculate the value x. The value x is the number of unique terms we check on each side of every anchor term. For instance, if x is 10, we check 10 unique words on the anchor term's left and 10 unique words on the anchor term's right.

To calculate x, we take the number of unique terms of the input text and of each candidate target document, then multiply the minimum of the two by the factor y. In the example above, the number of unique words in the input document is 45, and in the first-ranked TF-IDF document, 2841635.xml, it is 20. The boundary can therefore be calculated as x = min(45, 20) × 0.6, which gives us 12. We check 12 words on each side of the anchor term "humanism" in both the input document and document 2841635.xml. From equation (4), A is the set of unique words within 12 words on each side of the anchor term in the input document, and B_i is the corresponding set of unique words around each occurrence of the anchor term in document 2841635.xml. We can then calculate the SRD weight of the input document and document 2841635.xml. Similarly, we can get SRD weights for the other top five documents ranked by TF-IDF. Table 1 shows the SRD weights of the term "humanism" and the ranking comparison between TF-IDF and SRD.


Table 1: Top five documents for the term "humanism" returned by TF-IDF and their SRD weights

Document       SRD Weight           TF-IDF Rank   SRD Rank
2841635.xml    0.925512852639037    1             5
49443.xml      6.24200799178871     2             2
311263.xml     2.54011903613068     3             3
14224.xml      20.7760583886198     4             1
1260600.xml    1.14964684018376     5             4




Using the TF-IDF algorithm, we would link the term "humanism" to document 2841635.xml. However, the result is completely the opposite for our SRD algorithm. The document 2841635.xml has the lowest SRD weight among the top five TF-IDF ranks. The highest SRD weight belongs to 14224.xml, which was ranked in fourth place by TF-IDF. The document 14224.xml is also the one that has been linked to "humanism" by a human.


6. Evaluation

Wikipedia articles are manually edited by Wikipedia contributors, and every link has been carefully chosen to point to the best destination. We therefore believe that the existing hyperlinks in Wikipedia articles point to the most appropriate articles and provide us with the ground truth for evaluation. We show the results of our approach compared with TF-IDF in terms of the percentage of generated links that match the existing, human-created links.

We have randomly selected 30 articles from the Wikipedia 2007 English corpus. Since we have not included techniques for keyword extraction and gathering n-grams, we only evaluate the performance on single words in the selected documents. Anchor terms with multiple words or phrases in the articles are simply ignored during evaluation. We believe that with the correct n-gram and keyword extraction algorithms applied, the results would not be dissimilar.
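
As an illustration of the evaluation procedure, the sketch below (ours; the link-generator functions are hypothetical placeholders for the TF-IDF and SRD implementations) counts how many single-word anchors receive the same destination that the Wikipedia contributors chose.

    # Sketch: for each human-created single-word anchor in an article, check whether
    # the algorithm links it to the same destination document, and report the ratio.
    def match_rate(human_links, generate_link):
        """human_links: {anchor_term: destination_file} for single-word anchors.
        generate_link: function mapping an anchor term to a destination file."""
        matches = sum(1 for term, target in human_links.items()
                      if generate_link(term) == target)
        return matches, (matches / len(human_links) if human_links else 0.0)

    # Example (hypothetical generator functions for the two algorithms compared):
    # tfidf_matches, tfidf_pct = match_rate(links_in_article, tfidf_link)
    # srd_matches, srd_pct = match_rate(links_in_article, srd_link)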

Each article has been run against both the TF-IDF and SRD algorithms. The results are shown in Table 2.


Document        Hyperlinks    TF-IDF:           TF-IDF:         SRD:              SRD:
file name       in document   terms matching    % matching      terms matching    % matching
                              human links       human links     human links       human links

1_424499        19            2                 10.53%          6                 31.58%
7_1027282       14            0                 0%              1                 7.14%
8_66610         8             2                 25%             3                 37.5%
16_1823934      44            3                 6.82%           9                 20.45%
17_1045640      6             0                 0%              2                 33.33%
25_1525528      13            0                 0%              2                 15%
26_390568       5             0                 0%              1                 20%
29_2160116      5             0                 0%              1                 20%
30_2200414      2             1                 50%             1                 50%
33_1046012      8             0                 0%              1                 12.5%
34_75649        56            3                 5.36%           8                 14.29%
40_447039       15            2                 13.33%          1                 6.67%
41_11012        31            0                 0%              2                 6.45%
42_1952531      8             0                 0%              1                 12.5%
47_2342457      6             1                 16.67%          1                 16.67%
48_1797224      12            2                 16.67%          3                 25%
51_650350       8             0                 0%              2                 25%
52_2475753      4             1                 50%             1                 50%
55_83048_HIGH   19            10                52.63%          12                63.16%
63_425375       25            4                 16%             7                 46.67%
65_1741615      15            1                 6.67%           4                 26.67%
67_35791        12            2                 16.67%          6                 50%
71_2331851      5             2                 40%             1                 20%
72_66203        41            6                 14.29%          9                 21.95%
76_2245149      7             1                 0%              1                 14.29%
79_2035790      5             2                 40%             2                 40%
81_849246       13            0                 0%              2                 15.38%
87_266302       4             1                 25%             3                 75%
89_3147718      6             0                 0%              1                 16.67%
90_1025371      14            1                 7.14%           2                 14.29%
Average         430           47                10.93%          96                22.33%

Table 2: Results of the TF-IDF and SRD algorithms against thirty randomly selected Wikipedia documents


Table 2 clearly shows that SRD outperforms the TF-IDF algorithm in most cases. On average, 10.93% of the links generated by TF-IDF match the links created by Wikipedia contributors, whereas SRD achieves 22.33%, more than twice the ratio of TF-IDF.


Since SRD is based on TF-IDF, it was interesting to find that two of the documents had better results using TF-IDF than SRD. For example, Berlinerisch and Brandenburgisch are two dialects spoken in Germany, and the words come from German. The links for these two terms match the human generated links using TF-IDF but not using our SRD algorithm. This indicates that for terms that do not natively suit the context (for example, foreign words or transliterated words), calculating the weight based on term frequency and document frequency gives better results than semantic relatedness.


7. Conclusion

Link generation has been a popular research topic in the information retrieval field for many years. In this dissertation, we investigated Semantic Related Difference, an approach to calculating the semantic relatedness between anchor terms and documents, which can be used to generate hyperlinks automatically from unstructured input text. Our measure outperforms TF-IDF when evaluated against 30 randomly selected articles from the Wikipedia corpus.

Although the experiments were carried out on the Wikipedia corpus, our approach is not limited to any particular corpus; the SRD measure can be applied to any corpora. This is one of the key differences between our approach and the Wikify! system and Milne & Witten's approach. Both the Wikify! system and Milne & Witten's approach are specific to Wikipedia and require heavy preprocessing to obtain prior knowledge about the Wikipedia corpus.

Also, while the experiments used the English Wikipedia as the corpus, SRD is in theory language independent. The German words example shown in Section 6 indicated that the algorithm does not perform well when multiple languages are mixed in one context. It would be interesting to see the results of SRD applied to other languages.

As mentioned before, this project not only benefits Wikipedia contributors by saving their time but can also be used to fix broken hyperlinks on web pages, by finding appropriate pages for the text to link to.

Our future work includes incorporating a keyword extraction algorithm, which involves preprocessing the input text, such as tokenizing, tagging, noun chunking and gathering n-grams. We will also focus on applying SRD to various corpora (i.e. not just Wikipedia) for further investigation.


References

[1] Green, Stephen J. (1998) Automated link generation: Can we do better than term repetition? In Proc. of the 7th International World-Wide Web Conference, 1998.

[2] Wikipedia. (n.d.). In Wikipedia. Retrieved from http://en.wikipedia.org/wiki/Wikipedia

[3] Mihalcea, R. and Csomai, A. (2007) Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM'07), Lisbon, Portugal, pp. 233-242.

[4] Salton, G. and Buckley, C. (1987) Term weighting approaches in automatic text retrieval. Tech Report 87-881, Department of Computer Science, Cornell University.

[5] Denoyer, L. and Gallinari, P. (2006) The Wikipedia XML Corpus. SIGIR Forum, 40(1), June 2006, 64-69.

[6] W3C. (2010) Extensible Markup Language (XML) 1.0 (Fifth Edition). Retrieved from http://www.w3.org/TR/REC-xml/

[7] Wilkinson, R. and Smeaton, A.F. (1999) Automatic link generation. ACM Computing Surveys (CSUR), 1999.

[8] Strube, M. and Ponzetto, S.P. (2006) WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the American Association for Artificial Intelligence, Boston, MA, 2006.

[9] Milne, D. and Witten, I.H. (2008) Learning to Link with Wikipedia. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM'2008), Napa Valley, CA.

[10] Cilibrasi, R.L. and Vitanyi, P.M.B. (2007) The Google Similarity Distance. IEEE Transactions on Knowledge and Data Engineering, 19(3), 370-383.

[11] TF-IDF. (n.d.). In Wikipedia. Retrieved from http://en.wikipedia.org/wiki/Tfidf

[12] Singhal, A. (2001) Modern information retrieval: a brief overview. IEEE Data Engineering Bulletin, Special Issue on Text and Databases, 24(4), Dec. 2001.

[13] Chen, H., Chung, Y.M. and Ramsey, M. (1997) Intelligent spider for Internet searching.

[14] Fahmy, E. and Barnard, D.T. (1990) Adding Hypertext Links to an Archive of Documents. The Canadian Journal of Information Science, 15(3), 25-41.

[15] Tebbutt, J. (1999) User evaluation of automatically generated semantic hypertext links in a heavily used procedural manual. Information Processing and Management, 35(1), 1-18.

[16] Green, S.J. (1997) Automatically generating hypertext by computing semantic similarity. University of Toronto.

[17] Allan, J. (1997) Building Hypertext using Information Retrieval. Information Processing and Management, 33(2), 145-159.

[18] Agosti, M. and Allan, J. (1997) Methods and tools for the construction of hypertext. Information Processing and Management, 33(2), 129-271.












Appendix I: A complete document sample from the Wikipedia XML corpus

<?xml version="1.0" encoding="utf-8"?>
<article>
<name id="17941">Liquid</name>
<conversionwarning>2</conversionwarning>
<body>
<template name="otheruses"></template>
<figure>
<image xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="../pictures/WaterAndFlourSuspensionLiquid.jpg" id="-1" xlink:actuate="onLoad" xlink:show="embed">WaterAndFlourSuspensionLiquid.jpg</image>
<caption>A liquid will assume the shape of its container.</caption>
</figure>
<p>A <emph3>liquid</emph3>(a <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="23637.xml">phase of matter</collectionlink>) is a <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="1083844.xml">fluid</collectionlink>whose <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="32498.xml">volume</collectionlink>is fixed under conditions of constant <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="30357.xml">temperature</collectionlink>and <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="23619.xml">pressure</collectionlink>; and, whose shape is usually determined by the container it fills. Furthermore, liquids exert pressure on the sides of a container as well as on anything within the liquid itself; this pressure is transmitted undiminished in all directions.</p>If a liquid is at rest in a uniform gravitational field, the <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="23619.xml">pressure</collectionlink> <math>p</math>at any point is given by
<indentation1>
<math>p=\rho gz</math>
</indentation1>
<p>where <math>\rho</math>is the <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="8429.xml">density</collectionlink>of the liquid (assumed constant) and <math>z</math>is the depth of the point below the surface. Note that this formula assumes that the pressure <emph3>at</emph3>the free surface is zero, <emph2>relative</emph2>to the surface level.</p>
<p>Liquids have traits of <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="113302.xml">surface tension</collectionlink>and <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="1119771.xml">capillarity</collectionlink>; they generally expand when heated, and contract when cooled. Objects immersed in liquids are subject to the phenomenon of <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="245982.xml">buoyancy</collectionlink>.</p>
<p>Liquids at their respective <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="748873.xml">boiling point</collectionlink>change to <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="12377.xml">gas</collectionlink>es, and at their <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="40283.xml">freezing point</collectionlink>s, change to a <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="27726.xml">solid</collectionlink>s. Via <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="213614.xml">fractional distillation</collectionlink>, liquids can be separated from one another as they <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="10303.xml">vaporise</collectionlink>at their own individual boiling points. Cohesion between <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="19555.xml">molecule</collectionlink>s of liquid is insufficient to prevent those at free surface from <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="10303.xml">evaporating</collectionlink>.</p>
<p>It should be noted that <collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="12581.xml">glass</collectionlink>at normal temperatures is <emph2>not</emph2>a "supercooled liquid", but a solid. See the article on glass for more details.</p>
<section>
<title>See also</title>
<normallist>
<item><collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="1314474.xml">List of phases of matter</collectionlink></item>
<item><unknownlink src="ripple">ripple</unknownlink></item>
<item><collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="37379.xml">specific gravity</collectionlink></item>
<item><collectionlink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="1538336.xml">liquid dancing</collectionlink></item>
</normallist>
<template name="template phase of matter"></template>
<languagelink lang="ar">سائل</languagelink>
<languagelink lang="bg">Течност</languagelink>
<languagelink lang="ca">Líquid</languagelink>
<languagelink lang="cs">Kapalina</languagelink>
<languagelink lang="de">Flüssigkeit</languagelink>
<languagelink lang="es">Líquido</languagelink>
<languagelink lang="eo">Likvaĵo</languagelink>
<languagelink lang="fr">Liquide</languagelink>
<languagelink lang="ko">액체</languagelink>
<languagelink lang="io">Liquido</languagelink>
<languagelink lang="id">Cairan</languagelink>
<languagelink lang="is">Vökvi</languagelink>
<languagelink lang="it">Liquido</languagelink>
<languagelink lang="he">נוזל</languagelink>
<languagelink lang="ms">Cecair</languagelink>
<languagelink lang="nl">Vloeistof</languagelink>
<languagelink lang="ja">液体</languagelink>
<languagelink lang="nn">Væske</languagelink>
<languagelink lang="pl">Płyn</languagelink>
<languagelink lang="pt">Líquido</languagelink>
<languagelink lang="ru">Жидкость</languagelink>
<unknownlink src="simple:Liquid">simple:Liquid</unknownlink>
<languagelink lang="sl">Kapljevina</languagelink>
<languagelink lang="sv">Vätska</languagelink>
<languagelink lang="zh">液体</languagelink>
</section>&lt;/&gt;</body>
</article>