Powerpoint template for scientific posters ... - The Study Stream


Introduction


As Web 2.0 technologies such as free-tagging and folksonomy become popular, many researchers have turned their attention to adopting these newly emerged information technologies to enhance text retrieval. Generally, tags reflect the contents of a text, so it is assumed that using tags as a retrieval feature can improve retrieval performance. However, as Hotho et al. (2006) and Passant (2007) remarked, free-tagging has certain limits for use in text retrieval. Since free-tagging is done by users without any control, tags can be ambiguous and/or irrelevant to the original text.


Instead of using tags assigned arbitrarily by users, the present study is concerned with using automatically extracted keywords or keyphrases in text retrieval. In this study, a keyword is defined as a single word that represents the contents of a given text, and a keyphrase as a phrase that describes the contents well. Since keywords and keyphrases represent the main ideas and items of a given text, giving more weight to keyword or keyphrase terms is expected to enhance retrieval performance.


This study is set in the Korean academic environment, where index terms are compound words or phrases in more than 70% of cases (Lee et al., 2003). In Korean, extracted single words have only a limited capability of representing the contents. Keywords are therefore extracted first by a neural network algorithm, and keyphrases are then generated by combining the extracted keywords within their context.


To make full use of the generated keywords and keyphrases, they are tested in text retrieval. They are coded into a document vector with a certain weight and combined with the original document vector. In this way, the original document vector is enhanced to better represent the context of the original document, which can improve overall retrieval performance.
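
The combination step can be illustrated with a minimal sketch. It is not the study's implementation: the function name `boost_vector`, the dictionary representation of a vector, and the additive rule (each occurrence of a term in the keyword or keyphrase list adds `weight` to that term's raw frequency) are assumptions made for illustration.

```python
from collections import Counter

def boost_vector(doc_terms, feature_terms, weight):
    """Add weighted feature-term counts onto a raw term-frequency vector.

    doc_terms: tokens of the original document.
    feature_terms: tokens taken from the keyword and/or keyphrase lists.
    weight: multiplier applied to every feature-term occurrence (assumed rule).
    """
    boosted = {term: float(count) for term, count in Counter(doc_terms).items()}
    for term, count in Counter(feature_terms).items():
        boosted[term] = boosted.get(term, 0.0) + weight * count
    return boosted

# Hypothetical toy data, not taken from the test collection.
doc = ["korean", "thesaurus", "standard", "word", "thesaurus"]
features = ["thesaurus", "standard", "thesaurus"]
boosted = boost_vector(doc, features, weight=2.0)
print(boosted)
```

In this toy case "thesaurus" appears twice in the feature list, so it gains twice the weight that "standard" gains, while "korean" and "word" keep their original frequency.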

Experimental Design



As shown in Figure 1, a neural network implemented with Feed-forward Neural Network for Python (Wojciechowski, 2007) judges whether each word is eligible to be a keyword on the basis of its TF*IDF and its location in the document.
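
The two input features named above could be computed as in the following sketch. The exact feature definitions are not given here, so everything in it is an assumption: the function name `keyword_features`, the plain `log(N / df)` form of IDF, and the encoding of "location" as the relative offset of a word's first occurrence. The classifier itself is omitted.

```python
import math
from collections import Counter

def keyword_features(doc_tokens, corpus):
    """Return {word: (tfidf, first_position)} for one tokenized document.

    corpus is a list of tokenized documents and must include doc_tokens,
    so that every word has a document frequency of at least 1.
    first_position is the relative offset of the first occurrence, in [0, 1).
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each word once per document
    tf = Counter(doc_tokens)
    features = {}
    for position, word in enumerate(doc_tokens):
        if word in features:
            continue  # keep only the first occurrence's position
        idf = math.log(n_docs / doc_freq[word])
        features[word] = (tf[word] * idf, position / len(doc_tokens))
    return features

# Hypothetical toy corpus.
corpus = [
    ["thesaurus", "standard", "thesaurus"],
    ["retrieval", "standard"],
    ["neural", "network"],
]
feats = keyword_features(corpus[0], corpus)
```

Each `(tfidf, first_position)` pair would then be passed to a trained classifier that outputs a keyword / non-keyword decision.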

Keyphrases are generated by a rule-based algorithm. The algorithm makes a window of 1 preceding and 3 following words around each extracted keyword and merges adjacent or overlapping windows. It then rules out words that are inadequate for a noun phrase by analyzing the lexical category of each word. Words that appear in the automatically extracted keywords or generated keyphrases are added onto the original document vector with a certain weight, so that essential words carry more weight. Okapi TF×IDF normalization was applied to each vector. Keyphrases include certain keywords repeatedly. Thus, more important keywords get more weight, while less important keywords often get no additional weight at all. An example of the extracted keywords and keyphrases is shown in Figure 2.
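
The window-and-merge step can be sketched as follows. This is an illustration under assumptions, not the original algorithm: the function name `keyword_windows` and the half-open index spans are hypothetical, and the lexical-category filter that removes words unsuitable for a noun phrase is left out.

```python
def keyword_windows(tokens, keywords, before=1, after=3):
    """Return merged (start, end) index spans around keyword occurrences.

    Each keyword at index i opens the span [i - before, i + after + 1),
    clipped to the document. Spans that overlap or touch are merged.
    """
    spans = []
    for i, token in enumerate(tokens):
        if token not in keywords:
            continue
        start = max(0, i - before)
        end = min(len(tokens), i + after + 1)
        if spans and start <= spans[-1][1]:
            # Overlapping or adjacent: extend the previous span.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans

# Hypothetical English example sentence.
tokens = ("the standard for hangeul thesaurus construction was proposed "
          "in a later report on indexing and retrieval").split()
keywords = {"standard", "thesaurus", "retrieval"}
spans = keyword_windows(tokens, keywords)
candidates = [" ".join(tokens[s:e]) for s, e in spans]
```

Here the windows around "standard" and "thesaurus" overlap and collapse into one candidate span, while "retrieval" yields a separate span. A part-of-speech filter would then trim each candidate down to a noun phrase.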


With the weighted vectors, and with the original vectors as a baseline, text retrieval experiments were carried out. The test collection consisted of 545 abstracts of academic papers from the Yonsei University Library in ten different academic fields. The results of the retrieval experiments with the keyword-added vector, the keyphrase-added vector, and a vector with both keyword and keyphrase entry words are shown in Figure 3. The results were evaluated by R-precision.
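
For reference, R-precision is the precision at rank R, where R is the number of documents relevant to the query. A minimal sketch follows; the function name and the toy ranking are hypothetical and are not drawn from the test collection.

```python
def r_precision(ranked_ids, relevant_ids):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant_ids)
    if r == 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:r] if doc_id in relevant_ids)
    return hits / r

# Hypothetical ranking returned for one query.
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d3", "d9"}
score = r_precision(ranked, relevant)  # 3 of the top 4 are relevant
```

A collection-level score would typically be the mean of this value over all test queries.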

Hohyon Ryu

Yonsei University, Seoul, Korea &
School of Information Studies, University of Wisconsin-Milwaukee

[Illustration: the passage beginning "This study is based on Korean academic environment ..." shown twice, once as plain text and once with its keyphrases highlighted.]

Highlighted keyphrases help people to read a document better. Can they help a search engine to work better?

Literature Review


Previous studies related to neural-network-based keyword extraction, noun phrase generation/extraction, and improving text retrieval performance with contextual features are reviewed here. Neural networks and other machine learning methods have been used in several studies to decide whether a given word should be recognized as a keyword (Medelyan et al., 2008; Jo T. C. et al., 2000). Noun phrase extraction has also been approached in many different ways (Tomokiyo et al., 2003; Yang, 2000; Lee S. S. et al., 2003; Lee C. Y. et al., 1993; Lee H. A. et al., 1997). Since keyphrases are more prevalent than single-noun keywords and since more complicated processes are involved in the Korean language, many of these studies have been done by Korean researchers. Finally, Hotho et al. (2006) and Cho et al. (2005) conducted studies to improve text retrieval performance with keywords or folksonomy.

[Figure 1 diagram, box labels: Original Text; Neural Network Learning; Keywords (single noun); Rule-based Generation; Keyphrases (compound nouns or combination of keywords); Document Vector]

Figure 1: Outline of the experiment.

Title: Study on Guidelines for the Construction of a Korean Thesaurus

Keywords: 1986, Korean, basic, Hangeul, definition, standard, 2788, relation, word, ISO, alphabet, thesaurus, rule, term, most

Keyphrases: standard for Hangeul thesaurus construction, Hangeul thesaurus, word thesaurus, ISO standard, aspect of Hangeul thesaurus, Hangeul thesaurus test, Hangeul thesaurus data, ISO, word thesaurus construction standard, Hangeul thesaurus management system

Figure 2: Example of the extracted keywords and keyphrases.

[Figure 3 chart: R-precision (y-axis, 0.55 to 0.80) plotted against assigned weight (x-axis, 1 to 9) for the Baseline, Keyword, Keyphrase, and Keyword+Keyphrase vectors; labeled data points: 0.72, 0.74, 0.66.]

Figure 3: The change of R-precision according to the assigned weight and features

Result

As Figure 3 suggests, text retrieval performance increased by about 15%, from an R-precision of 0.64 to 0.74, when the words that appear in both the keyphrases and the keywords were added to the original vector with double weight. For the keyword + keyphrase vector, the margin of improvement shrank as higher weights were assigned to the additional terms.

On the other hand, for the keyword-added vector (the dashed line in Figure 3), the higher the assigned weight, the better the performance. This is because the same weight is assigned to every keyword item, while words in the keyphrases get different weights according to how often they appear in the keyphrase list.
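
The uneven weighting of keyphrase words can be seen by counting word occurrences in the keyphrase list of Figure 2. The sketch below assumes that a word's extra weight is proportional to its number of occurrences in that list and that whitespace tokenization of the English rendering is adequate. Both are assumptions for illustration; the original material is Korean.

```python
from collections import Counter

# Keyphrases and keywords as listed in Figure 2.
keyphrases = [
    "standard for Hangeul thesaurus construction",
    "Hangeul thesaurus",
    "word thesaurus",
    "ISO standard",
    "aspect of Hangeul thesaurus",
    "Hangeul thesaurus test",
    "Hangeul thesaurus data",
    "ISO",
    "word thesaurus construction standard",
    "Hangeul thesaurus management system",
]
keywords = ["1986", "Korean", "basic", "Hangeul", "definition", "standard",
            "2788", "relation", "word", "ISO", "alphabet", "thesaurus",
            "rule", "term", "most"]

# How often each word occurs across the whole keyphrase list.
counts = Counter(word for phrase in keyphrases for word in phrase.split())

# Occurrences per keyword; 0 means the keyphrase list adds nothing for it.
extra = {kw: counts.get(kw, 0) for kw in keywords}
for kw, n in sorted(extra.items(), key=lambda item: -item[1]):
    print(f"{kw}: {n}")
```

Under these assumptions "thesaurus" is reinforced 8 times and "Hangeul" 6 times, whereas keywords such as "1986" or "most" receive no reinforcement from the keyphrase list. In the keyword list, by contrast, every item appears exactly once and therefore receives the same weight.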

Conclusion & Future Research

The present study shows that giving extra weight to words that appear in the keywords or keyphrases positively affects the performance of text retrieval. Since current web retrieval returns a significant number of irrelevant documents in response to users' requests, modifying search algorithms to be more sensitive to the subject of documents will help improve retrieval performance. Additionally, the neural network keyword extraction and the rule-based keyphrase generation performed with stable efficiency. An evaluation of keyword and keyphrase generation will be carried out in the future so that the modules can be used as independent software. Also, since the retrieval test in the Korean environment has shown significant improvement, further experiments will be conducted for English. A similar improvement in retrieval performance is expected there.

References

Cho, M., Yun, B., & Rim, H. 1997. A Korean Document Retrieval Model Considering Compound Nouns and Derived Nouns. Proceedings of Korea Information Science Society Spring Conference 24(1). 449-502.

Hotho, A., Jaschke, R., Schmitz, C., & Stumme, G. 2006. Information Retrieval in Folksonomies: Search and Ranking. Lecture Notes in Computer Science. Springer Berlin: Heidelberg.

Jo, T. C., & Seo, J. 2000. Neural Based Approach to Keyword Extraction from Documents. Proceedings of Korea Information Science Society Autumn Conference 27(2). 317-319.

Lee, C. Y., Kang, H., Jang, H., & Park, S. 1993. A Design of the Automatic Keyword Maker. Proceedings of the 5th Conference of Hangul and Korean Information Processing. 71-77.

Lee, H. A., Lee, J. H., & Lee, G. 1997. Noun Phrase Indexing using Clausal Segmentation. Journal of Korea Information Science Society(b) 25(3). 301-311.

Lee, S. S., & Lee, T. 2003. Concept-based Compound Keyword Extraction. Journal of Korea Association of Computer Education 6(2).

Medelyan, O., & Witten, I. H. 2008. Domain Independent Automatic Keyphrase Indexing with Small Training Sets. JASIST 59(7). 1026-1040.

Passant, A. 2007. Using Ontologies to Strengthen Folksonomies and Enrich Information Retrieval in Weblogs. International Conference on Web Services.

Tomokiyo, T., & Hurst, M. 2003. A Language Model Approach to Keyphrase Extraction. Proceedings of the ACL Workshop on Multiword Expressions.

Wojciechowski, M. 2007. Feed-forward neural network for python. Technical University of Lodz (Poland), Department of Civil Engineering, Architecture and Environmental Engineering, http://ffnet.sourceforge.net/, ffnet-0.6, March 2007.

Yang, J. 2000. Base Noun Phrase Recognition in Korean using Rule-based Learning. Journal of Korea Information Science Society: Software and Applications 27(10).

For more information, please send an email to hohyon@gmail.com