TEXT MINING FOR INFORMATION RETRIEVAL



Synopsis of the Thesis to be submitted in fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

By

NAMITA GUPTA

Department of Computer Science and Engineering
JAYPEE INSTITUTE OF INFORMATION TECHNOLOGY UNIVERSITY
A-10, SECTOR-62, NOIDA, INDIA

May, 2011

Synopsis - 1
TEXT MINING FOR INFORMATION RETRIEVAL



Introduction

Nowadays, large quantities of data are being accumulated in data repositories. Usually there is a huge gap between the stored data and the knowledge that could be constructed from it. This transition does not occur automatically; that is where Data Mining comes into the picture. In Exploratory Data Analysis, some initial knowledge about the data is known, but Data Mining can provide a more in-depth understanding of the data. Extracting knowledge from massive data is one of the most desired capabilities of Data Mining. Manual data analysis has been practiced for some time, but it becomes a bottleneck for large-scale analysis. Rapidly developing computer science and engineering techniques and methodologies generate new demands to mine complex data types. A number of Data Mining techniques (such as association, clustering and classification) have been developed to mine this vast amount of data. Previous studies [18] on Data Mining focus on structured data, such as relational and transactional data. However, in reality, a substantial portion of the available information is stored in text databases (or document databases), which consist of large collections of documents from various sources, such as news articles, books, digital libraries and Web pages. Text databases are growing rapidly due to the increasing amount of information available in electronic form, such as electronic publications, e-mail, CD-ROMs, and the World Wide Web (which can itself be viewed as a huge, interconnected, dynamic text database).

Data stored in text databases is mostly semi-structured, i.e., it is neither completely unstructured nor completely structured. For example, a document may contain a few structured fields, such as title, authors, publication date, length and category, but also largely unstructured text components, such as the abstract and contents. In recent database research, studies have been carried out to model and implement semi-structured data. Information Retrieval techniques, such as text indexing, have been developed to handle unstructured documents. However, traditional Information Retrieval techniques become inadequate for the increasingly vast amount of text data. Typically, only a small fraction of the many available documents will be relevant to a given individual or user. Without knowing what could be in the documents, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users need tools to compare different documents, rank their importance and relevance, or find patterns and trends across multiple documents. Thus, Text Mining has become an increasingly popular and essential theme in Data Mining.

Text Mining, also known as knowledge discovery from text or document information mining, refers to the process of extracting interesting patterns from very large text corpora for the purpose of discovering knowledge. It is an interdisciplinary field involving Information Retrieval, Text Understanding, Information Extraction, Clustering, Categorization, Topic Tracking, Concept Linkage, Computational Linguistics, Visualization, Database Technology, Machine Learning, and Data Mining [25].

Text Mining tools and applications aim to capture relationships within the data. They can be roughly organized into two groups. One group focuses on document exploration functions that organize documents based on their content and provide an environment for a user to navigate and browse in a document or concept space; it includes Clustering, Visualization, and Navigation. The other group focuses on text analysis functions that analyze the content of the documents and discover relationships between the concepts or entities described in them. These are mainly based on natural language processing techniques, including Information Retrieval, Information Extraction, Text Categorization, and Summarization [27], [28].

Content-based text selection techniques have been extensively evaluated in the context of Information Retrieval. Every approach to text selection has four basic components:

- Some technique for representing the documents
- Some technique for representing the information needed (i.e., profile construction)
- Some way of comparing the profiles with the document representation
- Some way of using the results of the comparison




Issues in Information Retrieval

The main motivation of our research is to study the different existing tools and techniques of Text Mining for Information Retrieval (IR). The search engine is the most well-known Information Retrieval tool. Applying Text Mining techniques to Information Retrieval can improve the precision of retrieval systems by filtering relevant documents for a given search query.


Electronic information on the Web is a useful resource for users to obtain a variety of information. The process of manually compiling text pages according to a user's needs and preferences into actionable reports is very labor intensive, and the effort is greatly amplified when the reports need to be updated frequently. Updates to what has been collected often require repeating the search, filtering previously retrieved web pages, and re-organizing them. To harness this information, various search engines and Text Mining techniques have been developed to gather and organize web pages. Retrieving relevant text pages on a topic from a large page collection is a challenging task.

Given below are some issues identified in the Information Retrieval process:


Issue (1): Traditional Information Retrieval techniques become inadequate for large text databases containing a high volume of text documents. To search for relevant documents in a large document collection, a vocabulary is used that maps each term in the search query to the address of the corresponding inverted list; the inverted lists are then read from disk and merged, taking the intersection or union of the document sets for AND, OR and NOT operations [8], [24], [30], [31]. To support the retrieval process, the inverted file requires several additional structures, such as the document frequency of each lexicon entry in the vocabulary and the term frequency of each term in a document. The principal costs of the searching process are the memory space required to hold inverted file entries and the time spent processing large inverted files, which maintain a record of every document in the corpus since each is a potential answer. More terms in the query mean more disk accesses into the inverted file and more time spent merging the retrieved lists.
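The vocabulary lookup and list-merging step described above can be sketched as follows; the toy collection and function names are illustrative, not the thesis implementation:

```python
from collections import defaultdict

# Build an inverted index: term -> set of document ids (postings).
docs = {
    0: "data mining extracts knowledge from data",
    1: "text mining for information retrieval",
    2: "inverted files support fast text retrieval",
}
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def postings(term):
    return index.get(term, set())

# Boolean query evaluation: AND takes the intersection of the
# postings lists, OR takes their union.
def query_and(*terms):
    result = postings(terms[0])
    for t in terms[1:]:
        result = result & postings(t)   # fewer candidates after each merge
    return sorted(result)

def query_or(*terms):
    result = set()
    for t in terms:
        result = result | postings(t)
    return sorted(result)

print(query_and("text", "retrieval"))  # -> [1, 2]
print(query_or("data", "inverted"))    # -> [0, 2]
```

Each additional AND term shrinks the candidate set, but, as noted above, it also means another postings list must be fetched and merged.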




Issue (2): Presently, when performing query-based searching, search engines return a set of web pages containing both relevant and non-relevant pages, sometimes assigning non-relevant pages a higher rank score. These search engines use one of the following approaches to organize, search and analyze information on the web. In the first approach [30], the ranking algorithm uses term frequency to select the terms of a page for indexing it (after filtering out common or meaningless words). In the second approach [5], [9], [11], [19], [20], the structure of links between pages is considered in order to identify pages that are often referenced by other pages. By analyzing the density, direction and clustering of links, such methods can identify the pages that are likely to contain valuable information. Another approach [9], [26], [29] analyzes the content of the pages linked to or from the page of interest; these studies examine the similarity of word usage at different link distances from the page of interest and demonstrate that the structure of words used by the linked pages enables more efficient indexing and searching. The anchor text [15] of a hyperlink is considered to describe its target page, so target pages can be replaced by their corresponding anchor text. However, the nature of the Web search environment is such that retrieval approaches based on a single source of evidence suffer from weaknesses that can hurt retrieval performance. For example, content-based Information Retrieval approaches do not consider the link information of a page while ranking the target page, which affects the assessed quality of web documents, while link-based approaches [6], [19], [20] can suffer from incomplete or noisy link topology. The inadequacy of singular Web Information Retrieval approaches makes a strong argument for combining multiple sources of evidence as a potentially advantageous retrieval strategy for Web Information Retrieval.
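As an illustration of the link-structure approach mentioned above, here is a minimal sketch of Kleinberg's HITS iteration on a made-up four-page link graph (the graph and variable names are assumptions for illustration only):

```python
import math

# adjacency: page -> pages it links to (a tiny made-up web graph)
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pages = sorted(set(links) | {q for tgts in links.values() for q in tgts})

hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(50):  # power iteration until the scores stabilize
    # authority score: sum of hub scores of the pages linking to it
    auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
    # hub score: sum of authority scores of the pages it links to
    hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
    # normalize to unit length so the iteration converges
    na = math.sqrt(sum(v * v for v in auth.values()))
    nh = math.sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / na for p, v in auth.items()}
    hub = {p: v / nh for p, v in hub.items()}

# "c" is pointed to by a, b and d, so it receives the top authority score
print(max(auth, key=auth.get))
```

Note how the ranking depends entirely on link topology: a page with many in-links dominates regardless of its text content, which is exactly the bias toward in-degree that the synopsis criticizes.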


Issue (3): A common problem in Information Retrieval is that users have to browse a large number of documents, containing both relevant and non-relevant ones, before finding the relevant documents. Clustering keeps similar documents together in a single group and hence speeds up the Information Retrieval process by retrieving documents from the same cluster, based on matching the query vector to the cluster centroid. Many clustering algorithms are available, such as K-means, Bisecting K-means, HFTC (Hierarchical Document Clustering Using Frequent Itemsets), the hybrid PSO+K-means method and Global K-means [3], [7], [16], [17], [23]. However, there are many challenges in applying these existing clustering techniques to the domain of text documents. Bisecting K-means produces a deep hierarchy, which makes browsing difficult if one makes an incorrect selection while navigating the hierarchy. Although HFTC [4] produces a relatively flat hierarchy compared to Bisecting K-means, it is expensive in terms of calculating the global frequent itemsets needed to create the clusters. Global K-means [22], unlike K-means, is insensitive to the choice of the initial k cluster centers, thus giving a globally optimal solution; but it requires executing the K-means method nk times for a document set of size n to generate k clusters, giving a time complexity of O(nk). In the hybrid PSO+K-means method [12], the PSO (Particle Swarm Optimization) module is executed for a short period to search for the optimum cluster centroid locations, and then the K-means module refines and generates the final clustering solution. This method produces a globally optimal solution like Global K-means, but it too requires assuming the initial value of k.
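For reference, the basic K-means procedure whose weaknesses are discussed above can be sketched as follows; the toy 2-D points stand in for document vectors, and everything here is illustrative:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # sensitive to this initial choice
    clusters = []
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2 +
                                            (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # update step: move each centroid to the mean of its members
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:                   # converged
            break
        centroids = new
    return clusters, centroids

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
clusters, _ = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # -> [3, 3]: two natural groups
```

The two assumptions criticized in the text are visible directly in the code: k must be supplied up front, and the randomly sampled initial centroids can trap the iteration in a local optimum on less well-separated data.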


Issue (4): To speed up the process of document retrieval, text summarization techniques are used [1], [2], [10], [13]. Documents are ranked based on the summary or abstract provided by their authors. But this is not always possible, since not all documents come with an abstract or summary. Also, when different summarization tools such as Copernic, SweSum, Extractor, MSWord, Intelligent, Brevity and Pertinence are used to summarize a document, not all the topics covered within the document are reflected in its summary.


Proposed Solutions

The aim of this thesis is to address the issues discussed above in order to improve the accuracy of the Information Retrieval process. The approaches proposed in this thesis contribute to the field of text Information Retrieval and provide more relevant documents (web pages) for a given search query, accurately, efficiently and in less time.

1) To speed up the process of text document retrieval and use memory space effectively, as discussed in issue (1), we propose an algorithm based on an inverted index file. By using the range partition feature of Oracle, the memory space requirement is reduced considerably, as the inverted index file is stored on secondary storage and only the required portion of it is maintained in main memory. Fuzzy logic is applied to retrieve the selected documents, and suffix tree clustering is then used to group similar documents.


2) To handle the problem discussed in issue (2), a method is proposed for learning the web structure to classify web documents; it demonstrates the usefulness of considering the text content of backward links and forward links when computing the page rank score. The similarity of word usage at a single-level link distance from the target page is analyzed, which shows that the content of words in the linked pages enables more efficient indexing and searching. The proposed method reduces the limitations of existing Link Analysis algorithms, such as Kleinberg's HITS algorithm and SALSA [14], [21], while computing the rank score of the retrieved web pages, and the results obtained by the proposed method are not biased towards the in-degree or out-degree of the target page. Also, the rank scores obtained have non-zero values, which helps to rank the web pages more accurately.

3) An approach to text document clustering is proposed that overcomes the drawbacks of K-means and Global K-means discussed in issue (3); it gives a globally optimal solution with time complexity of O(lk) to obtain k clusters from an initial set of l starting clusters.

4) A new method is proposed for building a generic, extract-based, fixed-length summary of a single text document, to handle the limitations of the existing summarization algorithms discussed in issue (4). The index term(s) of the document are identified based on the key terms of each sentence and paragraph within the document. The rank of a sentence is computed from the number of matching terms between the document index terms and the sentence index terms. Sentences with high rank scores are extracted for inclusion in the final summary.
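The sentence-ranking idea can be sketched roughly as follows; the scoring details below are simplified assumptions for illustration, not the exact thesis algorithm:

```python
from collections import Counter

def summarize(text, num_sentences=2):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # document "index terms": here simply the most frequent words overall
    words = [w.lower() for s in sentences for w in s.split()]
    doc_terms = {w for w, _ in Counter(words).most_common(5)}
    # rank each sentence by how many document index terms it contains
    def score(s):
        return len(doc_terms & {w.lower() for w in s.split()})
    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # keep the chosen sentences in their original order for readability
    return ". ".join(s for s in sentences if s in ranked) + "."

text = ("Text mining extracts patterns from text. "
        "Clustering groups similar documents. "
        "Mining text corpora reveals patterns across documents. "
        "The weather was pleasant that day.")
print(summarize(text))
```

The off-topic weather sentence shares no index terms with the rest of the document, so it is ranked lowest and excluded, mirroring the idea that the extract should depict the document's main topics.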



Results and Findings of the Work Done

The conclusions and findings derived from applying the proposed methods to the issues in Information Retrieval discussed above are presented below:


1) Compression and partitioning of the inverted file reduce the memory space requirement and allow the entire compressed inverted file to be stored in secondary storage. It is the use of compression and partitioning that results in the superior performance of the retrieval process, as shown in Fig. 1. Through the range partitioning feature of Oracle, a smaller, faster representation of the sorted lexicon of interest can be achieved. To process a query, only a small portion of the compressed inverted index file is cached in memory. The input/output (I/O) time required for loading a much smaller compressed postings list is small, although it adds some decompression cost; hence the retrieval system runs faster on compressed postings lists than on uncompressed ones. Also, the sorted lexicon permits rapid binary search for matching strings. Conjunctive queries are easily handled through fuzzy logic when retrieving documents: for an AND operation, the documents retained are those in the document set with a high value of the α-cut (threshold value), while for an OR operation any non-zero value of the α-cut is accepted. The proposed method also retrieves documents that contain synonyms of the searched query terms rather than the query terms themselves.
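The fuzzy AND/OR retrieval with an α-cut described above can be illustrated as follows; the membership values below are made up for the sketch:

```python
# Each document has a membership degree per term in [0, 1]. Fuzzy AND
# combines degrees with min, fuzzy OR with max, and an alpha-cut keeps
# only documents whose combined membership meets the threshold.
membership = {                      # doc -> {term: membership degree}
    "d1": {"data": 0.9, "mining": 0.8},
    "d2": {"data": 0.4, "mining": 0.0},
    "d3": {"data": 0.0, "mining": 0.7},
}

def fuzzy_and(doc, terms):
    return min(membership[doc].get(t, 0.0) for t in terms)

def fuzzy_or(doc, terms):
    return max(membership[doc].get(t, 0.0) for t in terms)

def alpha_cut(scores, alpha):
    # keep the documents whose combined membership reaches the threshold
    return sorted(d for d, s in scores.items() if s >= alpha)

docs = membership.keys()
and_scores = {d: fuzzy_and(d, ["data", "mining"]) for d in docs}
or_scores = {d: fuzzy_or(d, ["data", "mining"]) for d in docs}

print(alpha_cut(and_scores, 0.5))   # AND with alpha = 0.5 -> ['d1']
print(alpha_cut(or_scores, 0.001))  # OR with any non-zero membership
```

Unlike crisp Boolean retrieval, a document that mentions one query term only weakly (d2 here) is excluded by the AND α-cut but still surfaces under OR, which matches the threshold behaviour described above.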














Fig. 1: Compressed inverted index file structure showing the range partition (text document collection stored on disk; collection vocabulary of words with inverted file entries for synonyms; collection vocabulary of documents with compressed inverted file entries for document_ids)

2) On the basis of the results obtained, it is found that the text content information of both backward and forward links is useful for ranking the target page. It is also observed that using only the extended anchor text from documents that link to the target document, or considering only the words and phrases on the target page itself (full-text), does not yield very accurate results. We analyze the similarity of word usage at a single-level link distance from the page of interest and demonstrate that the content of words in the linked pages enables more efficient indexing and searching. The proposed method reduces the limitations of the existing Link Analysis algorithms (HITS, pSALSA, SALSA, HubAvg, AThresh, HThresh, FThresh, BFS) while computing the rank of a web page, and the results obtained by the proposed method are not biased towards the in-degree or out-degree of the target page, since the links to the target page can easily be differentiated as navigational, functional and noisy links; while computing the rank score of the target page, only functional links are considered. Also, the non-zero rank scores obtained by the proposed method help to rank the web pages accurately, while other Link Analysis algorithms sometimes compute a zero rank score for some pages.


3) The following weaknesses of K-means and Global K-means clustering are removed by applying the proposed clustering method:

(i) The number of clusters, k, need not be assumed initially, as required in K-means; this number k is determined by the proposed clustering method itself. The required number of clusters is then obtained iteratively by combining similar clusters based on their inter-cluster distance (between the two cluster centroids) and minimum intra-cluster distance (between a cluster centroid and its corresponding member documents).

(ii) The proposed method always obtains the same clusters from the same data, even if the documents are considered in a different sequence, which is not possible in K-means.

(iii) Different initial conditions (k cluster centers) produce the same cluster results. Hence the algorithm is not trapped in a local optimum, as K-means can be.

(iv) In the proposed method, it is not required to know which term in a document contributes more to the grouping process, since we use the TF-IDF weight of each term to determine its importance in the clustering process. Hence the final clusters produced are independent of any initial assumptions at the start of the clustering process.

(v) The time complexity of the proposed clustering method is O(lk), starting with l initial clusters, which is less than the O(nk) time complexity of Global K-means for a document set of size n, at the cost of some cluster quality. But the gain from the reduced time complexity outweighs this slight degradation in cluster quality.

Experimental evaluation on Reuters newsfeeds (Reuters-21578) shows that the clustering results (entropy, purity, F-measure) obtained by the proposed method are comparable with K-means and Global K-means.
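The TF-IDF weighting mentioned in point (iv) can be illustrated with a minimal sketch; the tiny corpus below is made up:

```python
import math

docs = [
    "data mining finds patterns",
    "text mining finds text patterns",
    "clustering groups documents",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)      # term frequency
    df = sum(1 for toks in tokenized if term in toks)  # document frequency
    idf = math.log(N / df)                             # rarer terms weigh more
    return tf * idf

# "mining" appears in two of the three documents, "clustering" in only
# one, so "clustering" gets a higher IDF and thus a higher weight.
w_mining = tf_idf("mining", tokenized[0])
w_clustering = tf_idf("clustering", tokenized[2])
print(w_clustering > w_mining)  # True
```

This is why no per-term importance needs to be supplied by hand: terms that discriminate between documents automatically receive larger weights in the document vectors used for clustering.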


4) It has been observed that the proposed method for generating a generic extract summary of a single text document clearly depicts the topic(s) discussed in the document and shows the linking between the sentences of the summary. Unlike other summarization methods, the method is independent of the structure of the text document and of the position of a sentence within the document: a sentence appearing later in the document can be included in the summary according to its importance within its paragraph. Our proposed text summarizer avoids redundant information in the summary by excluding sentences conveying the same information, and hence improves quality by fitting more information into the fixed-length generated summary. We evaluated our approach on the DUC-2002 corpus (the dataset contains two baseline 100-word extract summaries) and it shows satisfactory results compared to all the reported summarization systems in terms of ROUGE-N (N = 1 to 8).
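For reference, ROUGE-N is essentially n-gram recall against a reference summary. A simplified single-reference sketch (not the official ROUGE tool; no stemming or stop-word handling) is:

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum(min(cand[g], ref[g]) for g in ref)  # clipped matches
    return overlap / max(sum(ref.values()), 1)        # recall w.r.t. reference

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n(candidate, reference, n=1))  # 5 of 6 reference unigrams matched
```

Higher N rewards longer matched word sequences, which is why reporting ROUGE-1 through ROUGE-8, as above, probes both content overlap and fluency.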



Chapter Details of the Thesis

The thesis is organized chapter-wise as follows:

Chapter 1: This chapter is devoted to an introduction to Data Mining, Text Mining and Information Retrieval. Different techniques, application areas and architectures of Data Mining and Text Mining are discussed in the chapter. Basic concepts, models and techniques of Information Retrieval, such as the extraction of index terms and retrieval models, are also discussed. At the end of the chapter, the different Information Retrieval evaluation techniques and the framework of Information Retrieval are explained.



Chapter 2: This chapter discusses related work on document indexing, the hyperlink structure of web pages, clustering and text document summarization. Based on the literature survey on each topic, the problems and challenges identified in existing tools and techniques are discussed briefly, providing the basis for the work to be carried out.



Chapter 3: This chapter presents the method "Quick Text Retrieval Algorithm Supporting Synonyms Based on Fuzzy Logic". Different compression algorithms for storing the inverted index file (such as the Elias gamma code, Elias delta code and Fibonacci code) are studied, and the concept of fuzzy Information Retrieval is discussed along with suffix tree clustering.
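As an illustration of one of the codes named above, a minimal Elias gamma encoder/decoder might look like this (a sketch, not the thesis implementation); small numbers, which dominate postings-list gaps, get the shortest codes:

```python
def gamma_encode(n):
    # Elias gamma: unary prefix giving the bit-length, then the binary form
    assert n >= 1
    binary = bin(n)[2:]                      # binary form, leading 1 included
    return "0" * (len(binary) - 1) + binary  # e.g. 9 -> '0001001'

def gamma_decode(bits):
    zeros = 0
    while bits[zeros] == "0":                # count the unary length prefix
        zeros += 1
    return int(bits[zeros:zeros + zeros + 1], 2)

print(gamma_encode(1))          # -> '1'
print(gamma_encode(9))          # -> '0001001'
print(gamma_decode("0001001"))  # -> 9
```

Because gaps between consecutive document ids in a sorted postings list are usually small, encoding the gaps with such a variable-length code is what lets the compressed inverted file fit in far less space, as discussed in the results above.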


Chapter 4: This chapter is about "Web Page Ranking Based on Text Content of Linked Pages". Different link analysis ranking algorithms (HITS, pSALSA, SALSA, HubAvg, AThresh, HThresh, FThresh, BFS) are discussed, and the ranking scores of pages computed through these algorithms are compared with the proposed ranking approach, which computes the rank score of a target web page based on content analysis of the pages linked to it.


Chapter 5: This chapter discusses the problem of "Automatic Generation of Initial Value k to Apply K-means Method for Text Documents Clustering". Different clustering techniques and their limitations are discussed, including K-means variants such as Bisecting K-means, HFTC clustering (Hierarchical Document Clustering Using Frequent Itemsets), the hybrid PSO+K-means method and the Global K-means method. A clustering method is then proposed to overcome the limitations of the existing clustering methods.


Chapter 6: This chapter contains two sections. In the first section, "Document Summarization Based on Sentence Ranking Using the Vector Space Model" is discussed. Different summarization tools are analyzed, including Copernic, SweSum, Extractor, MSWord, Intelligent, Brevity and Pertinence; the summaries obtained from these tools are compared with the proposed summarizer on the DUC-2002 dataset using the ROUGE package. In the second section, a method is suggested for obtaining query-based text summarization using clustering (for both single and multi-document summarization).


Chapter 7: This is the last chapter of the thesis, in which the conclusion and future scope have been discussed.


Keywords: Information Retrieval, Suffix-tree clustering, Fuzzy logic, Query processing, Vector Space Model, Index compression, Inverted index file, Backward links, Forward links, Link structure analysis, Web page ranking, Clustering, K-means clustering, Global K-means clustering, Extract summary, ROUGE tool, Text summarization

List of Author's Publications

1. Saxena P.C., and Gupta N., "Quick Text Retrieval Algorithm Supporting Synonyms Based on Fuzzy Logic", Computing Multimedia and Intelligent Techniques (CMIT), vol. 2, no. 1, pp. 7-24, 2006. ISSN 1734-4921.

2. Saxena P.C., Gupta J.P., and Gupta N., "Web Page Ranking Based on Text Content of Linked Pages", International Journal of Computer Theory and Engineering (IJCTE), vol. 2, no. 1, pp. 42-51, Feb. 2010. ISSN 1793-8201. PDF available at www.ijcte.org/papers/115-G601.pdf

3. Gupta N., Saxena P.C., and Gupta J.P., "Automatic generation of initial value k to apply k-means method for text documents clustering", International Journal of Data Mining, Modelling and Management (IJDMMM), vol. 3, no. 1, pp. 18-41, 2011. ISSN (Online): 1759-1171, ISSN (Print): 1759-1163 (indexed in the dblp database). Abstract available at http://dx.doi.org/10.1504/IJDMMM.2011.038810

4. Gupta N., Saxena P.C., and Gupta J.P., "Document Summarization based on Sentence Ranking Using Vector Space Model" (communicated to Information Sciences, Elsevier).

References

[1] Arora R., and Ravindran B., "Latent Dirichlet Allocation and Singular Value Decomposition based Multi-Document Summarization", Proc. Eighth IEEE International Conference on Data Mining (ICDM 2008), IEEE Press, pp. 713-718, 2008.

[2] Basagic R., Krupic D., and Suzic B., "Automatic Text Summarization", Information Search and Retrieval, WS 2009, Institute for Information Systems and Computer Media, Graz University of Technology, Graz, 2009.

[3] Bellot P., and El-Beze M., "A Clustering Method for Information Retrieval", Laboratoire d'Informatique d'Avignon, France, Tech. Rep. IR-0199, 1999.

[4] Fung B.C.M., Wang K., and Ester M., "Hierarchical document clustering using frequent itemsets", Proc. Third SIAM International Conference on Data Mining, pp. 59-70, 2003.

[5] Borodin A., Roberts G.O., Rosenthal J.S., and Tsaparas P., "Finding Authorities and Hubs from link structures on the World Wide Web", Proc. 10th WWW Conference, Hong Kong, pp. 415-429, 2001.

[6] Borodin A., Roberts G.O., Rosenthal J.S., and Tsaparas P., "Link analysis ranking: algorithms, theory, and experiments", ACM Transactions on Internet Technology, vol. 5, no. 1, pp. 231-297, 2005.

[7] Bouguettaya A., "On-Line Clustering", IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 2, pp. 333-339, 1996.

[8] Bratley P., and Choueka Y., "Processing truncated terms in document retrieval systems", Information Processing and Management, vol. 18, no. 5, pp. 257-266, 1982.

[9] Chakrabarti S., Dom B., Gibson D., Kleinberg J.M., Raghavan P., and Rajagopalan S., "Automatic resource list compilation by analyzing hyperlink structure and associated text", Proc. 7th International WWW Conference, pp. 65-74, 1998.

[10] Chatterjee N., and Mohan S., "Extraction-Based Single-Document Summarization Using Random Indexing", Proc. 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007), vol. 2, pp. 448-455, 2007.

[11] Chen Z., Liu S., Wenyin L., Pu G., and Ma W., "Building a web Thesaurus from web Link Structure", Proc. 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 48-55, 2003.

[12] Cui X., and Potok T.E., "Document clustering analysis based on hybrid PSO+K-means algorithm", Journal of Computer Sciences (Special Issue), pp. 27-33, 2005.

[13] Dehkordi P., Khosravi H., and Kumarci F., "Text Summarization Based on Genetic Programming", International Journal of Computing and ICT Research, vol. 3, no. 1, pp. 57-64, 2009.

[14] Farahat A., Lofaro T., Miller J.C., Rae G., and Ward L.A., "Authority Rankings from HITS, PageRank, and SALSA: Existence, Uniqueness, and Effect of Initialization", SIAM Journal on Scientific Computing, vol. 27, no. 4, pp. 1181-1201, 2006.

[15] Eiron N., and McCurley K.S., "Analysis of Anchor Text for Web Search", Proc. 26th Annual International ACM SIGIR Conference on Research and Development in IR, pp. 459-460, 2003.

[16] Fisher D.H., "Knowledge Acquisition via Incremental Conceptual Clustering", Machine Learning, vol. 2, pp. 139-172, 1987.

[17] Beil F., Ester M., and Xu X., "Frequent Term-Based Text Clustering", Proc. 8th International Conference on Knowledge Discovery and Data Mining (KDD 2002), Canada, pp. 436-442, 2002.

[18] Han J., and Kamber M., "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2001.

[19] Henzinger M.R., and Bharat K., "Improved algorithms for topic distillation in a hyperlinked environment", Proc. 21st International ACM SIGIR Conference on Research and Development in IR, pp. 104-111, 1998.

[20] Kleinberg J.M., "Authoritative sources in a hyperlinked environment", Journal of the ACM, vol. 46, no. 5, pp. 604-632, 1999.

[21] Lempel R., and Moran S., "The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect", Proc. 9th International World Wide Web Conference, Amsterdam, Netherlands, pp. 387-401, 2000.

[22] Likas A., Vlassis N., and Verbeek J.J., "The Global K-means Clustering Algorithm", Pattern Recognition, vol. 36, no. 2, pp. 451-461, 2003.

[23] Melnik S., Raghavan S., Yang B., and Garcia-Molina H., "Building a Distributed Full-Text Index for the Web", ACM Transactions on Information Systems, vol. 19, no. 3, pp. 217-241, 2001.

[24] Moffat A., and Zobel J., "Self-Indexing Inverted Files for Fast Text Retrieval", presented in preliminary form at the 1994 Australasian Database Conference and at the 1994 IEEE Conference on Data Engineering, Feb. 1994.

[25] Stavrianou A., Andritsos P., and Nicoloyannis N., "Overview and semantic issues of text mining", ACM SIGMOD Record, vol. 36, no. 3, pp. 23-34, 2007.

[26] Szymanski B.K., and Chung M., "A Method for Indexing Web Pages Using Web Bots", Proc. International Conference on Info-Tech Info-Net (ICII 2001), Beijing, China, IEEE CS Press, pp. 1-6, 2001.

[27] Tan A., "Text Mining: Promises and Challenges", Proc. South East Asia Regional Computer Confederation (SEARCC'99), 1999.

[28] Tan A., "Text Mining: The state of the art and the challenges", Proc. PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases, pp. 65-70, 1999.

[29] Yang K., "Combining text- and link-based retrieval methods for Web IR", Proc. 10th Text REtrieval Conference, pp. 609-618, 2001.

[30] Zobel J., and Moffat A., "Inverted Files for Text Search Engines", ACM Computing Surveys, vol. 38, no. 2, pp. 1-56, 2006.

[31] Zobel J., Moffat A., and Sacks-Davis R., "Searching Large Lexicons for Partially Specified Terms using Compressed Inverted Files", Proc. 19th VLDB Conference, Dublin, Ireland, pp. 290-301, 1993.